|
| 1 | +# Character sets |
| 2 | + |
| 3 | +A character set defines three things: |
| 4 | + |
| 5 | +1. a set of glyphs |
| 6 | +2. an integer representation of each glyph |
| 7 | +3. a way of encoding those integers as bytes |
| 8 | + |
| 9 | +ASCII is a simple character set. It: |
| 10 | + |
| 11 | +* defines a set of 128 characters (95 printable glyphs plus control characters) |
| 12 | +* defines the integer (0 to 127) that represents each character |
| 13 | +* defines how to represent that integer in bytes (one byte per character) |
| 14 | + |
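The three layers can be seen in a short Python sketch. It is illustrative only: the sample characters are arbitrary, and Python's built-in `ascii` codec stands in for the character set.

```python
# Illustrative sketch: the three layers of a character set, using ASCII.
glyph = "A"

# Layer 2: the integer that represents the glyph.
code = ord(glyph)

# Layer 3: the bytes that encode that integer (ASCII uses a single byte).
encoded = glyph.encode("ascii")

print(code)           # 65
print(list(encoded))  # [65]

# A character outside the set (here U+00E9, e with acute) cannot be encoded.
try:
    "\u00e9".encode("ascii")
except UnicodeEncodeError:
    outside_ascii = True
else:
    outside_ascii = False

print(outside_ascii)  # True
```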
| 15 | +### Collation |
| 16 | + |
| 17 | +* A character set needs rules for comparing the characters in the set, |
| 18 | + e.g. for sorting and for string comparisons (less than, equal, |
| 19 | + greater than). |
| 20 | +* This set of rules is called a collation |
| 21 | +* A character set always has at least one collation |
| 22 | +* A character set may have more than one collation |
| 23 | +* => A collation is a property of a character set and has no meaning outside |
| 24 | + that character set. |
| 25 | + |
| 26 | +Examples of collations |
| 27 | + |
| 28 | +* binary |
| 29 | + * compare characters based on the integers they are encoded as |
| 30 | + * very simplistic |
| 31 | +* case-insensitive |
| 32 | + * treat upper and lower case letters as equal |
| 33 | +* case-sensitive (treat upper and lower case letters as different) |
| 34 | + |
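As a rough illustration of how these collations differ, the sketch below sorts the same words two ways in Python. It makes two assumptions: Python's default string comparison (by code point) stands in for a binary collation, and `str.casefold` stands in for a case-insensitive one. Real database collations are more involved.

```python
# Illustrative sketch: the same words under two different comparison rules.
words = ["banana", "Zebra", "apple"]

# Binary-style: compare by code point, so "Z" (90) sorts before "a" (97).
binary_order = sorted(words)

# Case-insensitive: fold case before comparing.
ci_order = sorted(words, key=str.casefold)

print(binary_order)  # ['Zebra', 'apple', 'banana']
print(ci_order)      # ['apple', 'banana', 'Zebra']

# Equality also depends on the rules in force.
binary_equal = "Apple" == "apple"
ci_equal = "Apple".casefold() == "apple".casefold()
print(binary_equal, ci_equal)  # False True
```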
| 35 | +Common collation names follow a pattern (this is the MySQL naming convention): |
| 36 | + |
| 37 | + <character set name>_<language_name>_<suffix> |
| 38 | + |
| 39 | +where `<suffix>` can be: |
| 40 | + |
| 41 | + cs => case sensitive |
| 42 | + ci => case insensitive |
| 43 | + bin => binary |
| 44 | + |
| 45 | +* A collation is tied to a character set encoding |
| 46 | + * the same collation name may exist in multiple character sets |
| 47 | +* In PostgreSQL there are 3 collations available on all platforms |
| 48 | + 1. `default` |
| 49 | + 2. `C` |
| 50 | + 3. `POSIX` |
| 51 | +* C and POSIX have identical behaviours - they both specify "traditional C" behaviour |
| 52 | + * only the ASCII letters A to Z are considered "letters" |
| 53 | + * sorting is done strictly by character code byte value |
| 54 | + * PostgreSQL treats them as different collations so you can't directly compare a |
| 55 | + column with the "C" collation and a column with the "POSIX" collation. |
| 56 | + |
| 57 | +# Unicode includes a character set, collations and encodings |
| 58 | + |
| 59 | +* Unicode includes the "Universal character set standard" |
| 60 | + * includes 120k+ characters covering almost all the world's languages |
| 61 | +* Unicode v8.0 was released June 2015 |
| 62 | +* Unicode assigns each character a "code point" which is a non-negative integer (U+0000 to U+10FFFF) |
| 63 | +* Each code point needs to be represented as a binary number - _how_ that |
| 64 | + happens is specified by the "encoding" |
| 65 | +* Unicode has a number of possible encodings e.g. UTF-8, UTF-16 |
| 66 | +* UTF-8 is an _encoding_ i.e. a way of turning a list of code point integers into bytes |
| 67 | +* UTF-8 uses 1 byte for a code point if it can and 2-4 bytes if necessary |
| 68 | + * => it is a variable length encoding: 1 to 4 bytes |
| 69 | + * It is a superset of ASCII: any ASCII text is also valid UTF-8 with the same bytes |
| 70 | + |
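The variable-length behaviour is easy to check with Python's built-in `utf-8` codec. This is a minimal sketch; the four sample characters are arbitrary picks, one from each byte-length range.

```python
# Illustrative sketch: UTF-8 uses 1 to 4 bytes per code point.
samples = [
    "A",           # U+0041, in the ASCII range
    "\u00e9",      # U+00E9, e with acute
    "\u20ac",      # U+20AC, euro sign
    "\U0001F600",  # U+1F600, an emoji outside the Basic Multilingual Plane
]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex()}")

lengths = [len(ch.encode("utf-8")) for ch in samples]
print(lengths)  # [1, 2, 3, 4]

# ASCII text produces identical bytes under both encodings.
same_bytes = "hello".encode("utf-8") == "hello".encode("ascii")
print(same_bytes)  # True
```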
| 71 | +* Unicode defines a _customisable_ collation algorithm |
| 72 | + * assigns each character one or more collation elements, each a tuple of integer weights |
| 73 | + * it has a table with a default collation for _all_ unicode chars |
| 74 | + * the Default Unicode Collation Element Table (DUCET). |
| 75 | + * customizable for different languages |
| 76 | + |
| 77 | +# Examples of character sets |
| 78 | + |
| 79 | +* US-ASCII |
| 80 | +* ISO-8859-1 aka latin1 |
| 81 | +* Unicode |
| 82 | + |
| 83 | +Others are listed in the IANA character set registry: http://www.iana.org/assignments/character-sets/character-sets.xhtml |