Commit 3561889

Updates
1 parent cb3b7d9 commit 3561889

10 files changed, +353 -238 lines changed
File renamed without changes.

localization.md

-238
This file was deleted.
+83
@@ -0,0 +1,83 @@
# Character sets

A character set defines 3 things:

1. a set of glyphs
2. an integer representation of each glyph
3. a way of encoding those integers as bytes

ASCII is a simple character set. It:

* defines a set of 128 characters (printable glyphs plus control codes)
* assigns each of them an integer from 0 to 127
* stores each integer in a single byte

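The three layers can be seen directly with Python's built-in `ascii` codec. This is a minimal sketch; the sample string is arbitrary and chosen only for illustration.

```python
# The three layers of a character set, shown with ASCII.
text = "Hi!"

# Layer 2: each character maps to an integer.
codes = [ord(ch) for ch in text]

# Layer 3: each integer is encoded as bytes; ASCII uses one byte per character.
encoded = text.encode("ascii")

print(codes)          # [72, 105, 33]
print(list(encoded))  # [72, 105, 33]

# A character outside the set has no integer in it, so it cannot be encoded.
try:
    "\u00e9".encode("ascii")  # U+00E9, e with acute accent
except UnicodeEncodeError as exc:
    print("not in ASCII:", exc.object[exc.start:exc.end])
```
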
### Collation

* A character set needs rules for comparing its characters, e.g. for
  sorting and for string comparisons (less than, equal, greater than).
* This set of rules is called a collation
* A character set always has at least one collation
* A character set may have more than one collation
* => A collation is a property of a character set and has no meaning outside
  that character set.

Examples of collations

* binary
    * compare characters based on the integers they are encoded as
    * very simplistic
* case-insensitive
    * treat upper and lower case letters as equal
* case-sensitive

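The difference between a binary and a case-insensitive collation shows up as soon as a list mixes upper and lower case. The sketch below imitates both in Python; the word list and the helper `equal_ci` are made up for illustration and are not part of any database API.

```python
words = ["banana", "Cherry", "apple"]

# Binary collation: compare by the integer code of each character.
# "C" (67) sorts before "a" (97), so "Cherry" comes first.
binary_order = sorted(words)

# Case-insensitive collation: fold case before comparing.
ci_order = sorted(words, key=str.casefold)

def equal_ci(a: str, b: str) -> bool:
    """Equality under a case-insensitive collation (hypothetical helper)."""
    return a.casefold() == b.casefold()

print(binary_order)  # ['Cherry', 'apple', 'banana']
print(ci_order)      # ['apple', 'banana', 'Cherry']
```

Under the binary rules `"Apple" == "apple"` is false, while `equal_ci("Apple", "apple")` is true, so the choice of collation changes equality as well as ordering.
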
Common collation names follow a pattern:

    <character set name>_<language name>_<suffix>

where `<suffix>` can be:

    cs  => case sensitive
    ci  => case insensitive
    bin => binary

For example, MySQL's `latin1_swedish_ci` is a case-insensitive collation for the latin1 character set.

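To make the naming pattern concrete, here is a small sketch that splits a name into its parts. The function `parse_collation_name` is hypothetical, and the sample names are MySQL-style names used as an assumption, since these notes do not say which database the pattern comes from.

```python
SUFFIXES = {"cs": "case sensitive", "ci": "case insensitive", "bin": "binary"}

def parse_collation_name(name: str) -> dict:
    """Split a name such as 'latin1_swedish_ci' into its parts (hypothetical helper)."""
    # Everything before the first underscore is the character set name.
    charset, _, rest = name.partition("_")
    # Everything after the last underscore is the suffix; the middle is the language.
    language, _, suffix = rest.rpartition("_")
    if suffix not in SUFFIXES:
        raise ValueError(f"unrecognised suffix in {name!r}")
    return {
        "charset": charset,
        "language": language or None,  # e.g. 'utf8_bin' has no language part
        "comparison": SUFFIXES[suffix],
    }

print(parse_collation_name("latin1_swedish_ci"))
print(parse_collation_name("utf8_bin"))
```
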
* A collation is tied to a character set encoding
    * the same collation name may exist in multiple character sets
* PostgreSQL provides 3 collations on all platforms
    1. `default`
    2. `C`
    3. `POSIX`
* `C` and `POSIX` have identical behaviour: both specify "traditional C" behaviour
    * only the ASCII letters A to Z are treated as letters
    * sorting is done strictly by character code byte value
* PostgreSQL still considers them different collations, so you can't compare a
  column with the `C` collation to a column with the `POSIX` collation
  unless the expression names a collation explicitly with `COLLATE`.

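The "traditional C" rules can be approximated outside the database. The sketch below is a Python imitation, not PostgreSQL's implementation: it sorts by encoded byte values and treats only ASCII letters as letters. The word list and the helper `is_letter_c` are assumptions made for illustration.

```python
words = ["apple", "Zebra", "\u00e9clair", "banana"]  # "\u00e9" is e with acute accent

# Sort strictly by the byte values of the UTF-8 encoded strings.
c_order = sorted(words, key=lambda s: s.encode("utf-8"))

def is_letter_c(ch: str) -> bool:
    """True only for ASCII letters, mimicking the C/POSIX rule (hypothetical helper)."""
    return ch.isascii() and ch.isalpha()

print(c_order)
print([is_letter_c(ch) for ch in "aZ\u00e9"])  # [True, True, False]
```

Every uppercase ASCII letter sorts before every lowercase one, and accented letters sort after both, because their UTF-8 bytes have larger values.
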
# Unicode includes a character set, collations and encodings

* Unicode includes the "Universal character set standard"
    * includes 120k+ characters covering almost all of the world's languages
* Unicode v8.0 was released in June 2015
* Unicode assigns each glyph a "code point", which is a non-negative integer
* Each code point needs to be represented as a sequence of bytes - _how_ that
  happens is specified by the "encoding"
* Unicode has a number of possible encodings e.g. UTF-8, UTF-16
* UTF-8 is an _encoding_ i.e. a way of turning a list of code point integers into bytes
    * UTF-8 uses 1 byte for a code point if it can and 2-4 bytes if necessary
        * => it is a variable length encoding: 1 to 4 bytes
    * It is a superset of ASCII: any ASCII text is already valid UTF-8

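The variable-length behaviour of UTF-8 is easy to check with Python's built-in codecs. This is a small sketch; the four sample characters are arbitrary picks, one for each encoded length.

```python
# One sample per encoded length: A, e-acute, euro sign, grinning face emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

utf8_lengths = []
for ch in samples:
    encoded = ch.encode("utf-8")
    utf8_lengths.append(len(encoded))
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex()}")

# ASCII text produces identical bytes under both encodings.
same_as_ascii = "hello".encode("utf-8") == "hello".encode("ascii")

# UTF-16 is a different encoding of the same code points: 2 or 4 bytes each.
utf16_lengths = [len(ch.encode("utf-16-le")) for ch in samples]

print(utf8_lengths)   # [1, 2, 3, 4]
print(utf16_lengths)  # [2, 2, 2, 4]
```
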
* Unicode defines a _customisable_ collation algorithm
    * assigns a tuple of integer weights (primary, secondary, tertiary) to each glyph
    * it has a table with a default collation for _all_ Unicode characters
        * the Default Unicode Collation Element Table (DUCET).
    * customisable for different languages

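The idea of multi-level weights and per-language customisation can be sketched with a toy table. Everything here is an assumption for illustration: the weights in `TOY_WEIGHTS` are invented and are not taken from DUCET, and `sort_key` is a simplified stand-in for the real algorithm, which handles far more cases.

```python
A_ACUTE = "\u00e1"  # a with acute accent

# Invented (primary, secondary, tertiary) weights: base letter, accent, case.
TOY_WEIGHTS = {
    "a": (1, 1, 1),
    "A": (1, 1, 2),
    A_ACUTE: (1, 2, 1),
    "b": (2, 1, 1),
}

def sort_key(text: str, table: dict) -> tuple:
    """Compare all primary weights first, then all secondary, then all tertiary."""
    elements = [table[ch] for ch in text]
    return tuple(tuple(e[level] for e in elements) for level in range(3))

words = ["b", A_ACUTE + "b", "Ab", "ab"]

default_order = sorted(words, key=lambda w: sort_key(w, TOY_WEIGHTS))

# A "tailoring" overrides some entries, here to put uppercase before lowercase.
tailored = {**TOY_WEIGHTS, "A": (1, 1, 1), "a": (1, 1, 2)}
tailored_order = sorted(words, key=lambda w: sort_key(w, tailored))

print(default_order)   # ab, Ab, a-acute + b, b
print(tailored_order)  # Ab, ab, a-acute + b, b
```

A binary sort of the same list would place the accented word last, after `b`, because its code point is larger. The weight levels let the base letter decide first and leave accents and case as tie-breakers.
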
# Examples of character sets

* US-ASCII
* ISO-8859-1 aka latin1
* Unicode

Others: http://www.iana.org/assignments/character-sets/character-sets.xhtml
