Unicode
Unicode is a universal character encoding standard whose code space is divided into [see page 17, 17 planes]. The first 128 code points of Unicode (U+0000 to U+007F) are identical to ASCII, so any valid ASCII file is also a valid UTF-8 encoded Unicode file.
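A minimal Python sketch of the two claims above, assuming each plane holds 65,536 (0x10000) code points. The variable names are illustrative only.

```python
# Sketch: size of the Unicode code space and ASCII compatibility.
PLANES = 17
CODE_POINTS_PER_PLANE = 0x10000  # assumption: 65,536 code points per plane

code_space_size = PLANES * CODE_POINTS_PER_PLANE
max_code_point = code_space_size - 1  # highest code point in the code space

# Text containing only ASCII characters produces identical bytes
# whether it is encoded as ASCII or as UTF-8.
ascii_text = "Hello, world!"
ascii_bytes = ascii_text.encode("ascii")
utf8_bytes = ascii_text.encode("utf-8")

print(hex(max_code_point))
print(ascii_bytes == utf8_bytes)
```

Note that the byte-for-byte compatibility holds for UTF-8 specifically; UTF-16 and UTF-32 use wider code units for the same characters.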
Design Principles
Unicode has 10 [see page 5, design principles].
Principle | Statement |
---|---|
[see page 6, Universality] | Provides a single character set covering characters for worldwide use. |
[see page 7, Efficiency] | Allows efficient implementations (no shift states). |
[see page 8, Chars, not Glyphs] | Unicode encodes characters, not glyphs. |
[see page 10, Semantics] | Character property tables are provided for use in parsing and sorting. |
[see page 10, Plain Text] | Plain text is a pure sequence of character codes; styling must be supplied separately. |
[see page 11, Logical Order] | Order of stored text matches the order of keyboard input. Display order may be different. |
[see page 11, Unification] | Avoids duplicating the same character across languages (e.g. punctuation such as the full stop). |
[see page 12, Dynamic Composition] | Separate codes (e.g. base characters and accents) can be composed. |
[see page 12, Stability] | Certain parts of the spec are guaranteed stable between versions for compatibility. |
[see page 12, Convertibility] | Characters can be converted between Unicode and other standards (e.g. ASCII). |
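As a small illustration of the Semantics and Dynamic Composition rows, the sketch below uses Python's standard `unicodedata` module. Using NFC normalization to compare the composed and precomposed forms is an assumption made for this example, not something the principles table prescribes.

```python
import unicodedata

# Dynamic composition: a base letter followed by a combining accent.
decomposed = "e\u0301"   # 'e' + COMBINING ACUTE ACCENT (two code points)
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE (one code point)

# The two strings differ as code point sequences but are canonically
# equivalent; NFC normalization composes the pair into the single form.
composed = unicodedata.normalize("NFC", decomposed)

# Semantics: character properties come from the property tables.
letter_category = unicodedata.category("A")        # uppercase letter
accent_category = unicodedata.category("\u0301")   # non-spacing mark

print(len(decomposed), len(precomposed), composed == precomposed)
print(letter_category, accent_category)
```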
Encoding Model
The encoding model for unicode consists of:
- An abstract character repertoire: the comprehensive list of characters in the standard.
- Their mapping to integers (code points). An abstract character is assigned to a code point and is then treated as an encoded character.
- Their encoding forms, i.e. how each code point is represented as a sequence of code units, which is ultimately serialised as bytes in a file.
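The three layers can be sketched in Python for a single character. The choice of the euro sign and of the big-endian codecs (to avoid a byte order mark in the output) are assumptions made for illustration.

```python
import unicodedata

ch = "\u20ac"  # the euro sign, chosen arbitrarily for this sketch

# Layer 1: the abstract character, identified here by its name.
name = unicodedata.name(ch)

# Layer 2: the code point (an integer) assigned to that character.
code_point = ord(ch)

# Layer 3: the same code point under different encoding forms,
# serialised to bytes (big-endian, no byte order mark).
utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")
utf32 = ch.encode("utf-32-be")

print(name, f"U+{code_point:04X}")
print(utf8.hex(), utf16.hex(), utf32.hex())
```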
After defining a mapping from characters to integers, we further [see page 21, define]:
For [see page 26, UTF-8], between 1 and 4 bytes are used per code point. The leading bits of the first byte indicate how long the sequence is, and together the sequences can represent every Unicode code point.
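A hand-rolled sketch of that scheme is shown below. The function name `utf8_encode` is hypothetical; in practice Python's built-in `str.encode("utf-8")` does this work, and it is used here only to cross-check the sketch.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 (illustrative sketch only)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:
        # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# Cross-check against the built-in codec for one sample of each length.
for sample in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    encoded = utf8_encode(ord(sample))
    print(f"U+{ord(sample):04X}", len(encoded), encoded == sample.encode("utf-8"))
```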
Surrogate Pairs
UTF-16 uses [see page 24, surrogate pairs] to represent code points above U+FFFF, which do not fit in a single 16-bit value. The two 16-bit values in a surrogate pair together represent a single character, with:
- The first value in the pair being the high surrogate (U+D800 to U+DBFF).
- The last value in the pair being the low surrogate (U+DC00 to U+DFFF).
Surrogates are defined in the range \(U+D800\) to \(U+DFFF\), so these 2048 code points cannot be assigned to characters by Unicode.
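The arithmetic behind a surrogate pair can be sketched as follows, assuming the UTF-16 convention that the pair encodes a code point in the range U+10000 to U+10FFFF. The helper names `to_surrogate_pair` and `from_surrogate_pair` are hypothetical.

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need a pair")
    offset = cp - 0x10000              # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")

# Number of code points reserved for surrogates.
reserved = 0xDFFF - 0xD800 + 1
print(reserved)
```

Ten bits from each surrogate give 2^20 combinations, which is exactly the number of code points in the 16 planes above the first.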