Unicode

Tags: text-processing comp-sci

Unicode is a universal character encoding standard whose code space is divided into [see page 17, 17 planes] of 65,536 code points each. The first 128 code points of Unicode (U+0000 to U+007F) are the same as ASCII, so any valid ASCII file is also a valid UTF-8 file.
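As a rough illustration of the plane layout and the ASCII overlap, here is a minimal Python sketch. The helper name `plane_of` is made up for these notes and is not part of any library.

```python
def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) that a code point falls in."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    # Each plane holds 0x10000 (65,536) code points, so the plane
    # number is simply the code point with its low 16 bits dropped.
    return code_point >> 16


# ASCII compatibility: bytes produced by the ASCII codec are
# byte-for-byte identical to the UTF-8 encoding of the same text.
ascii_bytes = "Hello".encode("ascii")
utf8_bytes = "Hello".encode("utf-8")

print(plane_of(ord("A")))         # plane 0, the Basic Multilingual Plane
print(plane_of(0x1F600))          # a supplementary plane
print(ascii_bytes == utf8_bytes)
```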

Design Principles

Unicode has 10 [see page 5, design principles].

Principle	Statement
[see page 6, Universality]	Provides a single character set that covers all characters for worldwide use.
[see page 7, Efficiency]	Allows efficient implementations (no shift states).
[see page 8, Characters, not Glyphs]	Unicode encodes characters, not glyphs.
[see page 10, Semantics]	Character property tables are provided for use in parsing and sorting.
[see page 10, Plain Text]	Plain text is a pure sequence of character codes; styling must be supplied separately.
[see page 11, Logical Order]	Order of stored text matches the order of keyboard input. Display order may be different.
[see page 11, Unification]	Avoids duplicating the same character across languages (e.g. punctuation such as the full stop).
[see page 12, Dynamic Composition]	Separate codes (e.g. base characters and accents) can be composed.
[see page 12, Stability]	Certain parts of the spec are guaranteed stable between versions for compatibility.
[see page 12, Convertibility]	Characters can be converted between Unicode and other standards (e.g. ASCII).
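The Dynamic Composition and Semantics rows can be tried out with Python's standard `unicodedata` module. This is only a small sketch: the choice of "é" as the sample character and the use of NFC/NFD normalisation to show composition are assumptions made for these notes, not something taken from the standard's text.

```python
import unicodedata

decomposed = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT
precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE

# Both render as the same character, but they are different
# code point sequences until they are normalised.
print(decomposed == precomposed)

# NFC composes base + accent into one code point; NFD splits it again.
composed = unicodedata.normalize("NFC", decomposed)
split_again = unicodedata.normalize("NFD", precomposed)

# Character properties (the "Semantics" principle) are exposed as lookups.
accent_name = unicodedata.name("\u0301")
accent_category = unicodedata.category("\u0301")  # general category code
letter_category = unicodedata.category("A")
```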

Encoding Model

The encoding model for Unicode consists of:

An abstract character repertoire. The comprehensive list of characters in the standard.
Their mapping to integers (code points). An abstract character is assigned to a code point and is then treated as an encoded character.
Their encoding forms and schemes: how code points are represented as code units and then serialised as the bytes stored in a file.

After defining a mapping from characters to integers, we further [see page 21, define]:

- **Character Encoding Form** - a mapping from a set of integers to a set of sequences of code units of specified width (e.g. 8-bit bytes).
- **Character Encoding Scheme** - a mapping from a set of sequences of code units to a serialized sequence of bytes.
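To make the difference between the two definitions above more concrete, the sketch below encodes one code point with Python's built-in codecs and splits the result back into code units. The helper `code_units` and the sample code point U+1F600 are assumptions chosen for illustration.

```python
def code_units(data: bytes, width: int, byteorder: str = "big") -> list:
    """Split serialised bytes back into integer code units of `width` bytes."""
    return [
        int.from_bytes(data[i:i + width], byteorder)
        for i in range(0, len(data), width)
    ]


ch = "\U0001F600"  # one code point, U+1F600

# Encoding form: the same code point becomes a different number of
# code units depending on the code unit width.
utf8 = ch.encode("utf-8")        # 8-bit code units
utf16_be = ch.encode("utf-16-be")  # 16-bit code units
utf32_be = ch.encode("utf-32-be")  # 32-bit code units

# Encoding scheme: the same 16-bit code units can be serialised
# in either byte order, giving different byte sequences.
utf16_le = ch.encode("utf-16-le")

print([hex(u) for u in code_units(utf8, 1)])
print([hex(u) for u in code_units(utf16_be, 2)])
print([hex(u) for u in code_units(utf32_be, 4)])
print(utf16_be.hex(), utf16_le.hex())
```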

In [see page 26, UTF-8], each Unicode code point is encoded as 1 to 4 bytes; the leading bits of the first byte indicate how long the sequence is.
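A hand-written encoder shows how the leading bits select the sequence length. This is a sketch for study purposes (the function name `utf8_encode` is made up here); in practice Python's built-in `str.encode("utf-8")` does the same job.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single code point as UTF-8 (illustrative, not optimised)."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode scalar value")
    if cp < 0x80:
        # 0xxxxxxx: 1 byte, identical to ASCII
        return bytes([cp])
    if cp < 0x800:
        # 110xxxxx 10xxxxxx: 2 bytes
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx: 3 bytes
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: 4 bytes
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# Compare against the built-in codec for one sample per length.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    print(hex(cp), utf8_encode(cp).hex(), utf8_encode(cp) == chr(cp).encode("utf-8"))
```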

Surrogate Pairs

UTF-16 uses [see page 24, surrogate pairs] to represent code points above \(U+FFFF\), which do not fit in a single 16-bit code unit. The two 16-bit values of a surrogate pair together represent a single character, with:

The first value in the pair being the high surrogate (\(U+D800\) to \(U+DBFF\))
The second value in the pair being the low surrogate (\(U+DC00\) to \(U+DFFF\))

Surrogates occupy the range \(U+D800\) to \(U+DFFF\), so these 2048 code points cannot be assigned to characters by Unicode.
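The arithmetic behind a surrogate pair is short enough to sketch directly. The function names `to_surrogates` and `from_surrogates` are made up for these notes; the offsets and ranges follow the UTF-16 definition.

```python
def to_surrogates(cp: int) -> tuple:
    """Split a supplementary code point into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = cp - 0x10000              # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Combine a (high, low) surrogate pair back into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x1F600)
print(hex(high), hex(low))
print(hex(from_surrogates(high, low)))

# Size of the reserved surrogate range.
print(0xDFFF - 0xD800 + 1)
```

Each half of the pair carries 10 bits, so the 1024 high surrogates combined with the 1024 low surrogates cover exactly the 16 supplementary planes (1024 × 1024 = 16 × 65,536 code points).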