Monday, May 30, 2016

Unicode Tutorial - Part 5: UTF-16

In Part 3 of this tutorial, we distinguished Unicode encoding forms (namely, UTF-8, UTF-16, and UTF-32) from Unicode encoding schemes (namely, UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE).

This post deals exclusively with UTF-16, both the encoding form and the encoding scheme.

Recall that UTF-16 is a variable-width encoding: every code unit is 16 bits long (hence the name UTF-16), but some code points are encoded as a single code unit while others are encoded as a sequence of two code units.

More precisely, each valid code point that fits within 16 bits, namely each one in the BMP, simply gets encoded as a single 16-bit code unit that is identical to the binary representation of the numerical code point itself. For example, the code point U+AB12 gets encoded as the single code unit 0xAB12.
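
As a quick illustration (a small Python check of my own, not something taken from the standard), the built-in big-endian UTF-16 codec confirms that a BMP code point is numerically identical to its single code unit:

    # Encode U+AB12 with Python's built-in big-endian UTF-16 codec (no BOM).
    code_point = 0xAB12
    encoded = chr(code_point).encode('utf-16-be')
    print(encoded.hex())                               # 'ab12'
    assert int.from_bytes(encoded, 'big') == code_point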

All supplementary characters (i.e., those in planes 1 through 16) have code points that do not fit within 16 bits. Each supplementary character gets encoded as a sequence of two 16-bit code units called a surrogate pair.

Each surrogate pair comprises a high (or leading) surrogate and a low (or trailing) surrogate. Since each surrogate is a 16-bit code unit, it must belong to the range of the BMP. Indeed, two specific sub-ranges of the BMP were reserved to encode surrogates, namely:
  • the range 0xD800 to 0xDBFF is reserved for high surrogates, and
  • the range 0xDC00 to 0xDFFF is reserved for low surrogates.

It is unfortunate that all of the values in the high-surrogate range are smaller than all of the values in the low-surrogate range. The "high" and "low" prefixes refer not to the values of the surrogates but to their position in the pair, namely <high surrogate, low surrogate>, where "high" and "low" mean "most significant" and "least significant", respectively.

Note that each one of the surrogate ranges contains exactly \(2^{10} =1,024\) code points. Therefore, the set of surrogate pairs can encode exactly \(2^{10}\cdot 2^{10} = 2^{20} = 1,048,576\) code points, which is precisely the number of supplementary characters (16 planes of 65,536 code points each together add up to \(2^4\cdot 2^{16} = 2^{20}\) code points).
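
The next part of this tutorial covers the encoding algorithm in detail, but as a preview, here is a minimal Python sketch of the usual surrogate-pair arithmetic: subtract 0x10000 from the code point, then spread the resulting 20 bits over the two 10-bit surrogate offsets.

    def to_surrogate_pair(cp: int) -> tuple[int, int]:
        """Split a supplementary code point (U+10000..U+10FFFF) into a surrogate pair."""
        if not 0x10000 <= cp <= 0x10FFFF:
            raise ValueError(f"U+{cp:06X} is not a supplementary code point")
        offset = cp - 0x10000                  # a 20-bit value in 0x00000..0xFFFFF
        high = 0xD800 + (offset >> 10)         # top 10 bits    -> 0xD800..0xDBFF
        low  = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits -> 0xDC00..0xDFFF
        return high, low

    print([hex(u) for u in to_surrogate_pair(0x10302)])   # ['0xd800', '0xdf02']
    print([hex(u) for u in to_surrogate_pair(0x10FFFF)])  # ['0xdbff', '0xdfff']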

Because Unicode is designed to avoid overlap between the sequences of code units of different characters, the code units between 0xD800 and 0xDFFF (i.e., those reserved for either half of a surrogate pair) are NOT allowed, on their own, to represent any characters.

As a result, since 2,048 code points are reserved for surrogate values, the actual number of available code points in the range 0x000000-0x10FFFF is not 1,114,112 but rather 1,114,112 - 2,048 = 1,112,064. Of course, since all Unicode encodings cover exactly the same set of characters, the code points between 0xD800 and 0xDFFF are also forbidden (on their own) in UTF-8 and UTF-32.

Because of the non-overlap constraint imposed on distinct code unit sequences, a well-formed UTF-16-encoded text may not start with a low surrogate. In fact, if any low surrogate appears in UTF-16-encoded text, it must be preceded by a high surrogate. In other words, a low surrogate may not be preceded by another low surrogate or by any non-surrogate code unit, nor may it be the first code unit in the text.

Similarly, in well-formed UTF-16-encoded text, any high surrogate must be followed by a low surrogate. It may not appear at the end of the encoded text, nor be followed by another high surrogate or by a non-surrogate code unit.
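
These two well-formedness rules are easy to check mechanically. Here is a small Python sketch (my own helper, not part of the standard) that scans a sequence of 16-bit code units and rejects any unpaired surrogate:

    def is_well_formed_utf16(code_units) -> bool:
        """Return True if every surrogate in the sequence is properly paired."""
        i = 0
        while i < len(code_units):
            unit = code_units[i]
            if 0xD800 <= unit <= 0xDBFF:      # high surrogate: must be followed by a low one
                if i + 1 >= len(code_units) or not 0xDC00 <= code_units[i + 1] <= 0xDFFF:
                    return False
                i += 2                        # consume the whole surrogate pair
            elif 0xDC00 <= unit <= 0xDFFF:    # low surrogate not preceded by a high one
                return False
            else:
                i += 1                        # ordinary BMP code unit
        return True

    print(is_well_formed_utf16([0x0041, 0xD800, 0xDF02]))  # True
    print(is_well_formed_utf16([0xDF02, 0x0041]))          # False (starts with a low surrogate)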

For detailed discussions of the UTF-16 encoding and decoding algorithms, see the following two parts of this tutorial.

Note that our discussion so far has been limited to the UTF-16 encoding form that assigns a sequence of one or more 16-bit code units to each code point.

We now turn our attention to the UTF-16 encoding schemes, which produce byte sequences from code unit sequences (see Part 3 of this tutorial for a detailed discussion of this distinction).

Each UTF-16 encoding scheme specifies a byte order for the two bytes within each code unit, while the order of the code units themselves within their sequence is not affected.
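
As a sketch (again my own Python, assuming code units are given as plain integers), the encoding-scheme step boils down to serializing each 16-bit code unit in the chosen byte order; the BOM, when present, is just one more code unit (U+FEFF) prepended to the sequence:

    def serialize_utf16(code_units, byte_order='be', add_bom=False) -> bytes:
        """Serialize 16-bit code units to bytes; only the bytes inside each unit are reordered."""
        units = ([0xFEFF] if add_bom else []) + list(code_units)
        order = 'big' if byte_order == 'be' else 'little'
        return b''.join(unit.to_bytes(2, order) for unit in units)

    # U+10302 -> code units 0xD800 0xDF02 (see the table below)
    print(serialize_utf16([0xD800, 0xDF02], 'be').hex(' '))                # d8 00 df 02
    print(serialize_utf16([0xD800, 0xDF02], 'le').hex(' '))                # 00 d8 02 df
    print(serialize_utf16([0xD800, 0xDF02], 'le', add_bom=True).hex(' '))  # ff fe 00 d8 02 df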

The following examples illustrate the outcome of the full encoding process under the three encoding schemes UTF-16, UTF-16BE, and UTF-16LE.

Code point  UTF-16 code unit sequence  UTF-16BE hex bytes  UTF-16LE hex bytes  UTF-16 hex bytes
U+0041      0x0041                     00 41               41 00               00 41, FE FF 00 41, or FF FE 41 00
U+D81A      Invalid code point (falls in the high-surrogate range)
U+DE83      Invalid code point (falls in the low-surrogate range)
U+E012      0xE012                     E0 12               12 E0               E0 12, FE FF E0 12, or FF FE 12 E0
U+010302    0xD800 0xDF02              D8 00 DF 02         00 D8 02 DF         D8 00 DF 02, FE FF D8 00 DF 02, or FF FE 00 D8 02 DF
U+10FDCB    0xDBFF 0xDDCB              DB FF DD CB         FF DB CB DD         DB FF DD CB, FE FF DB FF DD CB, or FF FE FF DB CB DD
U+10FFFF    0xDBFF 0xDFFF              DB FF DF FF         FF DB FF DF         DB FF DF FF, FE FF DB FF DF FF, or FF FE FF DB FF DF
U+12345A    Invalid code point (larger than 0x10FFFF)
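
You can cross-check a few rows of this table with Python's built-in codecs (note that the plain utf-16 codec writes a BOM followed by the platform's native byte order, so the little-endian output below is what you will see on most machines):

    print('\u0041'.encode('utf-16-be').hex(' '))      # 00 41
    print('\U00010302'.encode('utf-16-be').hex(' '))  # d8 00 df 02
    print('\U00010302'.encode('utf-16-le').hex(' '))  # 00 d8 02 df
    print('\U00010302'.encode('utf-16').hex(' '))     # ff fe 00 d8 02 df on a little-endian machine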

If you need to review the byte order mark (BOM), namely the byte sequences 0xFEFF and 0xFFFE used in the UTF-16 encoding scheme above, refer to this earlier part of the tutorial.

For complete details on the algorithm used to encode a code point into a sequence of code units using the UTF-16 encoding form, refer to the next part of this tutorial.


This blog post is based on Sections 2.5, 3.8, 3.9 and 5.4 of The Unicode® Standard Version 8.0 – Core Specification (August 2015).
