Monday, May 30, 2016

Unicode Tutorial - Part 5: UTF-16

In Part 3 of this tutorial, we distinguished Unicode encoding forms (namely, UTF-8, UTF-16, and UTF-32) from Unicode encoding schemes (namely, UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE).

This post deals exclusively with UTF-16, both the encoding form and the encoding scheme.

Recall that UTF-16 is a variable-width encoding: every code unit is 16 bits long (hence the name UTF-16), but some code points are encoded as a single code unit while others are encoded as a sequence of two code units.

More precisely, each valid code point that fits within 16 bits, namely each one in the BMP, simply gets encoded as a single 16-bit code unit that is identical to the binary representation of the numerical code point itself. For example, the code point U+AB12 gets encoded as the single code unit 0xAB12.
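
As a quick illustration (a small Python check of my own, not something taken from the standard), the built-in big-endian UTF-16 codec confirms that a BMP code point is numerically identical to its single code unit:

    # Encode U+AB12 with Python's built-in big-endian UTF-16 codec (no BOM).
    code_point = 0xAB12
    encoded = chr(code_point).encode('utf-16-be')
    print(encoded.hex())                               # 'ab12'
    assert int.from_bytes(encoded, 'big') == code_point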

All supplementary characters (i.e., those in planes 1 through 16) have code points that do not fit within 16 bits. Each supplementary character gets encoded as a sequence of two 16-bit code units called a surrogate pair.

Each surrogate pair comprises a high (or leading) surrogate and a low (or trailing) surrogate. Since each surrogate is a 16-bit code unit, it must belong to the range of the BMP. Indeed, two specific sub-ranges of the BMP were reserved to encode surrogates, namely:
  • the range 0xD800 to 0xDBFF is reserved for high surrogates, and
  • the range 0xDC00 to 0xDFFF is reserved for low surrogates.

It is unfortunate that all of the values in the high-surrogate range are smaller than all of the values in the low-surrogate range. The "high" and "low" prefixes refer not to the values of the surrogates but to their position in the pair, namely <high surrogate, low surrogate>, where "high" and "low" mean "most significant" and "least significant", respectively.

Note that each one of the surrogate ranges contains exactly \(2^{10} =1,024\) code points. Therefore, the set of surrogate pairs can encode exactly \(2^{10}\cdot 2^{10} = 2^{20} = 1,048,576\) code points, which is precisely the number of supplementary characters (16 planes of 65,536 code points each together add up to \(2^4\cdot 2^{16} = 2^{20}\) code points).
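
The next part of this tutorial covers the encoding algorithm in detail, but as a preview, here is a minimal Python sketch of the usual surrogate-pair arithmetic: subtract 0x10000 from the code point, then spread the resulting 20 bits over the two 10-bit surrogate offsets.

    def to_surrogate_pair(cp: int) -> tuple[int, int]:
        """Split a supplementary code point (U+10000..U+10FFFF) into a surrogate pair."""
        if not 0x10000 <= cp <= 0x10FFFF:
            raise ValueError(f"U+{cp:06X} is not a supplementary code point")
        offset = cp - 0x10000                  # a 20-bit value in 0x00000..0xFFFFF
        high = 0xD800 + (offset >> 10)         # top 10 bits    -> 0xD800..0xDBFF
        low  = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits -> 0xDC00..0xDFFF
        return high, low

    print([hex(u) for u in to_surrogate_pair(0x10302)])   # ['0xd800', '0xdf02']
    print([hex(u) for u in to_surrogate_pair(0x10FFFF)])  # ['0xdbff', '0xdfff']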

Because Unicode is designed to avoid overlap between the sequences of code units of different characters, the code units between 0xD800 and 0xDFFF (i.e., those reserved for either half of a surrogate pair) are NOT allowed, on their own, to represent any characters.

As a result, since 2,048 code points are reserved for surrogate values, the actual number of available code points in the range 0x000000-0x10FFFF is not 1,114,112 but rather 1,114,112 - 2,048 = 1,112,064. Of course, since all Unicode encodings cover exactly the same set of characters, the code points between 0xD800 and 0xDFFF are also forbidden (on their own) in UTF-8 and UTF-32.

Because of the non-overlap constraint imposed on distinct code unit sequences, a well-formed UTF-16-encoded text may not start with a low surrogate. In fact, if any low surrogate appears in UTF-16-encoded text, it must be preceded by a high surrogate. In other words, a low surrogate may not be preceded by another low surrogate or by any non-surrogate code unit, nor may it be the first code unit in the text.

Similarly, in well-formed UTF-16-encoded text, any high surrogate must be followed by a low surrogate. It may not appear at the end of the encoded text, nor be followed by another high surrogate or by a non-surrogate code unit.
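
These two well-formedness rules are easy to check mechanically. Here is a small Python sketch (my own helper, not part of the standard) that scans a sequence of 16-bit code units and rejects any unpaired surrogate:

    def is_well_formed_utf16(code_units) -> bool:
        """Return True if every surrogate in the sequence is properly paired."""
        i = 0
        while i < len(code_units):
            unit = code_units[i]
            if 0xD800 <= unit <= 0xDBFF:      # high surrogate: must be followed by a low one
                if i + 1 >= len(code_units) or not 0xDC00 <= code_units[i + 1] <= 0xDFFF:
                    return False
                i += 2                        # consume the whole surrogate pair
            elif 0xDC00 <= unit <= 0xDFFF:    # low surrogate not preceded by a high one
                return False
            else:
                i += 1                        # ordinary BMP code unit
        return True

    print(is_well_formed_utf16([0x0041, 0xD800, 0xDF02]))  # True
    print(is_well_formed_utf16([0xDF02, 0x0041]))          # False (starts with a low surrogate)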

For detailed discussions of the UTF-16 encoding and decoding algorithms, see the following two parts of this tutorial.

Note that our discussion so far has been limited to the UTF-16 encoding form that assigns a sequence of one or more 16-bit code units to each code point.

We now turn our attention to the UTF-16 encoding schemes, which produce byte sequences from code unit sequences (see Part 3 of this tutorial for a detailed discussion of this distinction).

Each UTF-16 encoding scheme specifies a byte order for the two bytes within each code unit, while the order of the code units themselves within their sequence is not affected.
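
As a sketch (again my own Python, assuming code units are given as plain integers), the encoding-scheme step boils down to serializing each 16-bit code unit in the chosen byte order; the BOM, when present, is just one more code unit (U+FEFF) prepended to the sequence:

    def serialize_utf16(code_units, byte_order='be', add_bom=False) -> bytes:
        """Serialize 16-bit code units to bytes; only the bytes inside each unit are reordered."""
        units = ([0xFEFF] if add_bom else []) + list(code_units)
        order = 'big' if byte_order == 'be' else 'little'
        return b''.join(unit.to_bytes(2, order) for unit in units)

    # U+10302 -> code units 0xD800 0xDF02 (see the table below)
    print(serialize_utf16([0xD800, 0xDF02], 'be').hex(' '))                # d8 00 df 02
    print(serialize_utf16([0xD800, 0xDF02], 'le').hex(' '))                # 00 d8 02 df
    print(serialize_utf16([0xD800, 0xDF02], 'le', add_bom=True).hex(' '))  # ff fe 00 d8 02 df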

The following examples illustrate the outcome of the full encoding process under the three encoding schemes UTF-16, UTF-16BE, and UTF-16LE.

Code point  UTF-16 code unit sequence  UTF-16BE hex bytes  UTF-16LE hex bytes  UTF-16 hex bytes
U+0041      0x0041                     00 41               41 00               00 41, FE FF 00 41, or FF FE 41 00
U+D81A      Invalid code point (falls in the high-surrogate range)
U+DE83      Invalid code point (falls in the low-surrogate range)
U+E012      0xE012                     E0 12               12 E0               E0 12, FE FF E0 12, or FF FE 12 E0
U+010302    0xD800 0xDF02              D8 00 DF 02         00 D8 02 DF         D8 00 DF 02, FE FF D8 00 DF 02, or FF FE 00 D8 02 DF
U+10FDCB    0xDBFF 0xDDCB              DB FF DD CB         FF DB CB DD         DB FF DD CB, FE FF DB FF DD CB, or FF FE FF DB CB DD
U+10FFFF    0xDBFF 0xDFFF              DB FF DF FF         FF DB FF DF         DB FF DF FF, FE FF DB FF DF FF, or FF FE FF DB FF DF
U+12345A    Invalid code point (larger than 0x10FFFF)
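
You can cross-check a few rows of this table with Python's built-in codecs (note that the plain utf-16 codec writes a BOM followed by the platform's native byte order, so the little-endian output below is what you will see on most machines):

    print('\u0041'.encode('utf-16-be').hex(' '))      # 00 41
    print('\U00010302'.encode('utf-16-be').hex(' '))  # d8 00 df 02
    print('\U00010302'.encode('utf-16-le').hex(' '))  # 00 d8 02 df
    print('\U00010302'.encode('utf-16').hex(' '))     # ff fe 00 d8 02 df on a little-endian machine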

If you need to review the byte order mark (BOM), namely the byte sequences 0xFEFF and 0xFFFE used in the UTF-16 encoding scheme above, refer to this earlier part of the tutorial.

For complete details on the algorithm used to encode a code point into a sequence of code units using the UTF-16 encoding form, refer to the next part of this tutorial.


This blog post is based on Sections 2.5, 3.8, 3.9 and 5.4 of The Unicode® Standard Version 8.0 – Core Specification (August 2015).
