UTF-16 Encoding Algorithm
Below is an interactive visualization of the encoding algorithm that takes as input a Unicode code point in hex notation and converts it to a byte string according to the UTF-16, UTF-16BE, and UTF-16LE encoding schemes.
[Interactive visualization: Step 1 asks for a code point between 0 and 10FFFF in hex notation.]
While the visualization above works for all code points, it is interesting to try specific code points to better understand the encoding of supplementary characters, that is, those that are outside the Basic Multilingual Plane (BMP).
For example, the smallest supplementary character has code point 0x10000, which reduces to 0x0 after the subtraction in Step 2. Later, in Step 4, inserting the bits 110110 in front of the first half (all zeros) yields the value 0xD800, which is the smallest possible value for a high surrogate.
Similarly, inserting the bits 110111 in front of the second half (of all zeros) yields the value 0xDC00, which is the smallest possible value for a low surrogate.
Therefore, in UTF-16 encoding, the smallest supplementary character gets assigned the smallest possible (component-wise) surrogate pair.
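The computation walked through above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code behind the visualization: the function name `to_surrogate_pair` is made up for this example, and the split into two 10-bit halves is assumed to be the step between the subtraction and the bit insertion.

```python
def to_surrogate_pair(code_point):
    """Return the (high, low) surrogate pair for a supplementary code point."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000   # subtraction: leaves a 20-bit value
    first_half = offset >> 10       # top 10 bits
    second_half = offset & 0x3FF    # bottom 10 bits
    high = 0xD800 | first_half      # bits 110110 followed by the first half
    low = 0xDC00 | second_half      # bits 110111 followed by the second half
    return high, low


print([hex(unit) for unit in to_surrogate_pair(0x10000)])  # ['0xd800', '0xdc00']
```

Because 0xD800 and 0xDC00 are exactly the prefixes 110110 and 110111 followed by ten zero bits, OR-ing in a 10-bit half is the same as placing the prefix in front of it.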
At the other end of the spectrum, the character with the largest possible code point, namely 0x10FFFF, gets assigned the largest possible (component-wise) surrogate pair, namely:
< 0xDBFF, 0xDFFF >
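To see how this pair turns into bytes under the different encoding schemes, one option is to ask Python's built-in codecs. The snippet below is a quick check rather than part of the algorithm itself; the variable names are arbitrary.

```python
ch = chr(0x10FFFF)

# UTF-16BE: the most significant byte of each 16-bit code unit comes first.
be = ch.encode("utf-16-be")
# UTF-16LE: the two bytes of each code unit are swapped.
le = ch.encode("utf-16-le")

print(be.hex(" "))  # db ff df ff
print(le.hex(" "))  # ff db ff df

# The plain "utf-16" codec prepends a byte order mark (BOM). Python picks
# the byte order here, typically the platform's native one, so the exact
# output depends on the machine.
with_bom = ch.encode("utf-16")
print(with_bom.hex(" "))
```

The first two results are fixed by the encoding scheme; only the last one can differ between machines.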
More generally, there is a one-to-one mapping between the set of supplementary character code points and the set of all surrogate pairs.
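The one-to-one claim can be checked by brute force. The sketch below assumes the usual surrogate ranges (0xD800 to 0xDBFF for high surrogates, 0xDC00 to 0xDFFF for low surrogates) and uses a made-up helper, `to_surrogate_pair`, that repeats the subtract, split, and prefix steps. It loops over about a million code points, so it takes a moment to run.

```python
HIGH_START, LOW_START = 0xD800, 0xDC00


def to_surrogate_pair(code_point):
    offset = code_point - 0x10000
    return HIGH_START | (offset >> 10), LOW_START | (offset & 0x3FF)


# One flag per possible pair: 1024 high surrogates x 1024 low surrogates.
seen = bytearray(0x400 * 0x400)
collisions = 0
for cp in range(0x10000, 0x110000):
    high, low = to_surrogate_pair(cp)
    slot = (high - HIGH_START) * 0x400 + (low - LOW_START)
    collisions += seen[slot]  # counts pairs that were produced more than once
    seen[slot] = 1

print(collisions)  # 0: no two code points share a surrogate pair
print(all(seen))   # True: every possible surrogate pair is produced
```

Both sets have 0x100000 = 1,048,576 elements, so a mapping with no collisions necessarily reaches every pair; the second check simply confirms it directly.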
In the next post, we will discuss the inverse mapping, that is, from byte sequences to Unicode character sequences.
The algorithm visualized in this post is described in plain text in this RFC.