
Wednesday, June 1, 2016

Unicode Tutorial - Part 7: Visualization of the UTF-16 Decoding Algorithm


UTF-16 Decoding Algorithm


Below is an interactive visualization of the decoding algorithm that takes as input a byte sequence representing a single Unicode character in the UTF-16, UTF-16BE, or UTF-16LE encoding scheme and converts it to the corresponding Unicode code point in hex notation.

[Interactive demo: enter a byte sequence for ONE Unicode character in hex notation (for example, D8 12 DC 34 or FFFE3412), then pick one of the encoding schemes: UTF-16BE, UTF-16LE, or UTF-16.]

The algorithm visualized in this post is described in plain text in this RFC.
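
If you prefer code to an interactive widget, here is a minimal Python sketch of the same decoding steps. The function name and structure are my own (this is not the code behind the visualization); it handles a byte sequence for exactly one code point and honors an optional BOM for the UTF-16 scheme.

    def decode_utf16(data: bytes, scheme: str = "UTF-16") -> int:
        """Decode a byte sequence for ONE code point (illustrative sketch)."""
        byte_order = "big"                      # UTF-16BE, and the default for UTF-16
        if scheme == "UTF-16LE":
            byte_order = "little"
        elif scheme == "UTF-16":
            # In the UTF-16 scheme, an optional initial BOM selects the byte order.
            if data[:2] == b"\xFE\xFF":
                data = data[2:]
            elif data[:2] == b"\xFF\xFE":
                byte_order, data = "little", data[2:]

        # Regroup the bytes into 16-bit code units.
        units = [int.from_bytes(data[i:i + 2], byte_order) for i in range(0, len(data), 2)]

        if len(units) == 1:
            if 0xD800 <= units[0] <= 0xDFFF:
                raise ValueError("a lone surrogate does not encode a code point")
            return units[0]                     # BMP code point: identity mapping

        if len(units) == 2:
            high, low = units                   # surrogate pair
            if 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF:
                return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
        raise ValueError("ill-formed UTF-16 byte sequence")

    print(hex(decode_utf16(bytes.fromhex("D812DC34"), "UTF-16BE")))   # 0x14834
    print(hex(decode_utf16(bytes.fromhex("FFFE3412"), "UTF-16")))     # 0x1234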

Monday, May 30, 2016

Unicode Tutorial - Part 5: UTF-16

In Part 3 of this tutorial, we distinguished Unicode encoding forms (namely, UTF-8, UTF-16, and UTF-32) from Unicode encoding schemes (namely, UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE).

This post deals exclusively with UTF-16, both the encoding form and the encoding scheme.

Recall that UTF-16 is a variable-width encoding: while every single code unit is 16 bits long (hence the name UTF-16), some code points are encoded as a single code unit while others are encoded as a sequence of two code units.

More precisely, each valid code point that fits within 16 bits, namely each code point in the BMP, simply gets encoded as a single 16-bit code unit that is identical to the binary representation of the numerical code point itself. For example, the code point U+AB12 gets encoded as the single code unit 0xAB12.

All supplementary characters (i.e., those in planes 1 through 16) have code points that do not fit within 16 bits. Each supplementary character gets encoded as a sequence of two 16-bit code units called a surrogate pair.

Each surrogate pair comprises a high (or leading) surrogate and a low (or trailing) surrogate. Since each surrogate is a 16-bit code unit, it must belong to the range of the BMP. Indeed, two specific sub-ranges of the BMP were reserved to encode surrogates, namely:
  • the range 0xD800 to 0xDBFF is reserved for high surrogates, and
  • the range 0xDC00 to 0xDFFF is reserved for low surrogates.

It is unfortunate that all of the values in the high-surrogate range are smaller than all of the values in the low-surrogate range. The "high" and "low" prefixes refer not to the values of the surrogates but to their position in the pair, namely <high surrogate, low surrogate>, where "high" and "low" mean "most significant" and "least significant", respectively.

Note that each one of the surrogate ranges contains exactly \(2^{10} =1,024\) code points. Therefore, the set of surrogate pairs can encode exactly \(2^{10}\cdot 2^{10} = 2^{20} = 1,048,576\) code points, which is precisely the number of supplementary characters (16 planes of 65,536 code points each together add up to \(2^4\cdot 2^{16} = 2^{20}\) code points).
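
To make the arithmetic above concrete, here is a small Python sketch (the function name is mine) that maps a code point to its UTF-16 code unit sequence: the identity mapping for BMP code points and a computed surrogate pair for supplementary ones.

    def utf16_code_units(code_point):
        """Map a code point to its UTF-16 code unit sequence (illustrative sketch)."""
        if code_point <= 0xFFFF:
            # BMP code point (assumed not to lie in the reserved surrogate range):
            # the single code unit is the code point itself.
            return [code_point]
        # Supplementary code point: subtract 0x10000, leaving a 20-bit value, then
        # split it into two 10-bit halves and offset them into the surrogate ranges.
        value = code_point - 0x10000
        high = 0xD800 + (value >> 10)          # top 10 bits
        low = 0xDC00 + (value & 0x3FF)         # bottom 10 bits
        return [high, low]

    print([hex(u) for u in utf16_code_units(0xAB12)])    # ['0xab12']
    print([hex(u) for u in utf16_code_units(0x10302)])   # ['0xd800', '0xdf02']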

Because Unicode is designed to avoid overlap between the sequences of code units of different characters, the code units between 0xD800 and 0xDFFF (i.e., those reserved for either half of a surrogate pair) are NOT allowed, on their own, to represent any characters.

As a result, since 2,048 code points are reserved for surrogate values, the actual number of available code points in the range 0x000000-0x10FFFF is not 1,114,112 but rather 1,114,112 - 2,048 = 1,112,064. Of course, since all Unicode encodings cover exactly the same set of characters, the code points between 0xD800 and 0xDFFF are also forbidden (on their own) in UTF-8 and UTF-32.
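
As a quick sanity check, the same count in Python:

    print(0x110000)            # 1,114,112 code points in the range 0x0000..0x10FFFF
    print(0x110000 - 0x800)    # 1,112,064 after removing the 2,048 surrogate values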

Because of the non-overlap constraint imposed on distinct code unit sequences, well-formed UTF-16-encoded text may not start with a low surrogate. In fact, if any low surrogate appears in UTF-16-encoded text, it must be immediately preceded by a high surrogate. In other words, a low surrogate may not be preceded by another low surrogate or by a non-surrogate code unit, nor may it be the first code unit in the text.

Similarly, in well-formed UTF-16-encoded text, any high surrogate must be immediately followed by a low surrogate. It may not appear at the end of the encoded text, nor be followed by another high surrogate or a non-surrogate code unit.
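
These two constraints amount to a simple well-formedness check over a sequence of 16-bit code units. The sketch below (again, the function name is mine) verifies that every low surrogate is immediately preceded by a high surrogate and that every high surrogate is immediately followed by a low surrogate.

    def is_well_formed_utf16(units):
        """Check surrogate pairing in a sequence of 16-bit code units (sketch)."""
        i = 0
        while i < len(units):
            unit = units[i]
            if 0xDC00 <= unit <= 0xDFFF:
                return False                  # low surrogate with no preceding high surrogate
            if 0xD800 <= unit <= 0xDBFF:
                # A high surrogate must be immediately followed by a low surrogate.
                if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                    return False
                i += 2                        # consume the whole surrogate pair
            else:
                i += 1                        # ordinary BMP code unit
        return True

    print(is_well_formed_utf16([0x0041, 0xD800, 0xDF02]))   # True
    print(is_well_formed_utf16([0xDF02, 0xD800]))            # False (low surrogate first)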

For detailed discussions of the UTF-16 encoding and decoding algorithms, see the following two parts of this tutorial.

Note that our discussion so far has been limited to the UTF-16 encoding form that assigns a sequence of one or two 16-bit code units to each code point.

We now turn our attention to the UTF-16 encoding schemes, which produce byte sequences from code unit sequences (see a detailed discussion of this distinction).

Each UTF-16 encoding scheme specifies a byte order for the two bytes within each code unit, while the order of the code units themselves within their sequence is not affected.

The following examples illustrate the outcome of the full encoding process using the three UTF-16, UTF-16BE, and UTF-16LE encoding schemes.

code point | UTF-16 code unit sequence | UTF-16BE hex bytes | UTF-16LE hex bytes | UTF-16 hex bytes
U+0041     | 0x0041                    | 00 41              | 41 00              | 00 41, FE FF 00 41, or FF FE 41 00
U+D81A     | Invalid code point (falls in the high surrogate range)
U+DE83     | Invalid code point (falls in the low surrogate range)
U+E012     | 0xE012                    | E0 12              | 12 E0              | E0 12, FE FF E0 12, or FF FE 12 E0
U+010302   | 0xD800 0xDF02             | D8 00 DF 02        | 00 D8 02 DF        | D8 00 DF 02, FE FF D8 00 DF 02, or FF FE 00 D8 02 DF
U+10FDCB   | 0xDBFF 0xDDCB             | DB FF DD CB        | FF DB CB DD        | DB FF DD CB, FE FF DB FF DD CB, or FF FE FF DB CB DD
U+10FFFF   | 0xDBFF 0xDFFF             | DB FF DF FF        | FF DB FF DF        | DB FF DF FF, FE FF DB FF DF FF, or FF FE FF DB FF DF
U+12345A   | Invalid code point (larger than 0x10FFFF)

If you need to review the byte order mark (BOM), namely the byte sequences 0xFEFF and 0xFFFE used in the UTF-16 encoding scheme above, refer to this earlier part of the tutorial.
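
If Python is handy, a few rows of the table can be checked directly with its built-in codecs, which implement the three UTF-16 encoding schemes. One caveat: the plain "utf-16" codec prepends a BOM in the machine's native byte order, so its output corresponds to the second or third alternative in the table's last column rather than the BOM-less default.

    # Reproducing the U+010302 row of the table with Python's built-in codecs.
    char = "\U00010302"

    print(char.encode("utf-16-be").hex(" "))   # d8 00 df 02
    print(char.encode("utf-16-le").hex(" "))   # 00 d8 02 df
    # The plain "utf-16" codec prepends a BOM and uses the machine's native
    # byte order; on a little-endian machine this prints: ff fe 00 d8 02 df
    print(char.encode("utf-16").hex(" "))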

For complete details on the algorithm used to encode a code point into a sequence of code units using the UTF-16 encoding form, refer to the next part of this tutorial.


This blog post is based on Sections 2.5, 3.8, 3.9 and 5.4 of The Unicode® Standard Version 8.0 – Core Specification (August 2015).

Tuesday, May 24, 2016

Unicode Tutorial - Part 3: Character Encoding Form versus Character Encoding Scheme

Let's start by reviewing the general framework for encoding schemes described in this post:


bit pattern (one or more code units) \( \longleftrightarrow \) code point \( \longleftrightarrow \) (abstract) character


in which the rightmost arrow indicates the mapping between the repertoire (set of abstract characters) and the code space (set of numbers or code points) of the encoding, as discussed in Part 2 of this tutorial. The leftmost mapping, on the other hand, represents the fact that each code point is represented as a sequence of bits (made of one or more code units) in the computer's memory.

In the previous post, we discussed the mapping from characters to code points. In this post, we focus on the leftmost mapping, that is, the way bit patterns are assigned to code points.

Since Unicode is an international standard for data interchange over computer networks, the code points, which are numbers, must be encoded as fixed-size bit patterns (e.g., octets or 8-bit bytes, 32-bit words, etc.) for computers to process, transmit, and store. These fixed-size bit patterns are called code units and this fixed number of bits is called the code unit size (see this post for a refresher).

Each character encoding model is therefore characterized by 1) its code unit size and 2) the way each code point is encoded into (or mapped to) a sequence of one or more code units. In the Unicode standard, this mapping is called a "character encoding form" (the "character encoding scheme" is different and will be defined below).

Definition 1: A character encoding form (or CEF) is a mapping from the code space (i.e., the set of code points) to the set of sequences of code units.

In other words, in the Unicode standard, a CEF specifies how each integer that is a valid code point is represented as a bit string, which is the concatenation of one or more fixed-size code units.

The Unicode standard defines three distinct CEFs that use 8-bit, 16-bit, and 32-bit units and are named UTF-8, UTF-16, and UTF-32, respectively. The UTF acronym is a carryover from earlier terminology and stands for Unicode (or UCS) Transformation Format. Each one of UTF-8, UTF-16 and UTF-32 will be described in its own post in upcoming parts of this tutorial.

Note that each one of these three CEFs is a legitimate encoding of all Unicode characters. They differ not only in the bit strings that they assign to the code points and in their code unit size, but also in the number of code units in the sequences that are assigned to each code point. This leads us to define two kinds of CEFs.

Definition 2: A CEF has fixed width if and only if it maps all code points in the code space to (code unit) sequences of the same length. Otherwise, the CEF is said to have variable width.

For example, UTF-32 has fixed width because it maps all code points to a single 32-bit code unit. Similarly, ISO/IEC 8859-1 has fixed width because it maps all code points in its repertoire to a single 8-bit code unit.

In contrast, UTF-16 has variable width because it maps some code points to one 16-bit code unit and others to a sequence of two 16-bit code units. Similarly, UTF-8 has variable width because the sequences to which code points are mapped may contain 1, 2, 3, or 4 8-bit code units.
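
One quick way to see the fixed/variable width distinction in practice is to count the code units each form produces for a few characters. The Python sketch below divides the number of encoded bytes by the code unit size of each form (the "-be" codecs are used here only to avoid a BOM).

    # Number of code units per character in each Unicode encoding form.
    for char in ["A", "é", "中", "\U00010302"]:
        utf8_units = len(char.encode("utf-8"))            # 1 byte per code unit
        utf16_units = len(char.encode("utf-16-be")) // 2   # 2 bytes per code unit
        utf32_units = len(char.encode("utf-32-be")) // 4   # 4 bytes per code unit
        print(f"U+{ord(char):04X}: UTF-8={utf8_units}, UTF-16={utf16_units}, UTF-32={utf32_units}")
    # U+0041: UTF-8=1, UTF-16=1, UTF-32=1
    # U+00E9: UTF-8=2, UTF-16=1, UTF-32=1
    # U+4E2D: UTF-8=3, UTF-16=1, UTF-32=1
    # U+10302: UTF-8=4, UTF-16=2, UTF-32=1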

Character encoding forms define a one-to-one mapping between code points and code unit sequences. As such, they are sufficient when dealing with characters stored in memory of any given computer. After all, a code point is just an integer and a code unit sequence is just a bit string. Any computer can handle those data types easily.

However, things are a bit more complicated when textual data must be stored in a file on disk or sent to another computer over a network. Since the basic unit of storage or communication is the octet or 8-bit byte (hereafter referred to simply as a byte), integers and bit strings that do not fit in one byte must be handled carefully. In particular, these bit strings must be serialized, that is, broken down into a sequence of bytes.

Unfortunately, different computers (more specifically, processor architectures) may serialize multi-byte bit strings in different ways. For example, Little Endian architectures serialize bit strings from least significant to most significant (i.e., they put the "little end" first), while Big Endian architectures serialize bit strings from most significant to least significant (i.e., they put the "big end" first).

For example, the (made-up) 3-byte value 0x2468AC would be serialized as 24 68 AC in Big Endian and as AC 68 24 in Little Endian.
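
The same serialization can be written with Python's int.to_bytes, shown here only to make the two byte orders concrete:

    value = 0x2468AC                                   # the made-up 3-byte value from the text
    print(value.to_bytes(3, "big").hex(" "))           # 24 68 ac
    print(value.to_bytes(3, "little").hex(" "))        # ac 68 24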

This difference does not matter for operations on code unit sequences that take place within the memory of a given computer, since each computer consistently uses the same byte order for all of its internal operations. 

However, the byte order becomes important when encoded text (i.e., the code unit sequences of its characters) is communicated through data streams that connect to file storage or network communication links. To avoid corruption of the encoded text, the computers involved must agree on the byte serialization scheme.

This is where character encoding schemes come into play.

Definition 3: A character encoding scheme (or CES) specifies both a CEF and the way that the code unit sequences of the CEF are serialized into bytes. 

The Unicode standard specifies 3 character encoding forms (as discussed above) and 7 character encoding schemes, as follows:
  • UTF-32BE: UTF-32 plus Big Endian serialization
  • UTF-32LE: UTF-32 plus Little Endian serialization
  • UTF-32: Uses Big Endian by default (or a byte order mark, see below). In this case, the CEF and the CES have the same name, which must be disambiguated by the context.
  • UTF-16BE: UTF-16 plus Big Endian serialization
  • UTF-16LE: UTF-16 plus Little Endian serialization
  • UTF-16: Similar case to UTF-32 in the third bullet point above
  • UTF-8: Since each code point maps to a sequence of code units that are bytes already, there is no need to serialize the code units. The bytes are serialized in the same order as they appear in the code unit sequence.
Let's now consider some examples, keeping in mind that two hex digits (as used in the following examples) together correspond to one byte.

Example 1: In the UTF-8 scheme, the UTF-8 (CEF) code unit sequence
4D D0 B0 E4 BA 8C F0 90 8C 82 
is directly serialized as: 
4D D0 B0 E4 BA 8C F0 90 8C 82 

Example 2: In the UTF-16BE scheme, the UTF-16 code unit sequence
004D 0430 4E8C D800 DF02 
is serialized as: 
00 4D 04 30 4E 8C D8 00 DF 02 

Example 3: In the UTF-16LE scheme, the UTF-16 code unit sequence
004D 0430 4E8C D800 DF02 
is serialized as: 
4D 00 30 04 8C 4E 00 D8 02 DF 

Example 4: In the UTF-32BE scheme, the UTF-32 code unit sequence
0000004D 00000430 00004E8C 00010302 
is serialized as: 
00 00 00 4D 00 00 04 30 00 00 4E 8C 00 01 03 02

Example 5: In the UTF-32LE scheme, the UTF-32 code unit sequence
0000004D 00000430 00004E8C 00010302 
is serialized as: 
4D 00 00 00 30 04 00 00 8C 4E 00 00 02 03 01 00
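
Examples 2 through 5 can be reproduced by applying the chosen byte order to each code unit separately. Here is a minimal Python sketch (the helper name is mine) of that serialization step:

    def serialize(code_units, unit_bytes, byte_order):
        """Serialize a code unit sequence into bytes (the CES step, as a sketch).

        unit_bytes is 2 for UTF-16 code units and 4 for UTF-32 code units;
        byte_order is "big" or "little".
        """
        return b"".join(unit.to_bytes(unit_bytes, byte_order) for unit in code_units)

    utf16_units = [0x004D, 0x0430, 0x4E8C, 0xD800, 0xDF02]
    print(serialize(utf16_units, 2, "big").hex(" "))      # Example 2: 00 4d 04 30 4e 8c d8 00 df 02
    print(serialize(utf16_units, 2, "little").hex(" "))   # Example 3: 4d 00 30 04 8c 4e 00 d8 02 df

    utf32_units = [0x0000004D, 0x00000430, 0x00004E8C, 0x00010302]
    print(serialize(utf32_units, 4, "big").hex(" "))      # Example 4: 00 00 00 4d 00 00 04 30 ...
    print(serialize(utf32_units, 4, "little").hex(" "))   # Example 5: 4d 00 00 00 30 04 00 00 ...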

There is one more feature of encoding schemes that I'd like to mention. How do Unicode text processors determine the endianness of a piece of Unicode text when the byte order is not specified, as in the case of the UTF-16 and UTF-32 encoding schemes?

Definition 4: The Unicode standard defines a special character called the byte order mark or BOM that appears at the head of a data stream in order to unambiguously signal the byte order of the code units that follow.

For example, in the UTF-16 encoding scheme, an initial byte sequence "FE FF" is interpreted as a byte order mark indicating Big Endian order. In contrast, the initial byte sequence "FF FE" indicates Little Endian order. 

Note that the BOM is not part of the contents of the file or the data in the packet. U+FEFF and U+FFFE were chosen for this role because they are highly unlikely to represent encoded text. In fact, U+FFFE is a noncharacter and U+FEFF is the format character named ZERO WIDTH NO-BREAK SPACE (see Part 2 of this tutorial).

Let's conclude with an example involving the BOM.

Example 6: Consider the UTF-16 code unit sequence: D800 DF02 
In UTF-16BE, it is serialized as: D8 00 DF 02
In UTF-16LE, it is serialized as: 00 D8 02 DF 
But in UTF-16, it could be serialized as:
  • FE FF D8 00 DF 02 (Big Endian with BOM), 
  • FF FE 00 D8 02 DF (Little Endian with BOM), or
  • D8 00 DF 02 (Big Endian, no BOM, i.e., the default). 
To summarize this post, the mapping from code points to bit strings shown above as:

bit pattern (one or more code units) \( \longleftrightarrow \) code point

in Unicode is really a two-step mapping as follows:

byte sequence\( \stackrel{byte\ ordering}{\longleftrightarrow} \)code unit sequence\(\stackrel{CEF}{\longleftrightarrow} \)code point

Having covered the 7 character encoding schemes in sufficient detail, we will turn our attention back to the 3 character encoding forms in the next three parts of this tutorial.

This blog post is based on Sections 2.5, 2.6, 3.9 and 3.10 of The Unicode® Standard Version 8.0 – Core Specification (August 2015) and Unicode Technical Report #17 (2008).