Wednesday, June 1, 2016

Unicode Tutorial - Part 7: Visualization of the UTF-16 Decoding Algorithm


UTF-16 Decoding Algorithm


Below is an interactive visualization of the decoding algorithm that takes as input a byte sequence representing a single Unicode character in the UTF-16, UTF-16BE, or UTF-16LE encoding scheme and converts it to the corresponding Unicode code point in hex notation.

[Interactive widget: enter a byte sequence for ONE Unicode character in hex notation (for example, D8 12 DC 34 or FFFE3412), then pick one of the UTF-16BE, UTF-16LE, or UTF-16 encoding schemes.]
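
For readers who prefer code to widgets, here is a minimal Java sketch of the same decoding logic; the class and method names are illustrative only, and the input bytes are assumed to already be in big-endian (UTF-16BE) order:

// Minimal sketch: decode a single UTF-16BE byte sequence (2 or 4 bytes)
// into the corresponding Unicode code point. Assumes well-formed input.
class Utf16Decoder {

    static int decodeUtf16BE(byte[] b) {
        // Combine each pair of bytes into one 16-bit code unit (big-endian order).
        int first = ((b[0] & 0xFF) << 8) | (b[1] & 0xFF);
        if (b.length == 2) {
            return first;                    // BMP character: code point == code unit
        }
        int second = ((b[2] & 0xFF) << 8) | (b[3] & 0xFF);
        // Surrogate pair: undo the encoding described in Part 6 of this tutorial.
        return 0x10000 + ((first - 0xD800) << 10) + (second - 0xDC00);
    }

    public static void main(String[] args) {
        byte[] pair = { (byte) 0xD8, 0x12, (byte) 0xDC, 0x34 };
        System.out.printf("U+%04X%n", decodeUtf16BE(pair));   // prints U+14834
    }
}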


The algorithm visualized in this post is described in plain text in this RFC.

Monday, May 30, 2016

Unicode Tutorial - Part 5: UTF-16

In Part 3 of this tutorial, we distinguished Unicode encoding forms (namely, UTF-8, UTF-16, and UTF-32) from Unicode encoding schemes (namely, UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE).

This post deals exclusively with UTF-16, both the encoding form and the encoding scheme.

Recall that UTF-16 is a variable-width encoding: while every single code unit is 16 bits long (hence the name UTF-16), some code points are encoded as a single code unit while others are encoded as a sequence of two code units.

More precisely, each valid code point that fits within 16 bits, namely each one in the BMP, simply gets encoded as a single 16-bit code unit that is identical to the binary representation of the numerical code point itself. For example, the code point U+AB12 gets encoded as the single code unit 0xAB12.

All supplementary characters (i.e., those in planes 1 through 16) have code points that do not fit within 16 bits. Each supplementary character gets encoded as a sequence of two 16-bit code units called a surrogate pair.

Each surrogate pair comprises a high (or leading) surrogate and a low (or trailing) surrogate. Since each surrogate is a 16-bit code unit, it must belong to the range of the BMP. Indeed, two specific sub-ranges of the BMP were reserved to encode surrogates, namely:
  • the range 0xD800 to 0xDBFF is reserved for high surrogates, and
  • the range 0xDC00 to 0xDFFF is reserved for low surrogates.

Somewhat confusingly, all of the values in the high-surrogate range are smaller than all of the values in the low-surrogate range. The "high" and "low" prefixes refer not to the values of the surrogates but to their position in the pair, namely <high surrogate, low surrogate>, where "high" and "low" mean "most significant" and "least significant", respectively.

Note that each one of the surrogate ranges contains exactly \(2^{10} =1,024\) code points. Therefore, the set of surrogate pairs can encode exactly \(2^{10}\cdot 2^{10} = 2^{20} = 1,048,576\) code points, which is precisely the number of supplementary characters (16 planes of 65,536 code points each together add up to \(2^4\cdot 2^{16} = 2^{20}\) code points).

Because Unicode is designed to avoid overlap between the sequences of code units of different characters, the code units between 0xD800 and 0xDFFF (i.e., those reserved for either half of a surrogate pair) are NOT allowed, on their own, to represent any characters.

As a result, since 2,048 code points are reserved for surrogate values, the actual number of available code points in the range 0x000000-0x10FFFF is not 1,114,112 but rather 1,114,112 - 2,048 = 1,112,064. Of course, since all Unicode encodings cover exactly the same set of characters, the code points between 0xD800 and 0xDFFF are also forbidden (on their own) in UTF-8 and UTF-32.

Because of the non-overlap constraint imposed on distinct code unit sequences, a well-formed UTF-16-encoded text may not start with a low surrogate. More generally, any low surrogate that appears in UTF-16-encoded text must be immediately preceded by a high surrogate: it may not be preceded by another low surrogate or by any non-surrogate code unit, nor may it be the first code unit in the text.

Similarly, in well-formed UTF-16-encoded text, any high surrogate must be followed by a low surrogate. It may not appear at the end of the encoded text nor be followed by another high surrogate or a non-surrogate.
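
These two rules are easy to check with the surrogate predicates of Java's Character class; here is a minimal sketch (class and method names mine):

// Minimal sketch: check that a sequence of UTF-16 code units is well formed,
// i.e., every high surrogate is followed by a low surrogate and low
// surrogates never appear on their own.
class Utf16Check {

    static boolean isWellFormed(char[] units) {
        for (int i = 0; i < units.length; i++) {
            if (Character.isHighSurrogate(units[i])) {
                // A high surrogate must be immediately followed by a low surrogate.
                if (i + 1 == units.length || !Character.isLowSurrogate(units[i + 1])) {
                    return false;
                }
                i++;  // skip the low surrogate we just validated
            } else if (Character.isLowSurrogate(units[i])) {
                // A low surrogate with no preceding high surrogate is ill-formed.
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed(new char[] { 0x0041, 0xD800, 0xDF02 })); // true
        System.out.println(isWellFormed(new char[] { 0xDF02, 0x0041 }));         // false
    }
}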

For detailed discussions of the UTF-16 encoding and decoding algorithms, see the following two parts of this tutorial.

Note that our discussion so far has been limited to the UTF-16 encoding form that assigns a sequence of one or more 16-bit code units to each code point.

We now turn our attention to the UTF-16 encoding schemes, which produce byte sequences from code unit sequences (see a detailed discussion of this distinction).

Each UTF-16 encoding scheme specifies a byte order for the two bytes within each code unit, while the order of the code units themselves within their sequence is not affected.

The following examples illustrate the outcome of the full encoding process using the three encoding schemes UTF-16BE, UTF-16LE, and UTF-16.

Code point   UTF-16 code unit   UTF-16BE      UTF-16LE      UTF-16
             sequence           hex bytes     hex bytes     hex bytes
U+0041       0x0041             00 41         41 00         00 41, FE FF 00 41, or FF FE 41 00
U+D81A       Invalid code point (falls in the high-surrogate range)
U+DE83       Invalid code point (falls in the low-surrogate range)
U+E012       0xE012             E0 12         12 E0         E0 12, FE FF E0 12, or FF FE 12 E0
U+010302     0xD800 0xDF02      D8 00 DF 02   00 D8 02 DF   D8 00 DF 02, FE FF D8 00 DF 02, or FF FE 00 D8 02 DF
U+10FDCB     0xDBFF 0xDDCB      DB FF DD CB   FF DB CB DD   DB FF DD CB, FE FF DB FF DD CB, or FF FE FF DB CB DD
U+10FFFF     0xDBFF 0xDFFF      DB FF DF FF   FF DB FF DF   DB FF DF FF, FE FF DB FF DF FF, or FF FE FF DB FF DF
U+12345A     Invalid code point (larger than 0x10FFFF)
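
The byte sequences above can be checked with a few lines of Java. The sketch below reproduces the U+010302 row; it relies on the fact that Java's "UTF-16" charset writes a big-endian BOM when encoding, which corresponds to the second alternative in the last column of the table:

import java.nio.charset.StandardCharsets;

// Sketch: reproduce one row of the table above (U+010302) with Java's
// built-in charsets. Character.toChars() returns the UTF-16 code unit
// sequence, here the surrogate pair 0xD800 0xDF02.
class Utf16SchemesDemo {
    public static void main(String[] args) {
        String s = new String(Character.toChars(0x10302));
        print("UTF-16BE", s.getBytes(StandardCharsets.UTF_16BE)); // D8 00 DF 02
        print("UTF-16LE", s.getBytes(StandardCharsets.UTF_16LE)); // 00 D8 02 DF
        print("UTF-16  ", s.getBytes(StandardCharsets.UTF_16));   // FE FF D8 00 DF 02
    }

    static void print(String label, byte[] bytes) {
        StringBuilder sb = new StringBuilder(label + ":");
        for (byte b : bytes) sb.append(String.format(" %02X", b));
        System.out.println(sb);
    }
}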

If you need to review the byte order mark (BOM), namely the byte sequences 0xFEFF and 0xFFFE used in the UTF-16 encoding scheme above, refer to this earlier part of the tutorial.

For complete details on the algorithm used to encode a code point into a sequence of code units using the UTF-16 encoding form, refer to the next part of this tutorial.


This blog post is based on Sections 2.5, 3.8, 3.9 and 5.4 of The Unicode® Standard Version 8.0 – Core Specification (August 2015).

Saturday, May 28, 2016

Unicode Tutorial - Part 6: Visualization of the UTF-16 Encoding Algorithm


UTF-16 Encoding Algorithm


Below is an interactive visualization of the encoding algorithm that takes as input a Unicode code point in hex notation and converts it to a byte string according to the UTF-16, UTF-16BE, and UTF-16LE encoding schemes.


[Interactive widget: enter a code point between 0 and 10FFFF in hex notation.]

While the visualization above works for all code points, it is interesting to try specific code points to better understand the encoding of supplementary characters, that is, those that are outside the Basic Multilingual Plane (BMP).

For example, the smallest supplementary character has code point 0x10000, which reduces to 0x0 after the subtraction in Step 2. Later, in Step 4, inserting the bits 110110 in front of the first half (of all zeros) yields the value 0xD800, which is the smallest possible value for a high surrogate.

Similarly, inserting the bits 110111 in front of the second half (of all zeros) yields the value 0xDC00, which is the smallest possible value for a low surrogate.

Therefore, in UTF-16 encoding, the smallest supplementary character gets assigned the smallest possible (component-wise) surrogate pair.

At the other end of the spectrum, the character with the largest possible code point, namely 0x10FFFF, gets assigned the largest possible (component-wise) surrogate pair, namely:

 < 0xDBFF, 0xDFFF >

More generally, there is a one-to-one mapping between the set of supplementary character code points and the set of all surrogate pairs.
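
Here is a small Java sketch of the computation visualized above for supplementary code points (class and method names mine); BMP code points need no computation at all, since their single code unit equals the code point itself:

class Utf16Encoder {

    // Sketch of the surrogate-pair computation visualized above, for a
    // supplementary code point (0x10000 <= cp <= 0x10FFFF).
    static char[] toSurrogatePair(int cp) {
        int v = cp - 0x10000;                       // Step 2: subtract 0x10000, leaving 20 bits
        char high = (char) (0xD800 | (v >>> 10));   // Step 4: prefix 110110 to the top 10 bits
        char low  = (char) (0xDC00 | (v & 0x3FF));  // Step 4: prefix 110111 to the bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogatePair(0x10FFFF);
        System.out.printf("<0x%04X, 0x%04X>%n", (int) pair[0], (int) pair[1]); // <0xDBFF, 0xDFFF>
    }
}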

In the next post, we will discuss the inverse mapping, that is, from byte sequences to Unicode character sequences.



The algorithm visualized in this post is described in plain text in this RFC.

Friday, May 27, 2016

Unicode Tutorial - Part 4: UTF-32, with Java code samples

In Part 3 of this tutorial, we distinguished Unicode encoding forms (namely, UTF-8, UTF-16, and UTF-32) from Unicode encoding schemes (namely, UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE).

This post deals exclusively with UTF-32, both the encoding form and the encoding scheme.

Recall that the UTF-32 encoding form has fixed width because it maps each code point in the Unicode coded character set to a single 32-bit code unit. Since the Unicode code space contains 1,114,112 code points, each code point is a number between 0x000000 and 0x10FFFF and thus fits within 21 bits.

Since a single 32-bit integer can represent a code point with 11 bits to spare, UTF-32 is the simplest of all Unicode encoding forms: it encodes each code point value as a 32-bit integer with the 11 leading bits set to 0. For example, the code point 97 (in decimal), that is, 0x61,  is represented by the code unit:

00000000 00000000 00000000 01100001
Note that any UTF-32 code unit whose value is greater than 0x0010FFFF is ill-formed (too large). Furthermore, because surrogate values used in UTF-16 (to be discussed in the next part of this tutorial) are not by themselves valid code points, UTF-32 code units in the range 0x00D800 through 0x00DFFF are also ill-formed.
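
These two conditions are straightforward to check programmatically; here is a minimal Java sketch (class and method names mine):

class Utf32Check {
    // A 32-bit value is a well-formed UTF-32 code unit iff it is at most
    // 0x10FFFF and does not fall in the surrogate range 0xD800-0xDFFF.
    static boolean isWellFormed(int unit) {
        return unit >= 0 && unit <= 0x10FFFF && !(unit >= 0xD800 && unit <= 0xDFFF);
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed(0x61));     // true
        System.out.println(isWellFormed(0xD812));   // false (surrogate)
        System.out.println(isWellFormed(0x110000)); // false (too large)
    }
}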

The rest of this post will use the Java programming language to manipulate character encodings in general and UTF-32 more specifically.

The following Java program uses the availableCharsets() and defaultCharset() static methods of the Charset class to list all of the available character encodings as well as the default charset on the machine on which it is executed.

Note that the former method returns a SortedMap data structure. In the for loop, I iterate through the key-value pairs in this structure to print the canonical name of each Charset (the key of the pair) followed by the list of aliases of that Charset (the value of the pair).

import java.nio.charset.Charset;
import java.util.Map;
import java.util.SortedMap;

class CharacterEncodings {
    public static void main(String[] args) {
        Charset defaultCharset = Charset.defaultCharset();
        String defaultCharsetName = defaultCharset.displayName();
        SortedMap<String,Charset> availableCharsets = Charset.availableCharsets();
        System.out.println("Your JVM supports the following " + 
                           availableCharsets.size() +
                           " character sets,\nwith the default" +
                           " one marked by asterisks:\n");
        for(Map.Entry<String,Charset> entry : availableCharsets.entrySet()) {
            if (defaultCharsetName.equalsIgnoreCase(entry.getValue().displayName())) {
                System.out.print(" ( * * * * * * * * * * * * * * * * * * * * ) ");
            }
            System.out.println(entry.getKey() + " " + entry.getValue().aliases());
        }
    }// main method   
}// CharacterEncodings class

The (truncated) output of this program is shown below. While the UTF-32 encoding is available on my system, it is not the one used by default (e.g., when writing text files). Instead, UTF-8 is the default, which is not surprising since it is a common encoding for text files, for HTML files, etc. In fact, UTF-8 is the dominant encoding on the web.

Your JVM supports the following 169 character sets,
with the default one marked by asterisks:

Big5 [csBig5]
...
TIS-620 [tis620.2533, tis620]
US-ASCII [cp367, ascii7, ISO646-US, 646, csASCII, us, iso_646.irv:1983, 
          ISO_646.irv:1991, IBM367, ASCII, default, ANSI_X3.4-1986, 
          ANSI_X3.4-1968, iso-ir-6]
UTF-16 [utf16, UnicodeBig, UTF_16, unicode]
UTF-16BE [X-UTF-16BE, UTF_16BE, ISO-10646-UCS-2, UnicodeBigUnmarked]
UTF-16LE [UnicodeLittleUnmarked, UTF_16LE, X-UTF-16LE]
UTF-32 [UTF32, UTF_32]
UTF-32BE [X-UTF-32BE, UTF_32BE]
UTF-32LE [X-UTF-32LE, UTF_32LE]
 ( * * * * * * * * * * * * * * * * * * * * ) UTF-8 [unicode-1-1-utf-8, UTF8]
windows-1250 [cp1250, cp5346]
...
x-windows-iso2022jp [windows-iso2022jp]


Note that the output of this program will most likely be different on your machine. For example, the default charset (indicated by asterisks above) is determined during the startup phase of the JVM and typically depends on the locale and the charset of your operating system.

In the following program, I start with an array of 60 bytes. This array contains the Unicode code points of 15 characters (see the comment at the end of each line), with 4 bytes per code point. The main program then converts (or "decodes") this byte array into the corresponding sequence of 15 Unicode characters.

First, I create a Charset instance for the UTF-32 encoding using the static forName() method. Then I invoke the decode() instance method, which takes in a ByteBuffer (that simply wraps the byte[] object) and returns a CharBuffer containing the decoded characters, which is then converted to a Java string using its toString() method and finally printed, producing the following output:


Part 1:
©Crêpes à gogo🍴


import java.nio.charset.Charset;
import java.nio.CharBuffer;
import java.nio.ByteBuffer;

class UTF32 {

    static byte[] bytes  = { 
      (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0xA9, // copyright
      (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x43, // C
      (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x72, // r
      (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0xEA, // e with ^
      (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x70, // p
      (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x65, // e
      (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x73, // s
      (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x20, // " "
      (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0xE0, // a with `
      (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x20, // " " 
      (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x67, // g
      (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x6F, // o
      (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x67, // g
      (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x6F, // o
      (byte) 0x00, (byte) 0x01, (byte) 0xF3, (byte) 0x74  // knife & fork
    };

    // this method (for fun) uses a classic trick to swap two array 
    // elements without using a temporary variable; note that it only 
    // works if i != j 
    static void swapByteArrayElements(byte[] a, int i, int j) {
        a[i] = (byte) (a[i] + a[j]);
        a[j] = (byte) (a[i] - a[j]);
        a[i] = (byte) (a[i] - a[j]);
    }// swapByteArrayElements method

    public static void main(String[] args) throws Exception {
        Charset utf32charset = Charset.forName("UTF-32");

        // part 1
        String str = 
             utf32charset.decode(ByteBuffer.wrap(bytes)).toString();
        System.out.println("Part 1:\n" + str);

        // part 2
        bytes[2] = (byte) 0xFE;
        bytes[3] = (byte) 0xFF;
        str = utf32charset.decode(ByteBuffer.wrap(bytes)).toString();
        System.out.println("Part 2:\n" + str);
 
        // part 3
        bytes[0] = (byte) 0xFF;
        bytes[1] = (byte) 0xFE;
        bytes[2] = (byte) 0x00;
        bytes[3] = (byte) 0x00;
        str = utf32charset.decode(ByteBuffer.wrap(bytes)).toString();
        System.out.println("Part 3:\n" + str);
 
        // part 4
        for(int i=4; i<bytes.length-3; i+=4) {
            swapByteArrayElements(bytes,i,i+3);
            swapByteArrayElements(bytes,i+1,i+2);
        }
        str = utf32charset.decode(ByteBuffer.wrap(bytes)).toString();
        System.out.println("Part 4:\n" + str);
    }// main method   
}// UTF32 class

Before discussing the rest of this program, it is important to note that Java internally encodes each Character/char in UTF-16 (which will be discussed in the next part of this tutorial) and each String object as an array of such characters.

Therefore, UTF-16 is the "native" representation of strings in Java. For this reason, when reading text in other representations, Java has to "decode" these external strings that use another encoding into its internal representation.

Conversely, when converting a Java string to another encoding, Java must "encode" the string using an external encoding.

In our case, the external encoding is UTF-32 because I represented the "external" or "non-native" string as a byte array in which four consecutive bytes encode the code point of a single Unicode character.

Back to Part 1 of the program, the call utf32charset.decode(ByteBuffer.wrap(bytes)) converts the ByteBuffer (passed as argument to the decode method) to a CharBuffer (returned by the decode method) using the character encoding UTF-32 that the utf32charset object represents.

This is where the conversion or change of encoding takes place. Again, the returned CharBuffer object contains a (native) array of Java chars, each of which holds a UTF-16 code unit. This array can now be converted to a string (with a call to the toString() method) for printing purposes.

This concludes our discussion of Part 1 of this program.

Before we discuss the last three parts of this program, recall that strings in the UTF-32 encoding scheme may or may not begin with a "byte order mark" or BOM. When there is no BOM at the beginning of the input ByteBuffer as in Part 1 above, the byte order of the UTF-32 encoding scheme is assumed to be big-endian by default, according to which the most significant byte of each four-byte integer appears first.

In other words, the four bytes 0x00 0x00 0x00 0xA9 represent the integer 0x000000A9. With the little-endian scheme, these four bytes would represent the integer 0xA9000000 instead.
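
This difference is easy to observe in Java with a ByteBuffer, whose byte order can be set explicitly; the following stand-alone sketch is independent of the program above:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch: the same four bytes read as a 32-bit integer under both byte orders.
class ByteOrderDemo {
    public static void main(String[] args) {
        byte[] bytes = { 0x00, 0x00, 0x00, (byte) 0xA9 };
        int be = ByteBuffer.wrap(bytes).order(ByteOrder.BIG_ENDIAN).getInt();
        int le = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getInt();
        System.out.printf("big-endian:    0x%08X%n", be); // 0x000000A9
        System.out.printf("little-endian: 0x%08X%n", le); // 0xA9000000
    }
}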

Part 1 of the program above relies on this default (big-endian) byte ordering. In Part 2, I change the first four bytes of the byte array to the values:

0x00 0x00 0xFE 0xFF

This is the (big-endian) UTF-32 encoding of the BOM, namely the special character U+FEFF that denotes the big-endian byte order. Since this character replaces the first character, the copyright symbol is erased. Furthermore, since the BOM, when present, is not part of the string itself (it is like metadata used to describe the string, namely its byte order), the output for Part 2 is as follows:


Part 2:
Crêpes à gogo🍴

In this output, the characters are displayed correctly because the BOM explicitly states that the byte order is big-endian and the following 4-byte integers are indeed listed in big-endian order.

In Part 3, I first replace the first four bytes with the values:

0xFF 0xFE 0x00 0x00

Note that these values are simply the same bytes as the BOM above but in reverse order. In fact, they represent the byte order mark for the little-endian scheme. This means that these first four bytes are not going to be part of the decoded string. However, their presence will force the decoder to read each of the following groups of 4 bytes in little-endian order.

For example, the four bytes of the first actual character in the byte array, namely:
0x00 0x00 0x00 0x43

representing the character 'C', will effectively be reordered to:

0x43 0x00 0x00 0x00

and then decoded as UTF-32. However, these reordered four bytes now encode an invalid code point, namely 0x43000000, since the largest possible code point is 0x0010FFFF. This will happen for all 14 characters in the byte array (not counting the initial BOM), yielding the following garbage output:


Part 3:
�

in which the � character is used by default to replace characters with an invalid code point.

To fix this, in Part 4, I keep the little-endian byte order mark but reorder the following bytes in groups of 4, using the for loop and the swap method shown above. After this code is executed, the byte array looks like this:

(byte) 0xFF, (byte) 0xFE, (byte) 0x00, (byte) 0x00, // used to be copyright
(byte) 0x43, (byte) 0x00, (byte) 0x00, (byte) 0x00, // used to be C
(byte) 0x72, (byte) 0x00, (byte) 0x00, (byte) 0x00, // used to be r
(byte) 0xEA, (byte) 0x00, (byte) 0x00, (byte) 0x00, // used to be e with ^
etc.

With this preliminary swap completed, the last output produced by this program is thus:


Part 4:
Crêpes à gogo🍴

Note that the copyright symbol has been replaced by the BOM (which is not part of the string and thus not displayed) and that the following characters are correct because now the BOM explicitly states that the byte order is little-endian and the following 4-byte integers are indeed listed in little-endian order.

To summarize:
  • Part 1 represents the "external" string (that is, the byte array) encoded in UTF-32 with no BOM and thus a big-endian default byte order.
  • Part 2 represents the "external" string (that is, the byte array) encoded in UTF-32 with a BOM that explicitly imposes the big-endian byte order on the rest of the bytes.
  • Part 3 represents the "external" string (that is, the byte array) encoded in UTF-32 with a BOM that explicitly imposes the little-endian byte order on the rest of the bytes but with the following bytes listed in big-endian order.
  • Part 4 represents the "external" string (that is, the byte array) encoded in UTF-32 with a BOM that explicitly imposes the little-endian byte order on the rest of the bytes, which is the correct order.
Therefore, only Part 3 of the code produces unwanted output.

Tuesday, May 24, 2016

Unicode Tutorial - Part 3: Character Encoding Form versus Character Encoding Scheme

Let's start by reviewing the general framework for encoding schemes described in this post:


bit pattern (one or more code units)\( \longleftrightarrow \)code point\( \longleftrightarrow \)(abstract) character


in which the rightmost arrow indicates the mapping between the repertoire (set of abstract characters) and the code space (set of numbers or code points) of the encoding, as discussed in Part 2 of this tutorial. The leftmost mapping, on the other hand, represents the fact that each code point is represented as a sequence of bits (made of one or more code units) in the computer's memory.

In the previous post, we discussed the mapping from characters to code points. In this post, we focus on the leftmost mapping, that is, the way bit patterns are assigned to code points.

Since Unicode is an international standard for data interchange over computer networks, the code points, which are numbers, must be encoded as fixed-size bit patterns (e.g., octets or 8-bit bytes, 32-bit words, etc.) for computers to process, transmit, and store. These fixed-size bit patterns are called code units and this fixed number of bits is called the code unit size (see this post for a refresher).

Each character encoding model is therefore characterized by 1) its code unit size and 2) the way each code point is encoded into (or mapped to) a sequence of one or more code units. In the Unicode standard, this mapping is called a "character encoding form" (the "character encoding scheme" is different and will be defined below).

Definition 1: A character encoding form (or CEF) is a mapping from the code space (i.e., the set of code points) to the set of sequences of code units.

In other words, in the Unicode standard, a CEF specifies how each integer that is a valid code point is represented as a bit string, which is the concatenation of one or more fixed-size code units.

The Unicode standard defines three distinct CEFs that use 8-bit, 16-bit, and 32-bit code units and are named UTF-8, UTF-16, and UTF-32, respectively. The UTF acronym is a carryover from earlier terminology and stands for Unicode (or UCS) Transformation Format. Each one of UTF-8, UTF-16, and UTF-32 will be described in its own post in upcoming parts of this tutorial.

Note that each one of these three CEFs is a legitimate encoding of all Unicode characters. They differ not only in the bit strings that they assign to the code points and in their code unit size, but also in the number of code units in the sequences that are assigned to the code point. This leads us to define two kinds of CEFs.

Definition 2: A CEF has fixed width if and only if it maps all code points in the code space to (code unit) sequences of the same length. Otherwise, the CEF is said to have variable width.

For example, UTF-32 has fixed width because it maps all code points to a single 32-bit code unit. Similarly ISO/IEC 8859-1 has fixed width because it maps all code points in its repertoire to a single 8-bit code unit.

In contrast, UTF-16 has variable width because it maps some code points to one 16-bit code unit and others to a sequence of two 16-bit code units. Similarly, UTF-8 has variable width because the sequences that are mapped to code points may contain 1, 2, 3, or 4 8-bit code units.

Character encoding forms define a one-to-one mapping between code points and code unit sequences. As such, they are sufficient when dealing with characters stored in memory of any given computer. After all, a code point is just an integer and a code unit sequence is just a bit string. Any computer can handle those data types easily.

However, things are a bit more complicated when textual data must be stored in a file on disk or sent to another computer over a network. Since the basic unit of storage or communication is the octet or 8-bit byte (hereafter referred to simply as a byte), integers and bit strings that do not fit in one byte must be handled carefully. In particular, these bit strings must be serialized, that is, broken down into a sequence of bytes.

Unfortunately, different computers (more specifically, processor architectures) may serialize multi-byte bit strings in different ways. For example, Little Endian architectures serialize bit strings from least significant to most significant (i.e., they put the "little end" first), while Big Endian architectures serialize bit strings from most significant to least significant (i.e., they put the "big end" first).

For example, the (made up) 3-byte sequence 0x2468AC would be serialized as  24 68 AC in Big Endian and as AC 68 24 in Little Endian.

This difference does not matter for operations on code unit sequences that take place within the memory of a given computer, since each computer consistently uses the same byte order for all of its internal operations. 

However, the byte order becomes important when encoded text (i.e., the code unit sequences of its characters) is communicated through data streams that connect to file storage or network communication links. To avoid corruption of the encoded text, the computers involved must agree on the byte serialization scheme.

This is where character encoding schemes come into play.

Definition 3: A character encoding scheme (or CES) specifies both a CEF and the way that the code unit sequences of the CEF are serialized into bytes. 

The Unicode standard specifies 3 character encoding forms (as discussed above) and 7 character encoding schemes, as follows:
  • UTF-32BE: UTF-32 plus Big Endian serialization
  • UTF-32LE: UTF-32 plus Little Endian serialization
  • UTF-32: Uses Big Endian by default (or a byte order mark, see below). In this case, the CEF and the CES have the same name, which must be disambiguated by the context.
  • UTF-16BE: UTF-16 plus Big Endian serialization
  • UTF-16LE: UTF-16 plus Little Endian serialization
  • UTF-16: Similar case to UTF-32 in the third bullet point above
  • UTF-8: Since each code point maps to a sequence of code units that are bytes already, there is no need to serialize the code units. The bytes are serialized in the same order as they appear in the code unit sequence.
Let's now consider some examples, keeping in mind that two hex characters (as used in the following examples) together correspond to one byte.

Example 1: In the UTF-8 scheme, the UTF-8 (CEF) code unit sequence
4D D0 B0 E4 BA 8C F0 90 8C 82 
is directly serialized as: 
4D D0 B0 E4 BA 8C F0 90 8C 82 

Example 2: In the UTF-16BE scheme, the UTF-16 code unit sequence
004D 0430 4E8C D800 DF02 
is serialized as: 
00 4D 04 30 4E 8C D8 00 DF 02 

Example 3: In the UTF-16LE scheme, the UTF-16 code unit sequence
004D 0430 4E8C D800 DF02 
is serialized as: 
4D 00 30 04 8C 4E 00 D8 02 DF 

Example 4: In the UTF-32BE scheme, the UTF-32 code unit sequence
0000004D 00000430 00004E8C 00010302 
is serialized as: 
00 00 00 4D 00 00 04 30 00 00 4E 8C 00 01 03 02

Example 5: In the UTF-32LE scheme, the UTF-32 code unit sequence
0000004D 00000430 00004E8C 00010302 
is serialized as: 
4D 00 00 00 30 04 00 00 8C 4E 00 00 02 03 01 00
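
Examples 2 and 3 can be reproduced in Java by writing the code units into a ByteBuffer whose byte order is set explicitly. The following sketch (class name mine) prints both serializations:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch: serialize the UTF-16 code unit sequence 004D 0430 4E8C D800 DF02
// under both byte orders (cf. Examples 2 and 3 above).
class SerializationDemo {
    public static void main(String[] args) {
        char[] codeUnits = { 0x004D, 0x0430, 0x4E8C, 0xD800, 0xDF02 };
        for (ByteOrder order : new ByteOrder[] { ByteOrder.BIG_ENDIAN, ByteOrder.LITTLE_ENDIAN }) {
            ByteBuffer buffer = ByteBuffer.allocate(2 * codeUnits.length).order(order);
            for (char unit : codeUnits) buffer.putChar(unit);  // putChar honors the buffer's byte order
            StringBuilder sb = new StringBuilder(order + ":");
            for (byte b : buffer.array()) sb.append(String.format(" %02X", b));
            System.out.println(sb);
        }
    }
}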

There is one more feature of encoding schemes that I'd like to mention. How do Unicode text processors determine the endianness of a piece of Unicode text when the byte order is not specified, as in the case of the UTF-16 and UTF-32 encoding schemes?

Definition 4: The Unicode standard defines a special character called the byte order mark or BOM that appears at the head of a data stream in order to unambiguously signal the byte order of the code units that follow.

For example, in the UTF-16 encoding scheme, an initial byte sequence "FE FF" is interpreted as a byte order mark indicating Big Endian order. In contrast, the initial byte sequence "FF FE" indicates Little Endian order. 

Note that the BOM is not part of the contents of the file or the data in the packet. U+FEFF and U+FFFE were chosen for this role because they are highly unlikely to represent encoded text. In fact, U+FFFE is a noncharacter and U+FEFF is the format character named ZERO WIDTH NO-BREAK SPACE (see Part 2 of this tutorial).

Let's conclude with an example involving the BOM.

Example 6: Consider the UTF-16 code unit sequence: D800 DF02 
In UTF-16BE, it is serialized as: D8 00 DF 02
In UTF-16LE, it is serialized as: 00 D8 02 DF 
But in UTF-16, it could be serialized as:
  • FE FF D8 00 DF 02 (Big Endian with BOM), 
  • FF FE 00 D8 02 DF (Little Endian with BOM), or
  • D8 00 DF 02 (Big Endian, no BOM, i.e., the default). 
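
As an aside, Java's "UTF-16" charset follows exactly this convention when decoding: it honors an initial BOM and defaults to big-endian when none is present. A small sketch (illustrative only):

import java.nio.charset.StandardCharsets;

// Sketch: the three serializations of Example 6 all decode to the same
// single character U+10302 under the UTF-16 charset.
class BomDemo {
    public static void main(String[] args) {
        byte[][] inputs = {
            { (byte) 0xFE, (byte) 0xFF, (byte) 0xD8, 0x00, (byte) 0xDF, 0x02 }, // BE with BOM
            { (byte) 0xFF, (byte) 0xFE, 0x00, (byte) 0xD8, 0x02, (byte) 0xDF }, // LE with BOM
            { (byte) 0xD8, 0x00, (byte) 0xDF, 0x02 }                            // no BOM: BE default
        };
        for (byte[] bytes : inputs) {
            String s = new String(bytes, StandardCharsets.UTF_16);
            System.out.printf("U+%04X%n", s.codePointAt(0));  // U+10302 each time
        }
    }
}
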
To summarize this post, the mapping from code points to bit strings shown above as:

bit pattern (one or more code units)\( \longleftrightarrow \)code point

in Unicode is really a two-step mapping as follows:

byte sequence\( \stackrel{byte\ ordering}{\longleftrightarrow} \)code unit sequence\(\stackrel{CEF}{\longleftrightarrow} \)code point

Having covered the 7 character encoding schemes in sufficient detail, we will turn our attention back to the 3 character encoding forms in the next three parts of this tutorial.

This blog post is based on Sections 2.5, 2.6, 3.9 and 3.10 of The Unicode® Standard Version 8.0 – Core Specification (August 2015) and Unicode Technical Report #17 (2008).

Sunday, May 22, 2016

Unicode Tutorial - Part 2: Code space


Let's start by reviewing the general framework for encoding schemes described in this post:


bit pattern (one or more code units)\( \longleftrightarrow \)code point\( \longleftrightarrow \)(abstract) character


in which the rightmost arrow indicates the mapping between the repertoire (set of abstract characters) and the code space (set of numbers or code points) of the encoding. This mapping is one-to-one: each abstract character is assigned a unique code point and each code point is assigned to exactly one character. The leftmost mapping, on the other hand, represents the fact that each code point is represented as a sequence of bits (made of one or more code units) in the computer's memory.

In this post, we focus on the rightmost mapping, that is, the way code points are assigned to characters.

In the Unicode standard, the code space contains 1,114,112 numbers, namely the integers from 0 to 1,114,111. Therefore, the cardinality of the character set of Unicode is also 1,114,112. To make the code points a bit shorter, we will write them using the hexadecimal notation. So, the code points in Unicode range from 0x000000 to 0x10FFFF, where '0x' is the conventional prefix to indicate that the following string is a number represented in hex notation.

Actually, the conventional prefix for Unicode code points is 'U+'. Therefore, each character in the Unicode repertoire can be identified by its code point, denoted U+xxxxxx, where each 'x' is a hex digit: 0, 1, 2, ... , 9, A, ... , E or F, keeping in mind that the two leftmost (most significant) hex digits can only be 00, 01, 02, ... , 09, 0A, ... , 0F, or 10.

When the leading digits in this group of two are 0s, they are usually omitted. For example, the code point for the upper-case A, namely U+000041, is typically written as U+0041. The reason why the remaining two leading 0s in U+0041 are not dropped (to yield U+41, which would be even more concise) is yet another convention. Since the 65,536 most common characters worldwide are grouped together at the beginning of the code space, their code points (from 0 to 65,535) are conventionally represented by their 4 least significant hex digits, from U+0000 to U+FFFF.
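
As a small aside, Java's format specifier %04X happens to match this convention, since it pads to at least four hex digits but uses more when needed; here is a trivial sketch:

// Sketch: print code points with the conventional U+ prefix and at least
// four hex digits (more for supplementary characters).
class CodePointNotation {
    public static void main(String[] args) {
        int[] codePoints = { 0x41, 0x03B1, 0x1F47D };
        for (int cp : codePoints) {
            System.out.println(String.format("U+%04X", cp));  // U+0041, U+03B1, U+1F47D
        }
    }
}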

In addition to a unique code point (a number), each character in the Unicode repertoire is assigned a unique official Unicode name. For example, the Unicode name for U+0041 is "LATIN CAPITAL LETTER A". Another example is 👽, with code point U+1F47D and name "EXTRATERRESTRIAL ALIEN". Here is a complete list of code points and names, as of Version 8 (August 2015). Furthermore, this site gives many details on each character (note that the code point of the character is part of the URL).

The Unicode standard lists 7 types of code points (see this list of Unicode character categories); a short Java sketch after the list shows how to query these categories programmatically:
  1. graphic code points are for characters that are meant to be visible when displayed, that is, those that are associated with at least one glyph. There are 6 general categories of graphic characters, namely:
    • letters (L category)
    • marks (M), e.g., accents, tilde, etc.
    • numbers (N)
    • punctuation (P)
    • symbols (S), e.g., mathematical symbols such as +, currency symbols such as $
    • spaces (Zs), e.g., whitespace character, non-breaking space, etc.
  2. format code points are for invisible characters that affect neighboring characters, e.g., line breaks, paragraph breaks, left-to-right and right-to-left marks (used to set the way adjacent characters are grouped with respect to text direction), etc.
  3. control code points are for characters that are invisible and take no space. They are not used by Unicode but rather are meaningful to and interpreted by other standards, protocols, programming languages, etc., e.g., the null character that is interpreted as the string terminator in the C programming language, or the carriage return, line feed, and tabulation characters.
  4. private-use code points are not assigned any characters by the Unicode standard because they are meant for third parties to define their own characters while guaranteeing that they will not conflict with any Unicode code point assignments. 
  5. surrogate code points are not assigned to any abstract characters because each one of them is used as part of a pair of code units in the UTF-16 encoding form, to be described in a later part of this tutorial. None of these 2,048 code points may be interchanged by itself. Only surrogate pairs have meaning in the standard.
  6. noncharacter code points are reserved for internal use only and are not assigned any characters.
  7. reserved code points are not (yet) assigned to abstract characters but they are (unlike surrogate code points) assignable at a later time, when needed. These code points embody the open nature of the Unicode repertoire.
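
As mentioned above, here is a minimal Java sketch that queries the general category of a few code points using the built-in Character.getType() method; the returned integer constants (e.g., Character.UPPERCASE_LETTER, Character.SURROGATE) correspond to the Unicode general categories:

// Sketch: query the general category of a few code points with the
// built-in Character.getType() method.
class CategoryDemo {
    public static void main(String[] args) {
        int[] codePoints = { 0x41, 0x2B, 0x20, 0xD800, 0xE000 };
        for (int cp : codePoints) {
            int type = Character.getType(cp);
            System.out.printf("U+%04X -> %d%n", cp, type);
        }
        // For example, Character.getType(0x41) == Character.UPPERCASE_LETTER
        // and Character.getType(0xD800) == Character.SURROGATE.
    }
}
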
In Version 8.0 of the Unicode standard, the total number of graphic characters is 120,520. Adding 152 format characters, 65 control characters, and 137,468 private-use characters yields a total of 258,205 assigned characters. After adding the 2,048 surrogate code points and the 66 noncharacters, a total of 260,319 code points are currently being used. The standard calls these designated code points.

This leaves 853,793 unused code points, out of a total of 1,114,112. In other words, 76.63% of the code space is left unused to accommodate future needs.

To get a sense of how slowly the set of designated code points has grown, here is how this number has evolved across recent versions of the standard:

Version                       V5.2      V6.0      V6.1      V6.2      V6.3      V7.0      V8.0
Year                          2009      2010      2012      2012      2013      2014      2015
# of designated code points   246,943   249,031   249,763   249,764   249,769   252,603   260,319

And here it is as a line graph since V1.0, released in 1991:

[Line graph: number of designated code points per Unicode version, 1991-2015.]

Note the linear scale on the vertical axis and recall that the size of the Unicode code space is 1,114,112.

Before I wrap up this post, let me mention another way to look at the code space.

So far, we have looked at the code points by type. However, not all of the code points of a given type are located contiguously in the code space. This is because code points are grouped together in the code space based on their usage, rather than by type. With some exceptions (e.g., all of the ASCII characters are grouped at the bottom of the code space, even though they may be used in several languages), characters that are often used together are near each other in the code space.

In particular, all of the code points for a given language are typically grouped together. So all of the letters of the Latin alphabet are found in one contiguous group and all of the letters of the Greek alphabet are found in another contiguous group. But these two groups of letters are separated by many non-letter characters. For example, the left and right curly brackets ('{' and '}'), the copyright symbol, and the tilde ('~') all have code points that are located between the last letter of the Latin alphabet and the first letter of the Greek alphabet.

In short, code points are organized in groups (called blocks in the Unicode standard), which can be visualized in these charts produced by the Unicode Consortium.

Furthermore, each group is located within one plane of the code space, where a Unicode plane is a contiguous group of 65,536 code points. Since the code space ranges from U+000000 to U+10FFFF, it can naturally be broken down into 17 planes indexed from 0x00 to 0x10 (the two most significant or leftmost hex digits in a code point), with each plane containing 65,536 code points ranging from 0x0000 to 0xFFFF (the four least significant or rightmost hex digits in a code point).

For example, the character U+023456 is located at position 0x3456 in plane number 0x02.
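
In code, the plane index and the position within the plane are just the upper and lower 16 bits of the code point; here is a trivial Java sketch:

// Sketch: split a code point into its plane index (top hex digits)
// and its position within that plane (bottom four hex digits).
class PlaneDemo {
    public static void main(String[] args) {
        int cp = 0x23456;
        int plane = cp >>> 16;       // 0x02
        int position = cp & 0xFFFF;  // 0x3456
        System.out.printf("plane 0x%02X, position 0x%04X%n", plane, position);
    }
}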

Plane 0 is called the Basic Multilingual Plane or BMP. It contains the most common characters for all of the modern scripts in the world. Therefore, the vast majority of all Unicode characters used in all textual data is located in the BMP.

Plane 1 is called the Supplementary Multilingual Plane or SMP. It contains the characters for scripts or symbols that either did not fit into the BMP or are seldom used, including many historic scripts.

Plane 2 is called the Supplementary Ideographic Plane or SIP. It contains those CJK (Chinese, Japanese, and Korean) characters that could not fit in the BMP, such as the CJK Unified Ideographs Extension B.

Planes 3 through 13 are completely unused in Version 8.0. They are reserved for future use in support of the universal (and thus open) nature of Unicode.

Plane 14 is called the Supplementary Special-purpose Plane or SSP. It contains only two small blocks of non-graphic characters.

Planes 15 and 16 are called the Supplementary Private Use Areas A and B (respectively). They contain private-use characters, as described above.

Let me close by referring you to a visualization of the Unicode code space at one pixel per code point posted at Ryan Flynn's blog.


This blog post is based on Sections 2.4, 2.5, 2.6, 2.8, 2.9 and D.1 of The Unicode® Standard Version 8.0 – Core Specification (August 2015).