Monday, May 30, 2016

Unicode Tutorial - Part 5: UTF-16

In Part 3 of this tutorial, we distinguished Unicode encoding forms (namely, UTF-8, UTF-16, and UTF-32) from Unicode encoding schemes (namely, UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE).

This post deals exclusively with UTF-16, both the encoding form and the encoding scheme.

Recall that UTF-16 is a variable-width encoding: while every single code unit is 16 bits long (hence the name UTF-16), some code points are encoded as a single code unit while others are encoded as a sequence of two code units.

More precisely, each valid code point that fits within 16 bits, namely each one in the BMP, simply gets encoded as a 16-bit code unit that is identical to the binary representation of numerical code point itself. For example, the code point U+00AB12 gets encoded by the single code unit 0xAB12.

All supplementary characters (i.e., those in planes 1 through 16) have code points that do not fit within 16 bits. Each supplementary character gets encoded as a sequence of two 16-bit code units called a surrogate pair.

Each surrogate pair comprises a high (or leading) surrogate and a low (or trailing) surrogate. Since each surrogate is a 16-bit code unit, it must belong to the range of the BMP. Indeed, two specific sub-ranges of the BMP were reserved to encode surrogates, namely:
  • the range 0xD800 to 0xDBFF is reserved for high surrogates, and
  • the range 0xDC00 to 0xDFFF is reserved for low surrogates.

It is unfortunate that all of the values in the high-surrogate range are smaller than all of the values in the low-surrogate range. The "high" and "low" prefixes refer not to the values of the surrogates but to their position in the pair, namely <high surrogate, low surrogate>, where "high" and "low" mean "most significant" and "least significant", respectively.

Note that each one of the surrogate ranges contains exactly \(2^{10} =1,024\) code points. Therefore, the set of surrogate pairs can encode exactly \(2^{10}\cdot 2^{10} = 2^{20} = 1,048,576\) code points, which is precisely the number of supplementary characters (16 planes of 65,536 code points each together add up to \(2^4\cdot 2^{16} = 2^{20}\) code points).

Because Unicode is designed to avoid overlap between the sequences of code units of different characters, the code units between 0xD800 and 0xDFFF (i.e., those reserved for either half of a surrogate pair) are NOT allowed, on their own, to represent any characters.

As a result, since 2,048 code points are reserved for surrogate values, the actual number of available code points in the range 0x000000-0x10FFF is not 1,114,112 but rather 1,114,112 - 2,048 = 1,112,064. Of course, since all Unicode encodings cover exactly the same set of characters, the range of code points between 0xD800 and 0xDFFF are also forbidden (on their own) in UTF-8 and UTF-32.

Because of the non-overlap constraint imposed on distinct code unit sequences, a well-formed UTF-16-encoded text may not start with a low surrogate. In fact, if any low surrogate appears in UTF-16-encoded text, it must be preceded by a high surrogate. In other words, it may not be preceded by another low surrogate, by any non-surrogate code unit, nor be the first character in the text.

Similarly, in well-formed UTF-16-encoded text, any high surrogate must be followed by a low surrogate. It may not appear at the end of the encoded text nor be followed by another high surrogate or a non-surrogate.

For detailed discussions of the UTF-16 encoding and decoding algorithms, see the following two parts of this tutorial.

Note that our discussion so far has been limited to the UTF-16 encoding form that assigns a sequence of one or more 16-bit code units to each code point.

We now turn our attention to the UTF-16 encoding schemes, which produce byte sequences from code unit sequences (see a detailed discussion of this distinction).

Each UTF-16 encoding scheme specifies a byte order for the two bytes within each code unit, while the order of the code units themselves within their sequence is not affected.

The following examples illustrate the outcome of the full encoding process using the three UTF-16, UTF-16BE, and UTF-16LE encoding schemes.

code point UTF-16
code unit
sequence
UTF-16BE
hex bytes
UTF-16LE
hex bytes
UTF-16
hex bytes
U+0041 0x0041 00 41 41 00 00 41,
FE FF 00 41, or
FF FE 41 00
U+D81A Invalid code point (falls in the high surrogate range)
U+DE83 Invalid code point (falls in the low surrogate range)
U+E012 0xE012 E0 12 12 E0 E0 12,
FE FF E0 12, or
FF FE 12 E0
U+010302 0xD800 0xDF02 D8 00 DF 02 00 D8 02 DF D8 00 DF 02,
FE FF D8 00 DF 02, or
FF FE 00 D8 02 DF
U+10FDCB 0xDBFF 0xDDCB DB FF DD CB FF DD CB DD DB FF DD CB,
FE FF DB FF DD CB, or
FF FE FF DF CB DD
U+10FFFF 0xDBFF 0xDFFF DB FF DF FF FF DB FF DF DB FF DF FF,
FE FF DB FF DF FF, or
FF FE FF DB FF DF
U+12345A Invalid code point (larger than 0x10FFFF)

If you need to review the byte order mark (BOM), namely the byte sequences 0xFEFF and 0xFFFE used in the UTF-16 encoding scheme above, refer to this earlier part of the tutorial.

For complete details on the algorithm used to encode a code point into a sequence of code units using the UTF-16 encoding form, refer to the next part of this tutorial.


This blog post is based on Sections 2.5, 3.8, 3.9 and 5.4 of The Unicode® Standard Version 8.0 – Core Specification (August 2015).

Saturday, May 28, 2016

Unicode Tutorial - Part 6: Visualization of the UTF-16 Encoding Algorithm


UTF-16 Encoding Algorithm


Below is an interactive visualization of the encoding algorithm that takes as input a Unicode code point in hex notation and converts it to a byte string according to the UTF-16, UTF-16BE, and UTF-16LE encoding schemes.


Step 1: Enter a code point between 0 and 10FFFF in hex notation
0x

While the visualization above works for all code points, it is interesting to try specific code points to better understand the encoding of supplementary characters, that is, those that are outside the Basic Multilingual Plane (BMP).

For example, the smallest supplementary character has code point 0x10000. which reduces to 0x0 after the subtraction in Step 2.  Later in Step 4, inserting the bits 110110 in front of the first half (of all zeros) yields the value 0xD800, which is the smallest possible value for a high surrogate.

Similarly, inserting the bits 110111 in front of the second half (of all zeros) yields the value 0xDC00, which is the smallest possible value for a low surrogate.

Therefore, in UTF-16 encoding, the smallest supplementary character gets assigned the smallest possible (component-wise) surrogate pair.

At the other end of the spectrum, the character with the largest possible code point, namely 0x10FFFF, gets assigned the largest possible (component-wise) surrogate pair, namely:

 < 0xDBFF, 0xDFFF >

More generally, there is a one-to-one mapping between the set of supplementary character code points and the set of all surrogate pairs.

In the next post, we will discuss the inverse mapping, that is, from byte sequences to Unicode character sequences.



The algorithm visualized in this post is described in plain text in this RFC.

Friday, May 27, 2016

Unicode Tutorial - Part 4: UTF-32, with Java code samples

In Part 3 of this tutorial, we distinguished Unicode encoding forms (namely, UTF-8, UTF-16, and UTF-32) from Unicode encoding schemes (namely, UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE).

This post deals exclusively with UTF-32, both the encoding form and the encoding scheme.

Recall that the UTF-32 encoding form has fixed width because it maps each code point in the Unicode coded character set to a single 32-bit code unit. Since the number of Unicode characters is 1,114,112, each code point is a number between 0x000000 and 0x10FFFF and thus fits within 21 bits.

Since a single 32-bit integer can represent a code point with 11 bits to spare, UTF-32 is the simplest of all Unicode encoding forms: it encodes each code point value as a 32-bit integer with the 11 leading bits set to 0. For example, the code point 97 (in decimal), that is, 0x61,  is represented by the code unit:

00000000 00000000 00000000 01100001
                                                         
Note that any UTF-32 code unit whose value is greater than 0x0010FFFF is ill-formed (too large). Furthermore, because surrogate values used in UTF-16 (to be discussed in the next part of this tutorial) are not by themselves valid code points, UTF-32 code units in the range 0x00D800 through 0x00DFFF are also ill-formed.

The rest of this post will use the Java programming language to manipulate character encodings in general and UTF-32 more specifically.

The following Java program uses the availableCharsets() and defaultCharset() static methods of the Charset class to list all of the available character encodings as well as the default charset on the machine on which it is executed.

Note that the former method returns a SortedMap data structure. On lines 14-19, I loop through the key-value pairs in this structure to print the canonical name of the Charset (the key of the pair) followed by the list of aliases of the Charset (the value of the pair).

import java.nio.charset.Charset;
import java.util.Map;
import java.util.SortedMap;

class CharacterEncodings {
    public static void main(String[] args) {
        Charset defaultCharset = Charset.defaultCharset();
        String defaultCharsetName = defaultCharset.displayName();
        SortedMap<String,Charset> availableCharsets = Charset.availableCharsets();
        System.out.println("Your JVM supports the following " + 
                           availableCharsets.size() +
                           " character sets,\nwith the default" +
                           " one marked by asterisks:\n");
        for(Map.Entry<String,Charset> entry : availableCharsets.entrySet()) {
            if (defaultCharsetName.equalsIgnoreCase(entry.getValue().displayName())) {
                System.out.print(" ( * * * * * * * * * * * * * * * * * * * * ) ");
            }
            System.out.println(entry.getKey() + " " + entry.getValue().aliases());
        }
    }// main method   
}// CharacterEncodings class

The (truncated) output of this program is shown below. While the UTF-32 encoding is available on my system, it is not the one used by default (e.g., when writing text files). Instead, UTF-8 is the default, which is not surprising since it is a common encoding for text files, for HTML files, etc. In fact, UTF-8 is the dominant encoding on the web.

Your JVM supports the following 169 character sets,
with the default one marked by asterisks:

Big5 [csBig5]
...
TIS-620 [tis620.2533, tis620]
US-ASCII [cp367, ascii7, ISO646-US, 646, csASCII, us, iso_646.irv:1983, 
          ISO_646.irv:1991, IBM367, ASCII, default, ANSI_X3.4-1986, 
          ANSI_X3.4-1968, iso-ir-6]
UTF-16 [utf16, UnicodeBig, UTF_16, unicode]
UTF-16BE [X-UTF-16BE, UTF_16BE, ISO-10646-UCS-2, UnicodeBigUnmarked]
UTF-16LE [UnicodeLittleUnmarked, UTF_16LE, X-UTF-16LE]
UTF-32 [UTF32, UTF_32]
UTF-32BE [X-UTF-32BE, UTF_32BE]
UTF-32LE [X-UTF-32LE, UTF_32LE]
 ( * * * * * * * * * * * * * * * * * * * * ) UTF-8 [unicode-1-1-utf-8, UTF8]
windows-1250 [cp1250, cp5346]
...
x-windows-iso2022jp [windows-iso2022jp]


Note that the output of this program will most likely be different on your machine. For example, the default charset (indicated by asterisks above) is determined during the startup phase of the JVM and typically depends on the locale and the charset of your operating system.

In the following program, I start with an array of 60 bytes (lines 7 through 23). This array contains the Unicode code points of 15 characters (see the comment at the end of each line), with 4 bytes per code point. The main program then converts (or "decodes") this byte array into an array of 15 Unicode characters.

First, on line 35, I create a Charset instance for the UTF-32 encoding using the static forName() method. Then, on line 39,  I invoke the decode() instance method that takes in a ByteBuffer (that simply wraps the byte[] object) and returns a CharBuffer containing 15 Java characters, which is then converted to a Java string using its toString() method and finally printed, producing the following output:


Part 1:
©Crêpes à gogo🍴


import java.nio.charset.Charset;
import java.nio.CharBuffer;
import java.nio.ByteBuffer;

class UTF32 {

    static byte[] bytes  = { 
      (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0xA9, // copyright
      (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x43, // C
      (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x72, // r
      (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0xEA, // e with ^
      (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x70, // p
      (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x65, // e
      (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x73, // s
      (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x20, // " "
      (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0xE0, // a with `
      (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x20, // " " 
      (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x67, // g
      (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x6F, // o
      (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x67, // g
      (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x6F, // o
      (byte) 0x00, (byte) 0x01, (byte) 0xF3, (byte) 0x74  // knife & fork
    };

    // this method (for fun) uses a classic trick to swap two array 
    // elements without using a temporary variable; note that it only 
    // works if i != j 
    static void swapByteArrayElements(byte[] a, int i, int j) {
        a[i] = (byte) (a[i] + a[j]);
        a[j] = (byte) (a[i] - a[j]);
        a[i] = (byte) (a[i] - a[j]);
    }// swapByteArrayElements method

    public static void main(String[] args) throws Exception {
        Charset utf32charset = Charset.forName("UTF-32");

        // part 1
        String str = 
             utf32charset.decode(ByteBuffer.wrap(bytes)).toString();
        System.out.println("Part 1:\n" + str);

        // part 2
        bytes[2] = (byte) 0xFE;
        bytes[3] = (byte) 0xFF;
        str = utf32charset.decode(ByteBuffer.wrap(bytes)).toString();
        System.out.println("Part 2:\n" + str);
 
        // part 3
        bytes[0] = (byte) 0xFF;
        bytes[1] = (byte) 0xFE;
        bytes[2] = (byte) 0x00;
        bytes[3] = (byte) 0x00;
        str = utf32charset.decode(ByteBuffer.wrap(bytes)).toString();
        System.out.println("Part 3:\n" + str);
 
        // part 4
        for(int i=4; i<bytes.length-3; i+=4) {
            swapByteArrayElements(bytes,i,i+3);
            swapByteArrayElements(bytes,i+1,i+2);
        }
        str = utf32charset.decode(ByteBuffer.wrap(bytes)).toString();
        System.out.println("Part 4:\n" + str);
    }// main method   
}// UTF32class

Before discussing the rest of this program, it is important to note that Java internally encodes each Character/char in UTF-16 (which will be discussed in the next part of this tutorial) and each String object as an array of such characters.

Therefore, UTF-16 is the "native" representation of strings in Java. For this reason, when reading text in other representations, Java has to "decode" these external strings that use another encoding into its internal representation.

Conversely, when converting a Java string to another encoding, Java must "encode" the string using an external encoding.

In our case, the external encoding is UTF-32 because I represented the "external" or "non-native" string as a byte array in which four consecutive bytes encode the code point of a single Unicode character.

Back to line 39, utf32charset.decode(ByteBuffer.wrap(bytes)) converts the ByteBuffer (passed as argument to the decode method) to a CharBuffer (returned by the decode method) using the character encoding UTF-32 that the utf32charset object represents.

This is where the conversion or change of encoding takes place. Again, the returned CharBuffer object contains a (native) array of Java chars, each of which is a UTF-16 encoded character. This array can now be converted to a string (with a call to the toString()) method for printing purposes.

This concludes our discussion of Part 1 of this program.

Before we discuss the last three parts of this program, recall that strings in the UTF-32 encoding scheme may or may not begin with a "byte order mark" or BOM. When there is no BOM at the beginning of the input ByteBuffer as in Part 1 above, the byte order of the UTF-32 encoding scheme is assumed to be big-endian by default, according to which the most significant byte of each four-byte integer appears first.

In other words, the four bytes 0x00 0x00 0x00 0xA9 represent the integer 0x000000A9. With the little-endian scheme, these four bytes would represent the integer 0xA9000000 instead.

Part 1 of the program above relies on this default (big-endian) byte ordering. In Part 2 (lines 43 and 44), I replace the first four bytes of the ByteBuffer with the values:

0x00 0x00 0xFE 0xFF

This is the code point for the BOM, namely the special character that denotes the big-endian byte order. Since this character replaces the first character, the copyright symbol is erased. Furthermore, since the BOM, when present, is not part of the string itself (it is like meta-data used to describe the string, namely its byte order), the output for Part 2 is as follows:


Part 2:
Crêpes à gogo🍴

In this output, the characters are displayed correctly because the BOM explicitly states that the byte order is big-endian and the following 4-byte integers are indeed listed in big-endian order.

In Part 3, I first replace (lines 49 through 52) the first four bytes with the values:

0xFF 0xFE 0x00 0x00

Note that these values are simply the same bytes as the BOM above but in reverse order. In fact, they represent the byte order mark for the little-endian scheme. This means that these first four bytes are not going to be part of the decoded string. However, their presence will force the JVM to swap the order of all of the following groups of 4 bytes.

For example, the four bytes of the first actual character in the ByteArray, namely;
0x00 0x00 0x00 0x43

representing the character 'C' will be reordered to:

0x43 0x00 0x00 0x00

and then encoded in UTF-32. However, these reordered four bytes now encode an invalid code point, since the largest possible code point is 0x0010FFFF. This will happen for all 14 characters in the ByteArray (not counting the initial BOM), yielding the following garbage output:


Part 3:
�

in which the � character is used by default to replace characters with an invalid code point.

To fix this, in Part 4, I keep the little-endian byte order mark, but reorder the following bytes in groups of 4 (lines 57 through 60). After this code is executed, the ByteBuffer looks like this:

(byte) 0xFF, (byte) 0xFE, (byte) 0x00, (byte) 0x00, // used to be copyright
(byte) 0x43, (byte) 0x00, (byte) 0x00, (byte) 0x00, // used to be C
(byte) 0x72, (byte) 0x00, (byte) 0x00, (byte) 0x00, // used to be r
(byte) 0xEA, (byte) 0x00, (byte) 0x00, (byte) 0x00, // used to be e with ^
etc.

With this preliminary swap completed, the last output produced by this program is thus:


Part 4:
Crêpes à gogo🍴

Note that the copyright symbol has been replaced by the BOM (which is not part of the string and thus not displayed) and that the following characters are correct because now the BOM explicitly states that the byte order is little-endian and the following 4-byte integers are indeed listed in little-endian order.

To summarize:
  • Part 1 represents the "external" string (that is, the byte array) encoded in UTF-32 with no BOM and thus a big-endian default byte order.
  • Part 2 represents the "external" string (that is, the byte array) encoded in UTF-32 with a BOM that explicitly imposes the big-endian byte order on the rest of the bytes.
  • Part 3 represents the "external" string (that is, the byte array) encoded in UTF-32 with a BOM that explicitly imposes the little-endian byte order on the rest of the bytes but with the following bytes listed in big-endian order.
  • Part 4 represents the "external" string (that is, the byte array) encoded in UTF-32 with a BOM that explicitly imposes the little-endian byte order on the rest of the bytes, which is the correct order.
Therefore, only Part 3 of the code produces unwanted output.

Tuesday, May 24, 2016

Unicode Tutorial - Part 3: Character Encoding Form versus Character Encoding Scheme

Let's start by reviewing the general framework for encoding schemes described in this post:


bit pattern\( \longleftrightarrow \)code point\( \longleftrightarrow \)(abstract)
(one or more code units)character


in which the rightmost arrow indicates the mapping between the repertoire (set of abstract characters) and the code space (set of numbers or code points) of the encoding, as discussed in Part 2 of this tutorial. The leftmost mapping, on the other hand, represents the fact that each code point is represented as a sequence of bits (made of one or more code units) in the computer's memory.

In the previous post, we discussed the mapping from characters to code points. In this post, we focus on the leftmost mapping, that is, the way bit patterns are assigned to code points.

Since Unicode is an international standard for data interchange over computer networks, the code points, which are numbers, must be encoded as fixed-size bit patterns (e.g, octets or 8-bit bytes, 32-bit words, etc.) for computers to process, transmit and store. These fixed-size bit patterns are called code units and this fixed number of bits is called the code unit size (see this post for a refresher).

Each character encoding model is therefore characterized by 1) its code unit size and 2) the way each code point is encoded into (or mapped to) a sequence of one or more code units. In the Unicode standard, this mapping is called a "character encoding form" (the "character encoding scheme" is different and will be defined below).

Definition 1: A character encoding form (or CEF) is a mapping from the code space (i.e., the set of code points) to the set of sequences of code units.

In other words, in the Unicode standard, a CEF specifies how each integer that is a valid code point is represented as a bit string, which is the concatenation of one or more fixed-size code units.

The Unicode standard defines three distinct CEFs that use 8-bit, 16- bit, and 32-bit units and are named UTF-8, UTF-16, and UTF-32, respectively. The UTF acronym is a carryover from earlier terminology and stands for Unicode (or UCS) Transformation Format. Each one of UTF-8, UTF-16 and UTF-32 will be described in its own post in upcoming parts of this tutorial.

Note that each one of these three CEFs is a legitimate encoding of all Unicode characters. They differ not only in the bit strings that they assign to the code points and in their code unit size, but also in the number of code units in the sequences that are assigned to the code point. This leads us to define two kinds of CEFs.

Definition 2: A CEF has fixed width if and only if it maps all code points in the code space to (code unit) sequences of the same length. Otherwise, the CEF is said to have variable width.

For example, UTF-32 has fixed width because it maps all code points to a single 32-bit code unit. Similarly ISO/IEC 8859-1 has fixed width because it maps all code points in its repertoire to a single 8-bit code unit.

In contrast, UTF-16 has variable width because it maps some code points to one 16-bit code unit and others to a sequence of two 16-bit code units. Similarly, UTF-8 has variable width because the sequences that are mapped to code points may contain 1, 2, 3, or 4 8-bit code units.

Character encoding forms define a one-to-one mapping between code points and code unit sequences. As such, they are sufficient when dealing with characters stored in memory of any given computer. After all, a code point is just an integer and a code unit sequence is just a bit string. Any computer can handle those data types easily.

However, things are a bit more complicated when textual data must be stored in a file on disk or sent to another computer over a network. Since the basic unit of storage or communication is the octet or 8-bit byte (thereafter referred to simply as a byte), integers and bit strings that do not fit in one byte must be handled carefully. In particular, these bit strings must be serialized, that is, broken down into a sequence of bytes. 

Unfortunately, different computers (more specifically, processor architectures) may serialize multi-byte bit strings in different ways. For example, Little Endian architectures serialize bit strings from least significant to most significant (i.e., they put the "little end" first), while Big Endian architectures serialize bit strings from most significant to least significant (i.e., they put the "big end" first).

For example, the (made up) 3-byte sequence 0x2468AC would be serialized as  24 68 AC in Big Endian and as AC 68 24 in Little Endian.

This difference does not matter for operations on code unit sequences that take place within the memory of a given computer, since each computer consistently uses the same byte order for all of its internal operations. 

However, the byte order becomes important when encoded text (i.e., the code unit sequences of its characters) is communicated through data streams that connect to file storage or network communication links. To avoid corruption of the encoded text, the computers involved must agree on the byte serialization scheme.

This is where character encoding schemes come into play.

Definition 3: A character encoding scheme (or CES) specifies both a CEF and the way that the code unit sequences of the CEF are serialized into bytes. 

The Unicode standard specifies 3 character encoding forms (as discussed above) and 7 character encoding schemes, as follows:
  • UTF-32BE: UTF-32 plus Big Endian serialization
  • UTF-32LE: UTF-32 plus Little Endian serialization
  • UTF-32: Uses Big Endian by default (or a byte order mark, see below). In this case, the CEF and the CES have the same name, which must be disambiguated by the context.
  • UTF-16BE: UTF-16 plus Big Endian serialization
  • UTF-16LE: UTF-16 plus Little Endian serialization
  • UTF-16: Similar case to UTF-32 in the third bullet point above
  • UTF-8: Since each code point maps to a sequence of code units that are bytes already, there is no need to serialize the code units. The bytes are serialized in the same order as they appear in the code unit sequence.
Let's now consider some examples. keeping in mind that two hex characters (as used in the following examples) together correspond to one byte.

Example 1: In the UTF-8 scheme, the UTF-8 (CEF) code unit sequence
4D D0 B0 E4 BA 8C F0 90 8C 82 
is directly serialized as: 
4D D0 B0 E4 BA 8C F0 90 8C 82 

Example 2: In the UTF-16BE scheme, the UTF-16 code unit sequence
004D 0430 4E8C D800 DF02 
is serialized as: 
00 4D 04 30 4E 8C D8 00 DF 02 

Example 3: In the UTF-16LE scheme, the UTF-16 code unit sequence
004D 0430 4E8C D800 DF02 
is serialized as: 
4D 00 30 04 8C 4E 00 D8 02 DF 

Example 4: In the UTF-32BE scheme, the UTF-32 code unit sequence
0000004D 00000430 00004E8C 00010302 
is serialized as: 
00 00 00 4D 00 00 04 30 00 00 4E 8C 00 01 03 02

Example 5: In the UTF-32LE scheme, the UTF-32 code unit sequence
0000004D 00000430 00004E8C 00010302 
is serialized as: 
4D 00 00 00 30 04 00 00 8C 4E 00 00 02 03 01 00

There is one more feature of encoding schemes that I'd like to mention. How do Unicode text processors determine the endianness of a piece of Unicode text when the byte order is not specified, as in the case of the UTF-16 and UTF-32 encoding schemes?

Definition 4: The Unicode standard defines a special character called the byte order mark or BOM that appears at the head of a data stream in order to unambiguously signal the byte order of the code units that follow.

For example, in the UTF-16 encoding scheme, an initial byte sequence "FE FF" is interpreted as a byte order mark indicating Big Endian order. In contrast, the initial byte sequence "FF FE" indicates Little Endian order. 

Note that the BOM is not part of the contents of the file or the data in the packet. U+FEFF and U+FFFE were chosen for this role because they are highly unlikely to represent encoded text. In fact, U+FFFE is a noncharacter and U+FEFF is the format character named ZERO WIDTH NO-BREAK SPACE (see Part 2 of this tutorial).

Let's conclude with an example involving the BOM.

Example 6: Consider the UTF-16 code unit sequence: D800 DF02 
In UTF-16BE, it is serialized as: D8 00 DF 02
In UTF-16LE, it is serialized as: 00 D8 02 DF 
But in UTF-16, it could be serialized as:
  • FE FF D8 00 DF 02 (Big Endian with BOM), 
  • FF FE 00 D8 02 DF (Little Endian with BOM), or
  • D8 00 DF 02 (Big Endian, no BOM, i.e., the default). 
To summarize this post, the mapping from code points to bit strings show above as:

bit pattern\( \longleftrightarrow \)code point
(one or more code units)

in Unicode is really a two-step mapping as follows:

byte sequence\( \stackrel{byte\ ordering}{\longleftrightarrow} \)code unit sequence\(\stackrel{CEF}{\longleftrightarrow} \)code point

Having covered the 7 character encoding schemes in sufficient detail, we will turn our attention back to the 3 character encoding forms in the next three parts of this tutorial.

This blog post is based on Sections 2.5, 2.6, 3.9 and 3.10 of The Unicode® Standard Version 8.0 – Core Specification (August 2015) and Unicode Technical Report #17 (2008).

Sunday, May 22, 2016

Unicode Tutorial - Part 2: Code space


Let's start by reviewing the general framework for encoding schemes described in this post:


bit pattern\( \longleftrightarrow \)code point\( \longleftrightarrow \)(abstract)
(one or more code units)character


in which the rightmost arrow indicates the mapping between the repertoire (set of abstract characters) and the code space (set of numbers or code points) of the encoding. This mapping is one-to-one: each abstract character is assigned a unique code point and each code point is assigned to exactly one character. The leftmost mapping, on the other hand, represents the fact that each code point is represented as a sequence of bits (made of one or more code units) in the computer's memory.

In this post, we focus on the rightmost mapping, that is, the way code points are assigned to characters.

In the Unicode standard, the code space contains 1,114,112 numbers, namely the integers from 0 to 1,114,111. Therefore, the cardinality of the character set of Unicode is also 1,114,112. To make the code points a bit shorter, we will write them using the hexadecimal notation. So, the code points in Unicode range from 0x000000 to 0x10FFFF, where '0x' is the conventional prefix to indicate that the following string is a number represented in hex notation.

Actually, the conventional prefix for Unicode code points is 'U+'. Therefore, each character in the Unicode repertoire can be identified by its code point, denoted U+xxxxxx, where each 'x' is an hex digit: 0, 1, 2, ... , 9, A, ... , E or F, keeping in mind that the two leftmost (most significant) hex digits can only be 00, 01, 02, ... , 09, 0A, ... ,  0F, or 10. When the leading digits in this group of 2 are 0s, they are usually omitted. For example, the code point for the upper case A, namely U+000041, is typically written as U+0041. The reason why the remaining two leading 0s in U+0041 are not dropped (to yield U+41, which would be even more concise) is yet another convention. Since the 65,536 most common characters worldwide are grouped together at the beginning of the code space, their code points (from 0 to 65,535) are conventionally represented by their 4 least significant hex digits, from U+0000 to U+FFFF.

In addition to a unique code point (a number), each character in the Unicode repertoire is assigned a unique official Unicode name. For example, the Unicode name for U+0041 is "LATIN CAPITAL LETTER A'. Another example is


with code point U+1F47D and name "ALIEN, EXTRATERRESTRIAL". Here is a complete list of code points and names, as of Version 8 (August 2015). Furthermore, this site gives many details on each character (note that the code point of the character is part of the URL).

The Unicode standard lists 7 types of code points (see this list of Unicode character categories):
  1. graphic code points are for characters that are meant to be visible when displayed, that is, those that are associated with at least one glyph. There are 6 general categories of graphic characters, namely:
    • letters (L category)
    • marks (M), e.g., accents, tilde, etc.
    • numbers (N)
    • punctuation (P)
    • symbols (S), e.g., mathematical symbols such as +, currency symbols such as $
    • spaces (Zs), e.g., whitespace character, non-breaking space, etc.
  2. format code points are for invisible characters that affect neighboring characters, e.g., line breaks, paragraph breaks, left-to-right and right-to-left marks (used to set the way adjacent characters are grouped with respect to text direction), etc.
  3. control code points are for characters that are invisible and take no space. They are not used by Unicode but rather are meaningful to and interpreted by other standards, protocols, programming languages, etc., e.g., the null character that is interpreted as the string terminator in the C programming language, or the carriage return, line feed, and tabulation characters.
  4. private-use code points are not assigned any characters by the Unicode standard because they are meant for third parties to define their own characters while guaranteeing that they will not conflict with any Unicode code point assignments. 
  5. surrogate code points are not assigned to any abstract characters because each one of them is used as part of a pair of code units in the UTF-16 encoding scheme to be described in a later part of this tutorial. No one of these 2,048 code points can be interchanged by itself. Only surrogate pairs have meaning in the standard.
  6. noncharacter code points are reserved for internal use only and are not assigned any characters.
  7. reserved code points are not (yet) assigned to abstract characters but they are (unlike surrogate code points) assignable at a later time, when needed. These code points embody the open nature of the Unicode repertoire.
In Version 8.0 of the Unicode standard, the total number of graphic characters is 120,520. Adding 152 format characters, 65 control characters, and 137,468 private-use characters yields a total of 258,205 assigned characters. After adding the surrogate code points and the 66 noncharacters, a total of 260,319 code points are currently being used, The standard calls these designated code points.

This leaves 853,793 unused code points, out of a total of 1,114,112. In other words, 76.63% of the code space is left unused to accommodate future needs.

To get a sense of how slowly the set of designated code points has grown over time, here is how this number has evolved over time:

V5.2
V6.0
V6.1
V6.2
V6.3
V7.0
V8.0
2009
2010
2012
2012
2013
2014
2015
# of designated
246,943
249,031
249,763
249,764
249,769
252,603
260,319
code points

And here it is as a line graph since V1.0 released in 1991:


Note the linear scale on the vertical axis and recall that the size of the Unicode code space is 1,114,112.

Before I wrap up this post, let me mention another way to look at the code space.

So far, we have looked at the code points by type. However, not all of the code points of a given type are located contiguously in the code space. This is because code points are grouped together in the code space based on their usage, rather than by type. With some exceptions (e.g., all of the ASCII characters are grouped at the bottom of the code space, even though they may be used in several languages), characters that are often used together are near each other in the code space.

In particular, all of the code points for a given language are typically grouped together. So all of the letters of the Latin alphabet are found in one contiguous group and all of the letters of the Greek alphabet are found in another contiguous group. But these two groups of letters are separated by many non-letter characters. For example, the left and right curly brackets ('{' and '}'), the copyright symbol, and the tilde ('~') all have code points that are located between the last letter of the Latin alphabet and the first letter of the Greek alphabet.

In short, code points are organized in Unicode groups, which can be visualized in these charts produced by the Unicode Consortium.

Furthermore, each group is located within one plane of the code space, where a Unicode plane is a contiguous group of 65,536 code points. Since the code space ranges from U+000000 and U+10FFFF, it can naturally be broken down into 17 planes indexed from 0x00 to 0x10 (the two most significant or leftmost hex digits in a code point), with each plan containing 65,536 code points ranging from 0x0000 to 0xFFFF (the four least significant or rightmost hex digits in a code point).

For example, the character U+023456 is located at position 0x3456 in plane number 0x02.

Plane 0 is called the Basic Multilingual Plane or BMP. It contains the most common characters for all of the modern scripts in the world. Therefore, the vast majority of all Unicode characters used in all textual data is located in the BMP.

Plane 1 is called the Supplementary Multilingual Plane or SMP. It contains the characters for scripts or symbols that either did not fit into the BMP or are seldom used, including many historic scripts.

Plane 2 is called the Supplementary Ideographic Plane or SIP. It contains those CJK characters that could not be fit in the BMP, namely the CJK (Chinese, Japanese and Korean) Unified Ideographs Extension B.

Planes 3 through 13 are completely unused in Version 8.0. They are reserved for future use in support of the universal (and thus open) nature of Unicode.

Plane 14 is called the Supplementary Special-purpose Plane or SSP. It contains only two small blocks of non-graphic characters.

Planes 15 and 16 are called the Supplementary Private Use Areas A and B (respectively). They contain private-use characters, as described above.

Let me close by referring you to a visualization of the Unicode code space at one pixel per code point posted at Ryan Flynn's blog.


This blog post is based on Sections 2.4, 2.5, 2.6, 2.8, 2.9 and D.1 of The Unicode® Standard Version 8.0 – Core Specification (August 2015).

Friday, May 20, 2016

Unicode Tutorial - Part 1: Overview

Overview


[ Note that the definitions of (abstract) character, character encoding, repertoire, code point, and related terms used in this post can be found in this previous post. ]


Unicode is a universal encoding scheme for characters and, more generally, multilingual text. While the ASCII encoding scheme was based originally on the English language, Unicode aims to handle all past, present, and future languages and scripts. The character codes defined in the Unicode standard are identical to those in the UCS or ISO/IEC 10646 international standard. Unicode is also backward-compatible with ASCII through its UTF-8 encoding form. Unicode is the default encoding for HTML and XML. It is implemented in all modern operating systems and supported by many programming languages (e.g., Java, JavaScript,  C#).



Coverage


The Unicode encoding aims to replace the many existing encodings (e.g., see this list of character sets at the IANA web site) with a single, universal encoding standard.

The Unicode standard treats all characters equivalently, be they letters, digits, ideographic characters, mathematical symbols, etc. Its all-inclusive repertoire was designed to be able to assign a code point to each one of more than one million characters. As of August 2015, Unicode Version 8.0 assigns code points to 120,672 characters, thereby covering essentially all of the world's scripts. Most of the code points are not assigned yet, which allows the Unicode Consortium to accommodate requests for new characters motivated by changing industry needs, new scholarly endeavors, efforts to preserve the world's cultural heritage in the form of archaic scripts, etc.

In fact, Unicode deals with more than character encoding. It covers a wide variety of issues related to text processing, broadly defined, including properties, algorithms, etc. The Unicode specifications cover hyphenation, line breaks, string comparison and sorting, locale-dependent number, date and time formatting, the display of right-to-left or bidirectional scripts, etc.

The Unicode standard catalogues and identifies characters. It does not define the shape, size or orientation of the printed characters. Refer to this post for the difference between an abstract character and its glyph. The Unicode standard defines the former but not the latter.

Finally, Unicode does not define text elements, such as words, sentences, etc. The definitions of the fundamental units of text depend on both the language in which they appear and the processes that manipulate them (e.g., hyphenation versus sorting). Unicode addresses the lower-level problem of character identification, on top of which the text elements and related processes are defined.

Since no single character encoding can support all basic text processes with optimal efficiency, the Unicode standard embodies trade-offs that are agreed upon by the large membership of the Unicode Consortium.


Goal


The overarching goal of the Unicode Consortium is to facilitate global interoperability and data interchange with simplified and cheaper software, with enough flexibility to be able to respond to changing demands from the global IT industry.

The approach used to achieve this goal is essentially to replace the large number of incomplete and inconsistent existing encodings with a single standard that encompasses all existing and future encodings. Such a unification aims to make it easier, cheaper and/or safer to develop internationalized software products, to eliminate most data-corruption incidents in translation processes, to lower the entry cost for developing countries, and to support the growth of various areas of knowledge in, for example, the mathematical or scientific communities.

In short, the Unicode standard describes an unambiguous, universal, efficient and flexible character encoding system that enables a wide variety of multi-language and/or technical text-based processes.


Design principles


The current Unicode standard reflects trade-offs that result from the adherence to several partially conflicting principles that aim to make the standard and/or its implementation:
  • Universal: The Unicode standard includes a single, very large repertoire of characters that should contain all of the characters needed for the textual representation in all modern (and most historic) writing systems, as well as symbols used in plain text. It should meet the needs of diverse user communities (e.g., business, scientific, etc.) within each language. Of course, this universality is bounded by the deliberate exclusion of insufficiently documented or standardized scripts, as well as non-textual writing systems. 
  • Efficient: Unicode text should be fast to process. This goal is realized by a variety of design choices. For example, Unicode does not contain any escape characters, all code points are equally accessible and their code units do not overlap (which speeds up search for and random access to characters in a stream thereof), the proximity of related characters in the code space speeds up compression algorithms, etc.
  • Character-Centric: The Unicode standard encodes characters, that is, the smallest, meaningful components of written language, such as letters, digits, punctuation marks, mathematical symbols, etc. Unicode represents these characters abstractly with code points (numbers). Unicode does not define the visual representation (or glyph) of these characters. So, Unicode deals almost exclusively with the three leftmost components (and the two associated mappings) in the following four-component framework (for details, refer to this post):

bit pattern\( \longleftrightarrow \)code point\( \longleftrightarrow \)(abstract)\( \longleftrightarrow \)glyph
(one or more code units)character


  • Semantically Explicit: In Unicode, all characters have well-defined semantics (meanings) that are defined through specific properties, rather than implied by the character's name or its position in a table. The Unicode standard identifies more than 100 distinct character properties, including numeric, casing, combination, and directionality properties. The list of available properties, while finite and well circumscribed, is even able to grow as needed (within limits).
  • Exclusively Plain-Text: Unicode characters denote plain text exclusively, that is, the content or meaning of each character, not its appearance. According to the standard:  "Plain text must contain enough information to permit the text to be rendered legibly, and nothing more."  Of course, stylistic information might be included as plain text characters within the character stream (e.g., as tags or other markup language entities), but the interpretation/rendering of this style information is out of the scope of Unicode (i.e., it is left to processes that operate outside of the encoding/decoding processes, such as font rendering).
  • Logically Ordered: Whenever possible, the order in which Unicode text is stored in memory corresponds to the order in which the Unicode text is typed in (e.g., keyboard input) or phonetically output. For numbers, the most significant digit always comes first, even when the numbers are displayed in different directions. 
  • Unified: The Unicode standard eliminates duplication by encoding letters, punctuation marks, etc., only once when they are shared by two or more languages.
  • Dynamically Composable: Complex characters (e.g., accented ones) can be obtained by the composition (superposition) of two or more simple characters, including accent-only characters. Furthermore, it is always possible to create new compositions out of existing characters. However, a complex character may also be pre-composed, that is, mapped to exactly one code point that encodes the whole character without composition. This principle is thus partially in conflict with the previous one, since some characters are encoded by two or more (so-called equivalent) sequences of Unicode characters.
  • Stable: Once a character is assigned a code point, that association cannot be modified. Furthermore, characters cannot be removed from the standard, nor can their names (textual description) be modified (even when they contain a typo!). Similarly, important properties (such as the decompositions discussed in the previous bullet point) are immutable.
  • Convertible: The Unicode standard is designed to make sure that character identity is always preserved when converting its encoding to other widely used or accepted standards.

Upcoming posts will dive into the technical details of the Unicode standard.




This blog post is based on Chapter 1, Section 2.1, and Section 2.2 of The Unicode® Standard Version 8.0 – Core Specification (August 2015).

Wednesday, May 18, 2016

Character Encodings: Framework and Terminology

<blog's birth story>
As a Computer Science professor, I read a good amount of technical documentation and end up having to take a lot of notes. Today I realized that I could turn the notes I take for myself into tutorials that might benefit not only my students but anyone who finds his or her way to this blog.
</blog's birth story>

For example, I am currently preparing to teach a Computer Networking class for the first time this coming fall. Since Java is the language that we teach and use most often in our curriculum, I am looking into Java networking basics so I can familiarize myself with the relevant classes in the Java API that I might use in the programming assignments for this networking class.

<this blog post's origin story>
While reading the code samples in the Custom Networking section of the Java tutorials at the Oracle site, I had to review the basic input/output API classes, including those that handle text I/O versus binary I/O, which in turn led me to think about how the Java Virtual Machine represents strings internally (i.e., UTF-16) and converts them as needed, for example, when reading or writing text files in another encoding (e.g., ASCII or  ISO 8859-1). This is when I realized I had never studied character encodings.
</this blog post's origin story>

In this post, I will describe a conceptual framework for character encodings and define some terminology that will underlie the discussion of Unicode, UTF-16, etc., in upcoming posts.

Definition 1:character is a minimal unit of meaning (e.g., a letter, a digit, etc.). So, by definition, a character is an abstract entity, e.g., the number five, which is different from the numeral or symbol '5' that depicts the character whose meaning is five.

Definition 2: A character set (or character repertoire) is a collection of characters. Examples of character sets include:
  1. the set of the ten single-digit numbers (zero through nine), 
  2. the set of the 26 letters in the Latin or Roman alphabet,
  3. the 128 characters in the ASCII encoding scheme, and
  4. the 256 characters in the extended (or high) ASCII encoding scheme.
Note that some character sets (or repertoires) are closed (i.e., they are fixed at creation time), such as ASCII or extended ASCII, while other character sets are open (i.e., characters can be added at later times without having to create a whole new character set to accommodate the extra characters; the set contains 'placeholders' for as yet unknown characters), such as Unicode.

Definition 3: A code point (or code position) is a numerical value that is associated with a character. For example, in a fictitious scheme, the values 1 and 5 could be assigned to the first and second characters in the Roman alphabet. The mapping from code points to characters is (one-to-one but) arbitrary. Of course, if the character "set" has a well-known or natural ordering (e.g., the order of the letters in the Roman alphabet), then the code points are typically chosen to reflect this ordering. For example, in ASCII, the code points 97 through 122 are assigned in increasing order to the 26 lowercase letters in the Roman alphabet.

Definition 4: A code space (or coded character set) is the collection of all code points for a given character set. Since there is a one-to-one mapping between a character set and its code space, the size of the code space is always equal to the cardinality of the character set. However, while a (character) set is not ordered, the code space is ordered by the usual "less-than" relation among numbers. For example, while the ASCII character set could be written as:

{ a, b,  ... , z, A, B, ... , Z, 0, 1, ... , 9, !, @, ... }

or  

{ A, B, ... , Z, 0, 1, ... , 9, a, b,  ... , z, !, @, ... }

or in many other ways, since the order of the set's elements does not matter, the unique ASCII code space is the sequence:

< 0, 1, 2, ... ,  127 >.

Definition 5: A code unit is a bit sequence that is chosen to encode a code point. Each code point in the code space is thus mapped to a unique code unit. For example, ASCII encodes each code point between 0 and 127 into a bit string between 0000000 and 1111111 using the standard binary, unsigned integer representation.

Definition 6: The code unit size is the number of bits in each code unit. For example, the code unit size of ASCII is 7.

Finally, we are ready for the main definition of this tutorial.

Definition 7: A character encoding is a complete scheme that specifies not only:
  • a character set (the characters to be encoded), 
  • a code space (the code points), and 
  • the set of bit patterns (the code units that correspond to the encoded characters), 
but also the mapping(s) between these sets.

The upshot of these definitions is that there are three levels of abstraction in this model, namely:
  • the abstract characters, 
  • the code points that are assigned to the characters, and
  • the bit patterns that encode the code points. 
These three levels of abstraction and the two mapping among them are depicted below:

bit pattern\( \longleftrightarrow \)code point\( \longleftrightarrow \)(abstract)
(one or more code units)character


An alternative, older approach to character encoding is to map each character directly to a bit pattern, as shown below:

bit pattern\( \longleftrightarrow \)(abstract)
(one or more code units)character

Now, let's give a name to this (historical) model, which is used in the ASCII, EBCDIC and ISO 8859-1 encoding schemes.

Definition 8: A character map is a direct mapping between characters and bit patterns. At least, this is the terminology used in (and by contrast to) the Unicode model, which uses the three-layer abstraction.

In this historical model, it is possible for different encoding schemes to share the same character set but use different mappings from characters to bit patterns, as is the case, for example, for some variants of extended ASCII and EBCDIC.

One advantage of the three-layer model is that it makes it possible to define a "universal" set of all possible characters that can then be encoded in a variety of ways, as was done with Unicode in the form of UTF-8, UTF-16, UTF-32, etc. In those encodings, a given character may map to a single code point but this code point can be mapped to different code units (or a different number of code units), thereby achieving different tradeoffs between the size of the character set and the space (number of bits) taken by a file or network packet. But that is a topic for a future post.

Before wrapping up, I should mention that there is one additional mapping at play, naming the one from (abstract) characters to their visual depiction.

Definition 9: A glyph is a visual or symbolic representation of a character.

We now have four levels of abstraction and three mappings among them, as depicted below:

bit pattern\( \longleftrightarrow \)code point\( \longleftrightarrow \)(abstract)\( \longleftrightarrow \)glyph
(one or more code units)character


While each character in a character set is mapped to a single code point (number) it may be depicted by one or more glyphs. For example, here are nine possible glyphs for the first (lowercase) letter in the Roman alphabet:

A-small glyphs.svg
By Wickey-nl - Own work, CC0, https://commons.wikimedia.org/w/index.php?curid=14823855

Things get even more complicated at this level.

For example, a sequence of “f” followed by “i” may be displayed with a single glyph, called an fi ligature in which the shapes of the two characters are merged together and the dot is missing from the “i”. On the other hand, the same visual representation of the fi ligature might be obtained by a sequence of two glyphs with the right shapes. Similarly, there are several options for depicting accented letters.

The choice of whether to use a single glyph or a sequence of two glyphs is determined by the selected font and the rendering software. But discussing this topic would take us too far afield from character encodings.


Here is my main source for this post. 

A more complete reference is the Unicode Technical Report #17.