Tuesday, May 24, 2016

Unicode Tutorial - Part 3: Character Encoding Form versus Character Encoding Scheme

Let's start by reviewing the general framework for encoding schemes described in this post:


bit pattern (one or more code units) \( \longleftrightarrow \) code point \( \longleftrightarrow \) (abstract) character


in which the rightmost arrow indicates the mapping between the repertoire (set of abstract characters) and the code space (set of numbers, or code points) of the encoding, as discussed in Part 2 of this tutorial. The leftmost arrow, on the other hand, captures the fact that each code point is represented as a sequence of bits (made up of one or more code units) in the computer's memory.

In the previous post, we discussed the mapping from characters to code points. In this post, we focus on the leftmost mapping, that is, the way bit patterns are assigned to code points.

Since Unicode is an international standard for data interchange over computer networks, the code points, which are numbers, must be encoded as fixed-size bit patterns (e.g., octets or 8-bit bytes, 32-bit words, etc.) so that computers can process, transmit, and store them. These fixed-size bit patterns are called code units, and this fixed number of bits is called the code unit size (see this post for a refresher).

Each character encoding is therefore characterized by 1) its code unit size and 2) the way each code point is encoded into (or mapped to) a sequence of one or more code units. In the Unicode standard, this mapping is called a "character encoding form" (the "character encoding scheme" is a different notion that will be defined below).

Definition 1: A character encoding form (or CEF) is a mapping from the code space (i.e., the set of code points) to the set of sequences of code units.

In other words, in the Unicode standard, a CEF specifies how each integer that is a valid code point is represented as a bit string, which is the concatenation of one or more fixed-size code units.

The Unicode standard defines three distinct CEFs that use 8-bit, 16-bit, and 32-bit code units and are named UTF-8, UTF-16, and UTF-32, respectively. The UTF acronym is a carryover from earlier terminology and stands for Unicode (or UCS) Transformation Format. Each one of UTF-8, UTF-16, and UTF-32 will be described in its own post in upcoming parts of this tutorial.
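
A quick way to see these three CEFs in action is to ask Python for the code unit sequence assigned to a single code point. The small sketch below is only illustrative: the helper (mine, not part of any standard) recovers the code units by taking the big-endian byte serialization of each encoding (serialization is discussed later in this post; for UTF-8 the bytes are the code units) and grouping the bytes by the code unit size. U+10302 is the supplementary code point reused in the examples further down.

    def code_units(char, encoding, unit_size):
        """Group the big-endian serialization into code units (as hex strings)."""
        data = char.encode(encoding)
        return [data[i:i + unit_size].hex().upper()
                for i in range(0, len(data), unit_size)]

    cp = chr(0x10302)
    print(code_units(cp, "utf-8", 1))      # ['F0', '90', '8C', '82']
    print(code_units(cp, "utf-16-be", 2))  # ['D800', 'DF02']
    print(code_units(cp, "utf-32-be", 4))  # ['00010302']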

Note that each one of these three CEFs is a legitimate encoding of all Unicode characters. They differ not only in their code unit size and in the bit strings that they assign to the code points, but also in the number of code units in the sequences assigned to each code point. This leads us to define two kinds of CEFs.

Definition 2: A CEF has fixed width if and only if it maps all code points in the code space to (code unit) sequences of the same length. Otherwise, the CEF is said to have variable width.

For example, UTF-32 has fixed width because it maps all code points to a single 32-bit code unit. Similarly, ISO/IEC 8859-1 has fixed width because it maps all code points in its code space to a single 8-bit code unit.

In contrast, UTF-16 has variable width because it maps some code points to one 16-bit code unit and others to a sequence of two 16-bit code units. Similarly, UTF-8 has variable width because it maps code points to sequences of 1, 2, 3, or 4 8-bit code units.
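
The difference is easy to confirm with the same trick as above: count how many code units each CEF assigns to each character. The snippet below is just a rough sketch; the sample characters correspond to the code points U+004D, U+0430, U+4E8C, and U+10302 used in the examples later in this post.

    def unit_count(char, encoding, unit_size):
        return len(char.encode(encoding)) // unit_size

    for char in ["\u004D", "\u0430", "\u4E8C", "\U00010302"]:
        print(f"U+{ord(char):04X}:",
              "UTF-8 =", unit_count(char, "utf-8", 1),
              "UTF-16 =", unit_count(char, "utf-16-be", 2),
              "UTF-32 =", unit_count(char, "utf-32-be", 4))
    # UTF-8 uses 1 to 4 code units, UTF-16 uses 1 or 2, UTF-32 always uses exactly 1.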

Character encoding forms define a one-to-one mapping between code points and code unit sequences. As such, they are sufficient when dealing with characters stored in the memory of any given computer. After all, a code point is just an integer and a code unit sequence is just a bit string, and any computer can handle those data types easily.

However, things are a bit more complicated when textual data must be stored in a file on disk or sent to another computer over a network. Since the basic unit of storage and communication is the octet or 8-bit byte (hereafter referred to simply as a byte), integers and bit strings that do not fit in one byte must be handled carefully. In particular, these bit strings must be serialized, that is, broken down into a sequence of bytes.

Unfortunately, different computers (more specifically, processor architectures) may serialize multi-byte bit strings in different ways. For example, Little Endian architectures serialize bit strings from least significant to most significant (i.e., they put the "little end" first), while Big Endian architectures serialize bit strings from most significant to least significant (i.e., they put the "big end" first).

For instance, the (made-up) 3-byte value 0x2468AC would be serialized as 24 68 AC in Big Endian order and as AC 68 24 in Little Endian order.
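
In Python, the two serializations can be reproduced with int.to_bytes (the value and the 3-byte length are just the made-up example above):

    value = 0x2468AC
    print(" ".join(f"{b:02X}" for b in value.to_bytes(3, "big")))     # 24 68 AC
    print(" ".join(f"{b:02X}" for b in value.to_bytes(3, "little")))  # AC 68 24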

This difference does not matter for operations on code unit sequences that take place within the memory of a given computer, since each computer consistently uses the same byte order for all of its internal operations. 

However, the byte order becomes important when encoded text (i.e., the code unit sequences of its characters) is communicated through data streams that connect to file storage or network communication links. To avoid corruption of the encoded text, the computers involved must agree on the byte serialization scheme.

This is where character encoding schemes come into play.

Definition 3: A character encoding scheme (or CES) specifies both a CEF and the way that the code unit sequences of the CEF are serialized into bytes. 

The Unicode standard specifies 3 character encoding forms (as discussed above) and 7 character encoding schemes, as follows:
  • UTF-32BE: UTF-32 plus Big Endian serialization
  • UTF-32LE: UTF-32 plus Little Endian serialization
  • UTF-32: Uses Big Endian by default (or a byte order mark, see below). In this case, the CEF and the CES have the same name, which must be disambiguated by the context.
  • UTF-16BE: UTF-16 plus Big Endian serialization
  • UTF-16LE: UTF-16 plus Little Endian serialization
  • UTF-16: Same situation as the UTF-32 scheme in the third bullet above (Big Endian by default, or a byte order mark), with the name shared between the CEF and the CES
  • UTF-8: Since each code point maps to a sequence of code units that are bytes already, there is no need to serialize the code units. The bytes are serialized in the same order as they appear in the code unit sequence.
Let's now consider some examples, keeping in mind that two hex digits (as used in the following examples) together correspond to one byte.

Example 1: In the UTF-8 scheme, the UTF-8 (CEF) code unit sequence
4D D0 B0 E4 BA 8C F0 90 8C 82 
is directly serialized as: 
4D D0 B0 E4 BA 8C F0 90 8C 82 

Example 2: In the UTF-16BE scheme, the UTF-16 code unit sequence
004D 0430 4E8C D800 DF02 
is serialized as: 
00 4D 04 30 4E 8C D8 00 DF 02 

Example 3: In the UTF-16LE scheme, the UTF-16 code unit sequence
004D 0430 4E8C D800 DF02 
is serialized as: 
4D 00 30 04 8C 4E 00 D8 02 DF 

Example 4: In the UTF-32BE scheme, the UTF-32 code unit sequence
0000004D 00000430 00004E8C 00010302 
is serialized as: 
00 00 00 4D 00 00 04 30 00 00 4E 8C 00 01 03 02

Example 5: In the UTF-32LE scheme, the UTF-32 code unit sequence
0000004D 00000430 00004E8C 00010302 
is serialized as: 
4D 00 00 00 30 04 00 00 8C 4E 00 00 02 03 01 00
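
All five serializations above can be reproduced with Python's built-in codecs, whose names for the endian-specific schemes match the Unicode names (Python's plain "utf-16" and "utf-32" codecs prepend a byte order mark, so the explicit BE/LE variants are used in this sketch). The string below is simply made up of the four characters behind the code points in the examples: U+004D, U+0430, U+4E8C, and U+10302.

    text = "\u004D\u0430\u4E8C\U00010302"

    for scheme in ["utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"]:
        data = text.encode(scheme)
        print(f"{scheme:10}", " ".join(f"{b:02X}" for b in data))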

There is one more feature of encoding schemes that I'd like to mention. How do Unicode text processors determine the endianness of a piece of Unicode text when the byte order is not specified, as in the case of the UTF-16 and UTF-32 encoding schemes?

Definition 4: The Unicode standard defines a special character called the byte order mark or BOM that appears at the head of a data stream in order to unambiguously signal the byte order of the code units that follow.

For example, in the UTF-16 encoding scheme, an initial byte sequence "FE FF" is interpreted as a byte order mark indicating Big Endian order. In contrast, the initial byte sequence "FF FE" indicates Little Endian order. 

Note that the BOM is not part of the contents of the file or the data in the packet. U+FEFF and U+FFFE were chosen for this role because they are highly unlikely to represent encoded text. In fact, U+FFFE is a noncharacter and U+FEFF is the format character named ZERO WIDTH NO-BREAK SPACE (see Part 2 of this tutorial).

Let's conclude with an example involving the BOM.

Example 6: Consider the UTF-16 code unit sequence: D800 DF02 
In UTF-16BE, it is serialized as: D8 00 DF 02
In UTF-16LE, it is serialized as: 00 D8 02 DF 
But in UTF-16, it could be serialized as:
  • FE FF D8 00 DF 02 (Big Endian with BOM), 
  • FF FE 00 D8 02 DF (Little Endian with BOM), or
  • D8 00 DF 02 (Big Endian, no BOM, i.e., the default). 
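
The BOM handling in Example 6 can be checked with a small Python sketch: the generic "utf-16" codec reads a leading BOM, picks the byte order accordingly, and drops the BOM from the decoded text, so both BOM-prefixed serializations decode to the same code point. (Note that when no BOM is present, Python's generic codec falls back to the machine's native byte order rather than the Big Endian default of the standard, so the BOM-less case is decoded with the explicit "utf-16-be" codec below.)

    samples = [
        (bytes.fromhex("FE FF D8 00 DF 02"), "utf-16"),     # Big Endian with BOM
        (bytes.fromhex("FF FE 00 D8 02 DF"), "utf-16"),     # Little Endian with BOM
        (bytes.fromhex("D8 00 DF 02"),       "utf-16-be"),  # Big Endian, no BOM
    ]
    for raw, codec in samples:
        decoded = raw.decode(codec)
        print(" ".join(f"{b:02X}" for b in raw), "->", f"U+{ord(decoded):04X}")  # U+10302 each time
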
To summarize this post, the mapping from code points to bit strings, shown above as:

bit pattern (one or more code units) \( \longleftrightarrow \) code point

in Unicode is really a two-step mapping as follows:

byte sequence \( \stackrel{\text{byte ordering}}{\longleftrightarrow} \) code unit sequence \( \stackrel{\text{CEF}}{\longleftrightarrow} \) code point
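
For instance, with the UTF-16BE and UTF-16LE schemes and the code point U+0430, the first step (the CEF) yields the single 16-bit code unit 0430, and the second step (byte ordering) turns that code unit into the byte sequence 04 30 or 30 04, respectively. A two-line Python sketch of the second step, using the struct module:

    import struct

    code_unit = 0x0430  # step 1 (CEF): U+0430 maps to this single 16-bit code unit
    print(" ".join(f"{b:02X}" for b in struct.pack(">H", code_unit)))  # Big Endian:    04 30
    print(" ".join(f"{b:02X}" for b in struct.pack("<H", code_unit)))  # Little Endian: 30 04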

Having covered the 7 character encoding schemes in sufficient detail, we will turn our attention back to the 3 character encoding forms in the next three parts of this tutorial.

This blog post is based on Sections 2.5, 2.6, 3.9 and 3.10 of The Unicode® Standard Version 8.0 – Core Specification (August 2015) and Unicode Technical Report #17 (2008).
