The UW Oshkosh Computer Science Tutorial Series: Character Encodings: Framework and Terminology

<blog's birth story>
As a Computer Science professor, I read a good amount of technical documentation and end up having to take a lot of notes. Today I realized that I could turn the notes I take for myself into tutorials that might benefit not only my students but anyone who finds his or her way to this blog.
</blog's birth story>

For example, I am currently preparing to teach a Computer Networking class for the first time this coming fall. Since Java is the language that we teach and use most often in our curriculum, I am looking into Java networking basics so I can familiarize myself with the relevant classes in the Java API that I might use in the programming assignments for this networking class.

<this blog post's origin story>
While reading the code samples in the Custom Networking section of the Java tutorials at the Oracle site, I had to review the basic input/output API classes, including those that handle text I/O versus binary I/O, which in turn led me to think about how the Java Virtual Machine represents strings internally (i.e., UTF-16) and converts them as needed, for example, when reading or writing text files in another encoding (e.g., ASCII or ISO 8859-1). This is when I realized I had never studied character encodings.
</this blog post's origin story>

In this post, I will describe a conceptual framework for character encodings and define some terminology that will underlie the discussion of Unicode, UTF-16, etc., in upcoming posts.

Definition 1: A character is a minimal unit of meaning (e.g., a letter, a digit, etc.). So, by definition, a character is an abstract entity, e.g., the number five, which is different from the numeral or symbol '5' that depicts the character whose meaning is five.

Definition 2: A character set (or character repertoire) is a collection of characters. Examples of character sets include:

the set of the ten single-digit numbers (zero through nine),
the set of the 26 letters in the Latin or Roman alphabet,
the 128 characters in the ASCII encoding scheme, and
the 256 characters in the extended (or high) ASCII encoding scheme.

Note that some character sets (or repertoires) are closed (i.e., they are fixed at creation time), such as ASCII or extended ASCII, while other character sets are open (i.e., characters can be added at later times without having to create a whole new character set to accommodate the extra characters; the set contains 'placeholders' for as yet unknown characters), such as Unicode.

Definition 3: A code point (or code position) is a numerical value that is associated with a character. For example, in a fictitious scheme, the values 1 and 5 could be assigned to the first and second characters in the Roman alphabet. The mapping from code points to characters is (one-to-one but) arbitrary. Of course, if the character "set" has a well-known or natural ordering (e.g., the order of the letters in the Roman alphabet), then the code points are typically chosen to reflect this ordering. For example, in ASCII, the code points 97 through 122 are assigned in increasing order to the 26 lowercase letters in the Roman alphabet.

Definition 4: A code space (or coded character set) is the collection of all code points for a given character set. Since there is a one-to-one mapping between a character set and its code space, the size of the code space is always equal to the cardinality of the character set. However, while a (character) set is not ordered, the code space is ordered by the usual "less-than" relation among numbers. For example, while the ASCII character set could be written as:

{ a, b, ... , z, A, B, ... , Z, 0, 1, ... , 9, !, @, ... }

or

{ A, B, ... , Z, 0, 1, ... , 9, a, b, ... , z, !, @, ... }

or in many other ways, since the order of the set's elements does not matter, the unique ASCII code space is the sequence:

< 0, 1, 2, ... , 127 >.

Definition 5: A code unit is a bit sequence that is chosen to encode a code point. Each code point in the code space is thus mapped to a unique code unit. For example, ASCII encodes each code point between 0 and 127 into a bit string between 0000000 and 1111111 using the standard binary, unsigned integer representation.

Definition 6: The code unit size is the number of bits in each code unit. For example, the code unit size of ASCII is 7.

Finally, we are ready for the main definition of this tutorial.

Definition 7: A character encoding is a complete scheme that specifies not only:

a character set (the characters to be encoded),
a code space (the code points), and
the set of bit patterns (the code units that correspond to the encoded characters),

but also the mapping(s) between these sets.

The upshot of these definitions is that there are three levels of abstraction in this model, namely:

the abstract characters,
the code points that are assigned to the characters, and
the bit patterns that encode the code points.

These three levels of abstraction and the two mapping among them are depicted below:

bit pattern	\( \longleftrightarrow \)	code point	\( \longleftrightarrow \)	(abstract)
(one or more code units)				character

An alternative, older approach to character encoding is to map each character directly to a bit pattern, as shown below:

bit pattern	\( \longleftrightarrow \)	(abstract)
(one or more code units)	\( \longleftrightarrow \)	character

Now, let's give a name to this (historical) model, which is used in the ASCII, EBCDIC and ISO 8859-1 encoding schemes.

Definition 8: A character map is a direct mapping between characters and bit patterns. At least, this is the terminology used in (and by contrast to) the Unicode model, which uses the three-layer abstraction.

In this historical model, it is possible for different encoding schemes to share the same character set but use different mappings from characters to bit patterns, as is the case, for example, for some variants of extended ASCII and EBCDIC.

One advantage of the three-layer model is that it makes it possible to define a "universal" set of all possible characters that can then be encoded in a variety of ways, as was done with Unicode in the form of UTF-8, UTF-16, UTF-32, etc. In those encodings, a given character may map to a single code point but this code point can be mapped to different code units (or a different number of code units), thereby achieving different tradeoffs between the size of the character set and the space (number of bits) taken by a file or network packet. But that is a topic for a future post.

Before wrapping up, I should mention that there is one additional mapping at play, naming the one from (abstract) characters to their visual depiction.

Definition 9: A glyph is a visual or symbolic representation of a character.

We now have four levels of abstraction and three mappings among them, as depicted below:

bit pattern	\( \longleftrightarrow \)	code point	\( \longleftrightarrow \)	(abstract)	\( \longleftrightarrow \)	glyph
(one or more code units)				character

While each character in a character set is mapped to a single code point (number) it may be depicted by one or more glyphs. For example, here are nine possible glyphs for the first (lowercase) letter in the Roman alphabet:

By Wickey-nl - Own work, CC0, https://commons.wikimedia.org/w/index.php?curid=14823855

Things get even more complicated at this level.

For example, a sequence of “f” followed by “i” may be displayed with a single glyph, called an fi ligature in which the shapes of the two characters are merged together and the dot is missing from the “i”. On the other hand, the same visual representation of the fi ligature might be obtained by a sequence of two glyphs with the right shapes. Similarly, there are several options for depicting accented letters.

The choice of whether to use a single glyph or a sequence of two glyphs is determined by the selected font and the rendering software. But discussing this topic would take us too far afield from character encodings.

Here is my main source for this post.

A more complete reference is the Unicode Technical Report #17.

Finally, here are some introductory articles on this topic that might be helpful before moving on to the upcoming posts:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text, by David C. Zentgraf

The UW Oshkosh Computer Science Tutorial Series

Wednesday, May 18, 2016

Character Encodings: Framework and Terminology

No comments:

Post a Comment