The UW Oshkosh Computer Science Tutorial Series: Unicode Tutorial

Let's start by reviewing the general framework for encoding schemes described in this post:

bit pattern (one or more code units)	$\longleftrightarrow$	code point	$\longleftrightarrow$	(abstract) character

in which the rightmost arrow indicates the mapping between the repertoire (set of abstract characters) and the code space (set of numbers or code points) of the encoding. This mapping is one-to-one: each abstract character is assigned a unique code point and each code point is assigned to exactly one character. The leftmost mapping, on the other hand, represents the fact that each code point is represented as a sequence of bits (made of one or more code units) in the computer's memory.

In this post, we focus on the rightmost mapping, that is, the way code points are assigned to characters.

In the Unicode standard, the code space contains 1,114,112 numbers, namely the integers from 0 to 1,114,111. Therefore, Unicode can accommodate at most 1,114,112 characters. To keep the code points short, we will write them in hexadecimal notation. So, the code points in Unicode range from 0x000000 to 0x10FFFF, where '0x' is the conventional prefix indicating that the digits that follow are in hex notation.

Actually, the conventional prefix for Unicode code points is 'U+'. Therefore, each character in the Unicode repertoire can be identified by its code point, denoted U+xxxxxx, where each 'x' is a hex digit: 0, 1, 2, ... , 9, A, ... , E or F, keeping in mind that the two leftmost (most significant) hex digits can only be 00, 01, 02, ... , 09, 0A, ... , 0F, or 10. When the leading digits in this group of 2 are 0s, they are usually omitted. For example, the code point for the upper case A, namely U+000041, is typically written as U+0041. The reason why the remaining two leading 0s in U+0041 are not dropped (to yield U+41, which would be even more concise) is yet another convention. Since the 65,536 most common characters worldwide are grouped together at the beginning of the code space, their code points (from 0 to 65,535) are conventionally represented by their 4 least significant hex digits, from U+0000 to U+FFFF.
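As a small illustration of this notation, here is a minimal Python sketch that formats an integer as a code point label. The helper name `format_code_point` is hypothetical (it is not part of any library); it simply applies the convention described above: at least four hex digits, and a fifth or sixth only when the value needs them.

```python
import sys

# Size of the Unicode code space: the integers 0 .. 0x10FFFF.
CODE_SPACE_SIZE = 0x10FFFF + 1


def format_code_point(cp):
    """Return the conventional 'U+' label for a code point (hypothetical helper)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"{cp} is outside the Unicode code space")
    # Pad to 4 hex digits; larger values keep their 5 or 6 digits.
    return f"U+{cp:04X}"


print(CODE_SPACE_SIZE)              # 1114112
print(sys.maxunicode == 0x10FFFF)   # True
print(format_code_point(ord("A")))  # U+0041
print(format_code_point(0x1F47D))   # U+1F47D
print(format_code_point(0x10FFFF))  # U+10FFFF
```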

In addition to a unique code point (a number), each character in the Unicode repertoire is assigned a unique official Unicode name. For example, the Unicode name for U+0041 is "LATIN CAPITAL LETTER A". Another example is 👽, with code point U+1F47D and name "EXTRATERRESTRIAL ALIEN". Here is a complete list of code points and names, as of Version 8 (August 2015). Furthermore, this site gives many details on each character (note that the code point of the character is part of the URL).
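Code points and names can also be looked up programmatically. The sketch below uses Python's standard `unicodedata` module; note that it reports names from whatever version of the Unicode database ships with the Python interpreter, which may be newer than Version 8.

```python
import unicodedata

for ch in ("A", "\U0001F47D"):
    # ord() gives the code point; unicodedata.name() gives the official name.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# The mapping also works in the other direction, from name to character.
print(unicodedata.lookup("LATIN CAPITAL LETTER A") == "A")
```

The two characters chosen here have had stable names since they were added to the standard, so the loop prints `U+0041 LATIN CAPITAL LETTER A` and `U+1F47D EXTRATERRESTRIAL ALIEN`.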

The Unicode standard lists 7 types of code points (see this list of Unicode character categories):

graphic code points are for characters that are meant to be visible when displayed, that is, those that are associated with at least one glyph. There are 6 general categories of graphic characters, namely:

letters (L category)
marks (M), e.g., accents, tilde, etc.
numbers (N)
punctuation (P)
symbols (S), e.g., mathematical symbols such as +, currency symbols such as $
spaces (Zs), e.g., whitespace character, non-breaking space, etc.

format code points are for invisible characters that affect neighboring characters, e.g., line breaks, paragraph breaks, left-to-right and right-to-left marks (used to set the way adjacent characters are grouped with respect to text direction), etc.
control code points are for characters that are invisible and take no space. They are not used by Unicode but rather are meaningful to and interpreted by other standards, protocols, programming languages, etc., e.g., the null character that is interpreted as the string terminator in the C programming language, or the carriage return, line feed, and tabulation characters.
private-use code points are not assigned any characters by the Unicode standard because they are meant for third parties to define their own characters while guaranteeing that they will not conflict with any Unicode code point assignments.
surrogate code points are not assigned to any abstract characters because each one of them is used as part of a pair of code units in the UTF-16 encoding scheme to be described in a later part of this tutorial. None of these 2,048 code points can be interchanged on its own. Only surrogate pairs have meaning in the standard.
noncharacter code points are reserved for internal use only and are not assigned any characters.
reserved code points are not (yet) assigned to abstract characters but they are (unlike surrogate code points) assignable at a later time, when needed. These code points embody the open nature of the Unicode repertoire.
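The seven types above can be approximated from the general category that Python's `unicodedata` module reports for each code point. This is only a sketch under stated assumptions: `code_point_type` is a hypothetical helper, the mapping from two-letter categories to types is my reading of the list above, and the noncharacter ranges (U+FDD0 to U+FDEF plus the last two code points of every plane) are not spelled out in this post. Results for reserved code points depend on the Unicode version bundled with Python.

```python
import unicodedata


def code_point_type(cp):
    """Classify a code point into one of the 7 types (hypothetical helper)."""
    # Noncharacters: U+FDD0..U+FDEF plus the last two code points of each plane.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"
    category = unicodedata.category(chr(cp))
    if category in ("Cf", "Zl", "Zp"):
        return "format"
    if category == "Cc":
        return "control"
    if category == "Co":
        return "private-use"
    if category == "Cs":
        return "surrogate"
    if category == "Cn":
        # Unassigned and not a noncharacter: still assignable later.
        return "reserved"
    # Everything else is L*, M*, N*, P*, S*, or Zs.
    return "graphic"


for cp in (0x41, 0x200E, 0x0A, 0xE000, 0xD800, 0xFFFF):
    print(f"U+{cp:04X}", code_point_type(cp))
```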

In Version 8.0 of the Unicode standard, the total number of graphic characters is 120,520. Adding 152 format characters, 65 control characters, and 137,468 private-use characters yields a total of 258,205 assigned characters. After adding the surrogate code points and the 66 noncharacters, a total of 260,319 code points are currently in use. The standard calls these designated code points.

This leaves 853,793 unused code points, out of a total of 1,114,112. In other words, 76.63% of the code space is left unused to accommodate future needs.
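The arithmetic behind these totals can be checked with a few lines of Python. The first four counts are taken directly from the text; the surrogate range (U+D800 to U+DFFF) and the noncharacter ranges (U+FDD0 to U+FDEF plus the last two code points of each of the 17 planes) are not listed in this post and are supplied here as assumptions about where the 2,048 and 66 come from.

```python
# Counts quoted in the text for Unicode 8.0.
graphic, fmt, control, private_use = 120_520, 152, 65, 137_468

# Counts derived from fixed ranges of the code space (assumed ranges, see above).
surrogates = 0xDFFF - 0xD800 + 1                # U+D800..U+DFFF
noncharacters = (0xFDEF - 0xFDD0 + 1) + 2 * 17  # 32 + last two of each plane

assigned = graphic + fmt + control + private_use
designated = assigned + surrogates + noncharacters
code_space = 0x10FFFF + 1
unused = code_space - designated

print(surrogates, noncharacters)     # 2048 66
print(assigned)                      # 258205
print(designated)                    # 260319
print(unused)                        # 853793
print(f"{unused / code_space:.2%}")  # 76.63%
```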

To get a sense of how slowly the set of designated code points has grown, here is how its size has evolved across recent versions:

Version	V5.2	V6.0	V6.1	V6.2	V6.3	V7.0	V8.0
Year	2009	2010	2012	2012	2013	2014	2015
# of designated code points	246,943	249,031	249,763	249,764	249,769	252,603	260,319

And here it is as a line graph since V1.0 released in 1991:

Note the linear scale on the vertical axis and recall that the size of the Unicode code space is 1,114,112.

Before I wrap up this post, let me mention another way to look at the code space.

So far, we have looked at the code points by type. However, not all of the code points of a given type are located contiguously in the code space. This is because code points are grouped together in the code space based on their usage, rather than by type. With some exceptions (e.g., all of the ASCII characters are grouped at the bottom of the code space, even though they may be used in several languages), characters that are often used together are near each other in the code space.

In particular, all of the code points for a given language are typically grouped together. So all of the letters of the Latin alphabet are found in one contiguous group and all of the letters of the Greek alphabet are found in another contiguous group. But these two groups of letters are separated by many non-letter characters. For example, the left and right curly brackets ('{' and '}'), the copyright symbol, and the tilde ('~') all have code points that are located between the last letter of the Latin alphabet and the first letter of the Greek alphabet.

In short, code points are organized in Unicode groups, which can be visualized in these charts produced by the Unicode Consortium.

Furthermore, each group is located within one plane of the code space, where a Unicode plane is a contiguous group of 65,536 code points. Since the code space ranges from U+000000 to U+10FFFF, it can naturally be broken down into 17 planes indexed from 0x00 to 0x10 (the two most significant or leftmost hex digits in a code point), with each plane containing 65,536 code points ranging from 0x0000 to 0xFFFF (the four least significant or rightmost hex digits in a code point).

For example, the character U+023456 is located at position 0x3456 in plane number 0x02.
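Splitting a code point into its plane number and its position within the plane is a single integer division by 65,536 (0x10000). The following Python sketch shows one way to do it; `split_code_point` is a hypothetical helper name chosen for this illustration.

```python
PLANE_SIZE = 0x10000  # 65,536 code points per plane


def split_code_point(cp):
    """Return (plane, position) for a code point (hypothetical helper)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"{cp} is outside the Unicode code space")
    # Quotient = two leftmost hex digits, remainder = four rightmost hex digits.
    return divmod(cp, PLANE_SIZE)


plane, position = split_code_point(0x023456)
print(f"plane 0x{plane:02X}, position 0x{position:04X}")  # plane 0x02, position 0x3456
print(split_code_point(0x10FFFF))                         # (16, 65535)
```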

Plane 0 is called the Basic Multilingual Plane or BMP. It contains the most common characters for all of the modern scripts in the world. Therefore, the vast majority of all Unicode characters used in all textual data is located in the BMP.

Plane 1 is called the Supplementary Multilingual Plane or SMP. It contains the characters for scripts or symbols that either did not fit into the BMP or are seldom used, including many historic scripts.

Plane 2 is called the Supplementary Ideographic Plane or SIP. It contains those CJK (Chinese, Japanese and Korean) characters that could not fit in the BMP, most notably the CJK Unified Ideographs Extension B.

Planes 3 through 13 are completely unused in Version 8.0. They are reserved for future use in support of the universal (and thus open) nature of Unicode.

Plane 14 is called the Supplementary Special-purpose Plane or SSP. It contains only two small blocks of non-graphic characters.

Planes 15 and 16 are called the Supplementary Private Use Areas A and B (respectively). They contain private-use characters, as described above.
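The plane layout described above can be summarized as a small lookup table. This is a sketch only: the dictionary `PLANE_NAMES` and the function `plane_name` are hypothetical names, and the table reflects the Version 8.0 layout given in this post, where planes 3 through 13 are unused.

```python
# Plane names as described in the text for Unicode 8.0.
PLANE_NAMES = {
    0: "Basic Multilingual Plane (BMP)",
    1: "Supplementary Multilingual Plane (SMP)",
    2: "Supplementary Ideographic Plane (SIP)",
    14: "Supplementary Special-purpose Plane (SSP)",
    15: "Supplementary Private Use Area A",
    16: "Supplementary Private Use Area B",
}


def plane_name(cp):
    """Name the plane that contains a code point (hypothetical helper)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"{cp} is outside the Unicode code space")
    # Shifting right by 16 bits drops the position and keeps the plane number.
    return PLANE_NAMES.get(cp >> 16, "unused in Version 8.0")


print(plane_name(0x0041))   # Basic Multilingual Plane (BMP)
print(plane_name(0x1F47D))  # Supplementary Multilingual Plane (SMP)
print(plane_name(0x50000))  # unused in Version 8.0
```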

Let me close by referring you to a visualization of the Unicode code space at one pixel per code point posted at Ryan Flynn's blog.

This blog post is based on Sections 2.4, 2.5, 2.6, 2.8, 2.9 and D.1 of The Unicode® Standard Version 8.0 – Core Specification (August 2015).

Sunday, May 22, 2016

Unicode Tutorial - Part 2: Code space
