The UW Oshkosh Computer Science Tutorial Series: Unicode Tutorial

Overview

[ Note that the definitions of (abstract) character, character encoding, repertoire, code point, and related terms used in this post can be found in this previous post. ]

Unicode is a universal encoding scheme for characters and, more generally, multilingual text. Whereas the ASCII encoding scheme was originally based on the English language, Unicode aims to handle all past, present, and future languages and scripts. The character codes defined in the Unicode standard are identical to those in the ISO/IEC 10646 international standard, also known as the Universal Coded Character Set (UCS). Unicode is also backward-compatible with ASCII through its UTF-8 encoding form. HTML and XML both use Unicode as their document character set. Unicode is implemented in all modern operating systems and supported by many programming languages (e.g., Java, JavaScript, C#).
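The backward compatibility with ASCII mentioned above can be checked directly. The following is a minimal sketch using Python's built-in codecs; the sample strings are arbitrary choices made for illustration.

```python
# Text made only of ASCII characters yields the same bytes
# whether it is encoded as ASCII or as UTF-8.
text = "Hello, Unicode!"
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
print(ascii_bytes == utf8_bytes)  # True

# A character outside ASCII needs more than one byte in UTF-8.
e_acute = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE
e_acute_bytes = e_acute.encode("utf-8")
print(e_acute_bytes)  # b'\xc3\xa9'
```

In other words, any valid ASCII file is already a valid UTF-8 file with the same content.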

Coverage

The Unicode encoding aims to replace the many existing encodings (e.g., see this list of character sets at the IANA web site) with a single, universal encoding standard.

The Unicode standard treats all characters equivalently, be they letters, digits, ideographic characters, mathematical symbols, etc. Its all-inclusive repertoire was designed to be able to assign a code point to each one of more than one million characters. As of August 2015, Unicode Version 8.0 assigns code points to 120,672 characters, thereby covering essentially all of the world's scripts. Most of the code points are not assigned yet, which allows the Unicode Consortium to accommodate requests for new characters motivated by changing industry needs, new scholarly endeavors, efforts to preserve the world's cultural heritage in the form of archaic scripts, etc.
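To put the numbers above in perspective, here is a small arithmetic sketch. It relies on a fact not stated in this post: the Unicode code space is organized as 17 planes of 65,536 code points each, ranging from U+0000 to U+10FFFF. The variable names are illustrative only.

```python
import sys

PLANES = 17                      # assumption: layout of the Unicode code space
CODE_POINTS_PER_PLANE = 0x10000  # 65,536 code points per plane

code_space_size = PLANES * CODE_POINTS_PER_PLANE
highest_code_point = code_space_size - 1
assigned_in_8_0 = 120_672        # figure quoted in the text for Version 8.0

print(code_space_size)                       # 1114112
print(hex(highest_code_point))               # 0x10ffff
print(sys.maxunicode == highest_code_point)  # True

# Share of the code space assigned to characters as of Version 8.0.
share = f"{assigned_in_8_0 / code_space_size:.1%}"
print(share)  # 10.8%
```

This ratio is only a rough indication of the remaining room, since parts of the code space are reserved for purposes other than ordinary character assignment.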

In fact, Unicode deals with more than character encoding. It covers a wide variety of issues related to text processing, broadly defined, including character properties and text-processing algorithms. The Unicode specifications cover hyphenation, line breaks, string comparison and sorting, locale-dependent formatting of numbers, dates and times, the display of right-to-left or bidirectional scripts, etc.

The Unicode standard catalogues and identifies characters. It does not define the shape, size or orientation of the printed characters. Refer to this post for the difference between an abstract character and its glyph. The Unicode standard defines the former but not the latter.

Finally, Unicode does not define text elements, such as words, sentences, etc. The definitions of the fundamental units of text depend on both the language in which they appear and the processes that manipulate them (e.g., hyphenation versus sorting). Unicode addresses the lower-level problem of character identification, on top of which the text elements and related processes are defined.

Since no single character encoding can support all basic text processes with optimal efficiency, the Unicode standard embodies trade-offs that are agreed upon by the large membership of the Unicode Consortium.

Goal

The overarching goal of the Unicode Consortium is to facilitate global interoperability and data interchange with simplified and cheaper software, with enough flexibility to be able to respond to changing demands from the global IT industry.

The approach used to achieve this goal is essentially to replace the large number of incomplete and inconsistent existing encodings with a single standard that encompasses all existing and future encodings. Such a unification aims to make it easier, cheaper and/or safer to develop internationalized software products, to eliminate most data-corruption incidents in translation processes, to lower the entry cost for developing countries, and to support the growth of various areas of knowledge in, for example, the mathematical or scientific communities.

In short, the Unicode standard describes an unambiguous, universal, efficient and flexible character encoding system that enables a wide variety of multi-language and/or technical text-based processes.

Design principles

The current Unicode standard reflects trade-offs among several partially conflicting design principles, which aim to make the standard and/or its implementations:

Universal: The Unicode standard includes a single, very large repertoire of characters that should contain all of the characters needed to represent text in all modern (and most historic) writing systems, as well as symbols used in plain text. It should meet the needs of diverse user communities (e.g., business, scientific, etc.) within each language. Of course, this universality is bounded by the deliberate exclusion of insufficiently documented or standardized scripts, as well as non-textual writing systems.

Efficient: Unicode text should be fast to process. This goal is realized by a variety of design choices. For example, Unicode does not rely on escape characters; all code points are equally accessible; the code unit sequences of distinct characters do not overlap, which speeds up searching and random access in a character stream; and the proximity of related characters in the code space helps compression algorithms.
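The non-overlap property can be illustrated with UTF-8, one of the Unicode encoding forms. In UTF-8, the bytes that continue a character all have the bit pattern 10xxxxxx, which no byte starting a character ever has. The sketch below assumes well-formed UTF-8 input and a valid index; the helper names `is_continuation_byte` and `char_start` are hypothetical, introduced only for this example.

```python
def is_continuation_byte(byte: int) -> bool:
    """Return True for UTF-8 continuation bytes (bit pattern 10xxxxxx)."""
    return (byte & 0b1100_0000) == 0b1000_0000


def char_start(data: bytes, index: int) -> int:
    """Back up from an arbitrary byte index to the first byte of its character.

    Assumes `data` is well-formed UTF-8 and 0 <= index < len(data).
    """
    while index > 0 and is_continuation_byte(data[index]):
        index -= 1
    return index


# "a", EURO SIGN (three bytes: E2 82 AC), "b"
data = "a\u20acb".encode("utf-8")

print(char_start(data, 3))  # 1: byte 3 lies inside the euro sign, which starts at byte 1
print(char_start(data, 4))  # 4: "b" starts at its own index
```

Because a reader can resynchronize from any position in this way, there is no need to scan the stream from the beginning to find character boundaries.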

Character-Centric: The Unicode standard encodes characters, that is, the smallest, meaningful components of written language, such as letters, digits, punctuation marks, mathematical symbols, etc. Unicode represents these characters abstractly with code points (numbers). Unicode does not define the visual representation (or glyph) of these characters. So, Unicode deals almost exclusively with the three leftmost components (and the two associated mappings) in the following four-component framework (for details, refer to this post):

bit pattern (one or more code units) $\longleftrightarrow$ code point $\longleftrightarrow$ (abstract) character $\longleftrightarrow$ glyph
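The three leftmost components of this framework can be observed in a few lines of Python. This is only an illustrative sketch: the euro sign is an arbitrary example character, and the three encodings listed are standard Unicode encoding schemes as named by Python's codec registry.

```python
char = "\u20ac"          # abstract character: EURO SIGN
code_point = ord(char)   # its code point, as an integer
label = f"U+{code_point:04X}"
print(label)  # U+20AC

# The same code point maps to different bit patterns (code units)
# depending on the encoding form that is chosen.
bit_patterns = {
    encoding: char.encode(encoding).hex()
    for encoding in ("utf-8", "utf-16-be", "utf-32-be")
}
for encoding, hex_bytes in bit_patterns.items():
    print(encoding, hex_bytes)
# utf-8 e282ac
# utf-16-be 20ac
# utf-32-be 000020ac
```

The glyph, the fourth component, does not appear anywhere in this code: it is chosen by the font and the rendering software that eventually displays the character.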

Semantically Explicit: In Unicode, all characters have well-defined semantics (meanings) that are defined through specific properties, rather than implied by the character's name or its position in a table. The Unicode standard identifies more than 100 distinct character properties, including numeric, casing, combination, and directionality properties. The list of available properties, while finite and well circumscribed, can still grow as needed (within limits).
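A handful of these properties are exposed by Python's standard `unicodedata` module, which gives a feel for how semantics are attached to characters. The sketch below is not an exhaustive view of the property list; the `describe` helper and the sample characters are assumptions made for illustration.

```python
import unicodedata


def describe(char: str) -> dict:
    """Collect a few Unicode character properties for a single character."""
    return {
        "name": unicodedata.name(char),
        "category": unicodedata.category(char),            # general category
        "numeric": unicodedata.numeric(char, None),        # numeric value, if any
        "combining": unicodedata.combining(char),          # canonical combining class
        "bidirectional": unicodedata.bidirectional(char),  # directionality class
        "uppercase": char.upper(),                         # casing
    }


for sample in ("a", "\u00bd", "\u0301", "\u05d0"):
    print(describe(sample))
```

For instance, U+00BD (VULGAR FRACTION ONE HALF) reports the numeric value 0.5, U+0301 (COMBINING ACUTE ACCENT) reports a non-zero combining class, and U+05D0 (HEBREW LETTER ALEF) reports the right-to-left class `R`.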

Exclusively Plain-Text: Unicode characters denote plain text exclusively, that is, the content or meaning of each character, not its appearance. According to the standard: "Plain text must contain enough information to permit the text to be rendered legibly, and nothing more." Of course, stylistic information might be included as plain text characters within the character stream (e.g., as tags or other markup language entities), but the interpretation/rendering of this style information is out of the scope of Unicode (i.e., it is left to processes that operate outside of the encoding/decoding processes, such as font rendering).

Logically Ordered: Whenever possible, the order in which Unicode text is stored in memory corresponds to the order in which the text is typed in (e.g., via keyboard input) or pronounced (phonetic order). For numbers, the most significant digit always comes first, even when the numbers are displayed in different directions.
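As an illustration of logical order, consider a word written in a right-to-left script. The sketch below uses the Hebrew word "shalom" as an arbitrary example: the first letter typed and pronounced (shin) sits at index 0 in memory, even though it is displayed at the right edge of the word.

```python
import unicodedata

# "shalom": shin, lamed, vav, final mem, stored in typing (logical) order
word = "\u05e9\u05dc\u05d5\u05dd"

for index, char in enumerate(word):
    # Each letter carries the directionality class "R" (right-to-left);
    # the display order is derived from this property at rendering time.
    print(index, unicodedata.name(char), unicodedata.bidirectional(char))

first_letter = unicodedata.name(word[0])
print(first_letter)  # HEBREW LETTER SHIN
```

The reordering needed for display is the job of the rendering process, not of the stored character sequence.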

Unified: The Unicode standard eliminates duplication by encoding letters, punctuation marks, etc., only once when they are shared by two or more languages.

Dynamically Composable: Complex characters (e.g., accented ones) can be obtained by the composition (superposition) of two or more simple characters, including accent-only characters. Furthermore, it is always possible to create new compositions out of existing characters. However, a complex character may also be pre-composed, that is, mapped to exactly one code point that encodes the whole character without composition. This principle is thus partially in conflict with the previous one, since some characters are encoded by two or more (so-called equivalent) sequences of Unicode characters.
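The two equivalent encodings of an accented character can be compared in Python. The sketch below uses the letter é as an example and relies on the normalization forms NFC and NFD, a related Unicode mechanism that is not covered in this post, to convert between the pre-composed and the composed sequences.

```python
import unicodedata

precomposed = "\u00e9"  # one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"  # two code points: "e" + COMBINING ACUTE ACCENT

# The two sequences denote the same character but differ code point by code point.
print(len(precomposed), len(decomposed))  # 1 2
print(precomposed == decomposed)          # False

# Normalization maps equivalent sequences to a single canonical form.
composed_again = unicodedata.normalize("NFC", decomposed)
decomposed_again = unicodedata.normalize("NFD", precomposed)
print(composed_again == precomposed)    # True
print(decomposed_again == decomposed)   # True
```

A practical consequence is that programs comparing or searching text usually need to normalize both sides first, since a naive code point comparison treats equivalent sequences as different.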

Stable: Once a character is assigned a code point, that association cannot be modified. Furthermore, characters cannot be removed from the standard, nor can their names (textual descriptions) be modified (even when they contain a typo!). Similarly, important properties (such as the decompositions discussed under the previous principle) are immutable.
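A well-known instance of a name that kept its typo is U+FE18, whose official name spells "BRACKET" as "BRAKCET". The sketch below looks the name up with Python's `unicodedata` module; the choice of this particular character is an example added here, not one taken from the text above.

```python
import unicodedata

char = "\ufe18"
official_name = unicodedata.name(char)
print(official_name)
# PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET

# The misspelled name is the permanent identifier: looking it up
# returns the same character.
round_trip = unicodedata.lookup(official_name)
print(round_trip == char)  # True
```

Software that stores or matches character names can therefore rely on them never changing, typos included.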

Convertible: The Unicode standard is designed to make sure that character identity is always preserved when converting between Unicode and other widely used or accepted standards.
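One way to see convertibility at work is a round trip through a legacy encoding. The sketch below uses ISO/IEC 8859-1 (Latin-1) as an example of a widely used standard, through Python's `latin-1` codec, which maps each of the 256 byte values to the code point with the same number. Other legacy encodings may need lookup tables rather than this direct mapping.

```python
# Every possible byte value of the legacy encoding.
legacy_bytes = bytes(range(256))

# Legacy -> Unicode: each byte becomes one character.
text = legacy_bytes.decode("latin-1")
code_points = [ord(char) for char in text]

# Unicode -> legacy: the original bytes come back unchanged.
round_trip = text.encode("latin-1")

print(round_trip == legacy_bytes)       # True
print(code_points == list(range(256)))  # True
```

Because no information is lost in either direction, data can move between the legacy encoding and Unicode without corrupting character identity.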

Upcoming posts will dive into the technical details of the Unicode standard.

This blog post is based on Chapter 1, Section 2.1, and Section 2.2 of The Unicode® Standard Version 8.0 – Core Specification (August 2015).

Friday, May 20, 2016

Unicode Tutorial - Part 1: Overview
