Friday, May 27, 2016

Unicode Tutorial - Part 4: UTF-32, with Java code samples

In Part 3 of this tutorial, we distinguished Unicode encoding forms (namely, UTF-8, UTF-16, and UTF-32) from Unicode encoding schemes (namely, UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE).

This post deals exclusively with UTF-32, both the encoding form and the encoding scheme.

Recall that the UTF-32 encoding form is fixed-width because it maps each code point in the Unicode coded character set to a single 32-bit code unit. Since the Unicode code space contains 1,114,112 code points (17 planes of 65,536 code points each), each code point is a number between 0x000000 and 0x10FFFF and thus fits within 21 bits.

Since a single 32-bit integer can represent a code point with 11 bits to spare, UTF-32 is the simplest of all Unicode encoding forms: it encodes each code point value as a 32-bit integer with the 11 leading bits set to 0. For example, the code point 97 (in decimal), that is, 0x61, which is the character 'a', is represented by the code unit:

00000000 00000000 00000000 01100001
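
As a quick sanity check, here is a minimal sketch of my own (not part of the programs discussed below) that reproduces this code unit by printing the 32-bit binary representation of the code point:

class UTF32CodeUnit {
    public static void main(String[] args) {
        int codePoint = 0x61; // 'a', i.e., 97 in decimal
        // In UTF-32, the code unit is simply the code point value
        // zero-extended to 32 bits
        String bits = String.format("%32s", Integer.toBinaryString(codePoint))
                            .replace(' ', '0');
        System.out.println(bits); // 00000000000000000000000001100001
    }
}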
                                                         
Note that any UTF-32 code unit whose value is greater than 0x0010FFFF is ill-formed (too large). Furthermore, because surrogate values used in UTF-16 (to be discussed in the next part of this tutorial) are not by themselves valid code points, UTF-32 code units in the range 0x00D800 through 0x00DFFF are also ill-formed.
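
To make this concrete, here is a small sketch (my own illustration, not code from the Unicode standard) of a well-formedness check for a single UTF-32 code unit:

// Returns true if the 32-bit value is a well-formed UTF-32 code unit:
// it must lie in the code space (at most 0x10FFFF) and must not fall
// in the surrogate range 0xD800 through 0xDFFF reserved by UTF-16
static boolean isWellFormedUTF32(int codeUnit) {
    if (codeUnit < 0 || codeUnit > 0x10FFFF) return false; // too large (or negative as a Java int)
    if (codeUnit >= 0xD800 && codeUnit <= 0xDFFF) return false; // surrogate value
    return true;
}

Note that the standard library's Character.isValidCodePoint(int) performs the range check but does not exclude surrogates, so the second test is still needed.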

The rest of this post will use the Java programming language to manipulate character encodings in general and UTF-32 more specifically.

The following Java program uses the availableCharsets() and defaultCharset() static methods of the Charset class to list all of the available character encodings as well as the default charset on the machine on which it is executed.

Note that the former method returns a SortedMap data structure. In the program's for loop, I iterate through the key-value pairs in this structure to print the canonical name of each Charset (the key of the pair) followed by the list of aliases of that Charset (the value of the pair).

import java.nio.charset.Charset;
import java.util.Map;
import java.util.SortedMap;

class CharacterEncodings {
    public static void main(String[] args) {
        Charset defaultCharset = Charset.defaultCharset();
        String defaultCharsetName = defaultCharset.displayName();
        SortedMap<String,Charset> availableCharsets = Charset.availableCharsets();
        System.out.println("Your JVM supports the following " + 
                           availableCharsets.size() +
                           " character sets,\nwith the default" +
                           " one marked by asterisks:\n");
        for(Map.Entry<String,Charset> entry : availableCharsets.entrySet()) {
            if (defaultCharsetName.equalsIgnoreCase(entry.getValue().displayName())) {
                System.out.print(" ( * * * * * * * * * * * * * * * * * * * * ) ");
            }
            System.out.println(entry.getKey() + " " + entry.getValue().aliases());
        }
    }// main method   
}// CharacterEncodings class

The (truncated) output of this program is shown below. While the UTF-32 encoding is available on my system, it is not the one used by default (e.g., when writing text files). Instead, UTF-8 is the default, which is not surprising since it is a common encoding for text files, for HTML files, etc. In fact, UTF-8 is the dominant encoding on the web.

Your JVM supports the following 169 character sets,
with the default one marked by asterisks:

Big5 [csBig5]
...
TIS-620 [tis620.2533, tis620]
US-ASCII [cp367, ascii7, ISO646-US, 646, csASCII, us, iso_646.irv:1983, 
          ISO_646.irv:1991, IBM367, ASCII, default, ANSI_X3.4-1986, 
          ANSI_X3.4-1968, iso-ir-6]
UTF-16 [utf16, UnicodeBig, UTF_16, unicode]
UTF-16BE [X-UTF-16BE, UTF_16BE, ISO-10646-UCS-2, UnicodeBigUnmarked]
UTF-16LE [UnicodeLittleUnmarked, UTF_16LE, X-UTF-16LE]
UTF-32 [UTF32, UTF_32]
UTF-32BE [X-UTF-32BE, UTF_32BE]
UTF-32LE [X-UTF-32LE, UTF_32LE]
 ( * * * * * * * * * * * * * * * * * * * * ) UTF-8 [unicode-1-1-utf-8, UTF8]
windows-1250 [cp1250, cp5346]
...
x-windows-iso2022jp [windows-iso2022jp]


Note that the output of this program will most likely be different on your machine. For example, the default charset (indicated by asterisks above) is determined during the startup phase of the JVM and typically depends on the locale and the charset of your operating system.
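
If you want to inspect this programmatically, you can add the following lines to the program above (a minimal sketch; the exact behavior is JVM- and platform-dependent):

System.out.println(Charset.defaultCharset());            // e.g., UTF-8
System.out.println(System.getProperty("file.encoding")); // typically matches the default charset

On many JVMs, the default can also be overridden at startup with the file.encoding system property, for example: java -Dfile.encoding=US-ASCII CharacterEncodings.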

In the following program, I start with an array of 60 bytes. This array contains the Unicode code points of 15 characters (see the comment at the end of each line), with 4 bytes per code point. The main program then converts (or "decodes") this byte array into a string of 15 Unicode characters.

First, I create a Charset instance for the UTF-32 encoding using the static forName() method. Then I invoke the decode() instance method, which takes in a ByteBuffer (that simply wraps the byte[] object) and returns a CharBuffer containing 16 Java chars (the fork-and-knife emoji requires two of them, as explained below). This buffer is then converted to a Java string using its toString() method and finally printed, producing the following output:


Part 1:
©Crêpes à gogo🍴


import java.nio.charset.Charset;
import java.nio.CharBuffer;
import java.nio.ByteBuffer;

class UTF32 {

    static byte[] bytes  = { 
      (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0xA9, // copyright
      (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x43, // C
      (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x72, // r
      (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0xEA, // e with ^
      (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x70, // p
      (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x65, // e
      (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x73, // s
      (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x20, // " "
      (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0xE0, // a with `
      (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x20, // " " 
      (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x67, // g
      (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x6F, // o
      (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x67, // g
      (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x6F, // o
      (byte) 0x00, (byte) 0x01, (byte) 0xF3, (byte) 0x74  // knife & fork
    };

    // this method (for fun) uses a classic trick to swap two array 
    // elements without using a temporary variable; note that it only 
    // works if i != j 
    static void swapByteArrayElements(byte[] a, int i, int j) {
        a[i] = (byte) (a[i] + a[j]);
        a[j] = (byte) (a[i] - a[j]);
        a[i] = (byte) (a[i] - a[j]);
    }// swapByteArrayElements method

    public static void main(String[] args) throws Exception {
        Charset utf32charset = Charset.forName("UTF-32");

        // part 1: decode the BOM-less byte array (big-endian by default)
        String str = 
             utf32charset.decode(ByteBuffer.wrap(bytes)).toString();
        System.out.println("Part 1:\n" + str);

        // part 2: replace the first code unit with a big-endian BOM (0x0000FEFF)
        bytes[2] = (byte) 0xFE;
        bytes[3] = (byte) 0xFF;
        str = utf32charset.decode(ByteBuffer.wrap(bytes)).toString();
        System.out.println("Part 2:\n" + str);
 
        // part 3: replace the first code unit with a little-endian BOM (0xFF 0xFE 0x00 0x00)
        bytes[0] = (byte) 0xFF;
        bytes[1] = (byte) 0xFE;
        bytes[2] = (byte) 0x00;
        bytes[3] = (byte) 0x00;
        str = utf32charset.decode(ByteBuffer.wrap(bytes)).toString();
        System.out.println("Part 3:\n" + str);
 
        // part 4: reorder each remaining code unit's bytes into little-endian order
        for(int i=4; i<bytes.length-3; i+=4) {
            swapByteArrayElements(bytes,i,i+3);
            swapByteArrayElements(bytes,i+1,i+2);
        }
        str = utf32charset.decode(ByteBuffer.wrap(bytes)).toString();
        System.out.println("Part 4:\n" + str);
    }// main method   
}// UTF32 class

Before discussing the rest of this program, it is important to note that Java internally encodes each Character/char as a UTF-16 code unit (UTF-16 will be discussed in the next part of this tutorial) and each String object as an array of such characters.

Therefore, UTF-16 is the "native" representation of strings in Java. For this reason, when reading text in other representations, Java has to "decode" these external strings that use another encoding into its internal representation.
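
To see this internal representation at work, consider the following small sketch (my own illustration): a character outside the Basic Multilingual Plane, such as the fork-and-knife emoji from the byte array above, occupies two Java chars (a so-called surrogate pair) even though it is a single code point:

String s = "\uD83C\uDF74";  // U+1F374 (fork and knife) as a surrogate pair
System.out.println(s.length());                      // 2 (UTF-16 code units, i.e., Java chars)
System.out.println(s.codePointCount(0, s.length())); // 1 (Unicode code point)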

Conversely, when converting a Java string to another encoding, Java must "encode" the string using an external encoding.
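
As a quick sketch of this opposite direction (using the encode() method of the same Charset class; note that, on my JVM, the plain "UTF-32" encoder produces big-endian output without a BOM, which you may want to verify on yours):

Charset utf32charset = Charset.forName("UTF-32");
ByteBuffer buffer = utf32charset.encode("aé");       // String (UTF-16) -> UTF-32 bytes
while (buffer.hasRemaining()) {
    System.out.printf("%02X ", buffer.get() & 0xFF); // 00 00 00 61 00 00 00 E9
}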

In our case, the external encoding is UTF-32 because I represented the "external" or "non-native" string as a byte array in which four consecutive bytes encode the code point of a single Unicode character.

Back in Part 1, utf32charset.decode(ByteBuffer.wrap(bytes)) converts the ByteBuffer (passed as argument to the decode method) into a CharBuffer (returned by the decode method) using the UTF-32 character encoding that the utf32charset object represents.

This is where the conversion or change of encoding takes place. Again, the returned CharBuffer object contains a (native) array of Java chars, each of which is a UTF-16 code unit. This array can now be converted to a string (with a call to the toString() method) for printing purposes.

This concludes our discussion of Part 1 of this program.

Before we discuss the last three parts of this program, recall that strings in the UTF-32 encoding scheme may or may not begin with a "byte order mark" or BOM. When there is no BOM at the beginning of the input ByteBuffer as in Part 1 above, the byte order of the UTF-32 encoding scheme is assumed to be big-endian by default, according to which the most significant byte of each four-byte integer appears first.

In other words, the four bytes 0x00 0x00 0x00 0xA9 represent the integer 0x000000A9. With the little-endian scheme, these four bytes would represent the integer 0xA9000000 instead.
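
You can verify this interpretation with ByteBuffer itself, whose getInt() method reads four bytes as a 32-bit integer (a small sketch; ByteBuffer defaults to big-endian order):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class ByteOrderDemo {
    public static void main(String[] args) {
        byte[] quad = { 0x00, 0x00, 0x00, (byte) 0xA9 };
        int big = ByteBuffer.wrap(quad).getInt();      // big-endian (default): 0x000000A9
        int little = ByteBuffer.wrap(quad)
                               .order(ByteOrder.LITTLE_ENDIAN)
                               .getInt();              // little-endian: 0xA9000000
        System.out.printf("%08X %08X%n", big, little); // prints: 000000A9 A9000000
    }
}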

Part 1 of the program above relies on this default (big-endian) byte ordering. In Part 2, I replace the first four bytes of the byte array with the values:

0x00 0x00 0xFE 0xFF

These four bytes encode U+FEFF, the byte order mark: since the decoder reads them as 0x0000FEFF (rather than the invalid 0xFFFE0000), it knows the input is big-endian. Since this character replaces the first character, the copyright symbol is erased. Furthermore, since the BOM, when present, is not part of the string itself (it is metadata used to describe the string, namely its byte order), the output for Part 2 is as follows:


Part 2:
Crêpes à gogo🍴

In this output, the characters are displayed correctly because the BOM explicitly states that the byte order is big-endian and the following 4-byte integers are indeed listed in big-endian order.

In Part 3, I first replace the first four bytes with the values:

0xFF 0xFE 0x00 0x00

Note that these values are the same bytes as the BOM above but in reverse order; they are the byte order mark for the little-endian scheme. This means that these first four bytes will not be part of the decoded string. However, their presence forces the decoder to interpret all of the following groups of 4 bytes in little-endian order.

For example, the four bytes of the first actual character in the byte array, namely:

0x00 0x00 0x00 0x43

representing the character 'C', will now be interpreted in little-endian order, as if they had been written as:

0x43 0x00 0x00 0x00

in big-endian order. The resulting integer, 0x43000000, is not a valid code point, since the largest possible code point is 0x0010FFFF. The same happens for all 14 characters in the byte array (not counting the initial BOM), yielding the following garbage output:


Part 3:
�

in which the � character (U+FFFD, the Unicode replacement character) is substituted, by default, for ill-formed input.
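
This replacement behavior can be changed. If you prefer an exception over silent substitution, the CharsetDecoder API lets you configure the error action explicitly (a small sketch under that assumption):

import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

class StrictDecoding {
    public static void main(String[] args) throws Exception {
        CharsetDecoder strict = Charset.forName("UTF-32")
                                       .newDecoder()
                                       .onMalformedInput(CodingErrorAction.REPORT);
        byte[] bad = { 0x43, 0x00, 0x00, 0x00 };  // 0x43000000 > 0x10FFFF: ill-formed
        // With REPORT, decode() throws a MalformedInputException instead of
        // substituting the replacement character
        strict.decode(ByteBuffer.wrap(bad));
    }
}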

To fix this, in Part 4, I keep the little-endian byte order mark but reorder the following bytes in groups of 4 (using the for loop and the swapByteArrayElements() method). After this code is executed, the byte array looks like this:

(byte) 0xFF, (byte) 0xFE, (byte) 0x00, (byte) 0x00, // used to be copyright
(byte) 0x43, (byte) 0x00, (byte) 0x00, (byte) 0x00, // used to be C
(byte) 0x72, (byte) 0x00, (byte) 0x00, (byte) 0x00, // used to be r
(byte) 0xEA, (byte) 0x00, (byte) 0x00, (byte) 0x00, // used to be e with ^
etc.

With this preliminary swap completed, the last output produced by this program is thus:


Part 4:
Crêpes à gogo🍴

Note that the copyright symbol has been replaced by the BOM (which is not part of the string and thus not displayed) and that the following characters are correct because now the BOM explicitly states that the byte order is little-endian and the following 4-byte integers are indeed listed in little-endian order.

To summarize:
  • Part 1 represents the "external" string (that is, the byte array) encoded in UTF-32 with no BOM and thus a big-endian default byte order.
  • Part 2 represents the "external" string (that is, the byte array) encoded in UTF-32 with a BOM that explicitly imposes the big-endian byte order on the rest of the bytes.
  • Part 3 represents the "external" string (that is, the byte array) encoded in UTF-32 with a BOM that explicitly imposes the little-endian byte order on the rest of the bytes but with the following bytes listed in big-endian order.
  • Part 4 represents the "external" string (that is, the byte array) encoded in UTF-32 with a BOM that explicitly imposes the little-endian byte order on the rest of the bytes, which is the correct order.
Therefore, only Part 3 of the code produces unwanted output.
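
Finally, note that a BOM is not the only way to pin down the byte order: the UTF-32BE and UTF-32LE encoding schemes listed earlier fix it by name. As a small sketch:

import java.nio.ByteBuffer;
import java.nio.charset.Charset;

class ExplicitByteOrder {
    public static void main(String[] args) {
        // 'C' as a little-endian 32-bit code unit; no BOM is needed
        byte[] le = { 0x43, 0x00, 0x00, 0x00 };
        System.out.println(Charset.forName("UTF-32LE")
                                  .decode(ByteBuffer.wrap(le))); // prints: C
    }
}

With these explicit schemes, a leading U+FEFF is not consumed as a BOM; it is decoded as an ordinary zero-width no-break space and becomes part of the string.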
