
Issue 209: Unicode Redux (1 of 2)

Author: Dr. Wolfgang Laun
Date: 2013-04-11
Java Version: 5
Category: Tips and Tricks
 

Abstract: Unicode is the most important computing industry standard for representation and handling of text, no matter which of the world's writing systems is used. This newsletter discusses some selected features of Unicode, and how they might be dealt with in Java.

 

Welcome to the 209th issue of The Java(tm) Specialists' Newsletter, written in Vienna by our friend Dr. Wolfgang Laun, who works for Thales Austria GmbH. I went to the CONFESS Java conference in Vienna last week together with my 11-year-old daughter as a travel companion. Imagine our surprise when we arrived on the 3rd of April to a snow-covered Vienna! Wolfgang had warned me that it was c-o-l-d, but I thought he was exaggerating. Our first port of call was to buy some trekking boots for my girl to keep her feet warm. We are not used to sub-zero temperatures in Crete. On Thursday morning, Wolfgang took us on an interesting walk around Vienna, showing us the original area of the Roman fort, then the architecture of the various eras - Romanesque, Gothic, baroque, etc. We learned a lot about Vienna and saw things that most tourists would miss, such as stones jutting out in narrow passages to keep the wagon wheels away from the wall.

Internationalization is tricky, due to the inexact nature of human communication. I'll never forget the time I tried to transfer £502.46 to my printers. Unfortunately my European banking system discarded the "." and initiated a transfer of £50246! Here in Europe they use the comma as a decimal point and the dot as a thousand separator. Fortunately I was able to cancel the transfer before it had a chance to go very far. Similarly, exchanging text can be surprisingly tricky. Thanks Wolfgang for taking your time to write this article on Unicode for us. I certainly learned a lot from it.

Administrative: We have moved over to a new mailing list, powered by Infusionsoft. Most links in my newsletters will from now on start with "https://iw127.infusionsoft.com/". Don't be alarmed.


Unicode Redux (1 of 2)

The first Unicode standard was published in 1991, shortly after the Java project was started. A 16-bit design was considered sufficient to encompass the characters of all the world's living languages. Unicode 2.0, which no longer restricted codepoints to 16 bits, appeared in 1996, but Java's first release had emerged the year before. Java had to follow suit, but char remained a 16-bit type. This article reviews several topics related to character and string handling in Java.

Quiz

  1. What is the minimum number of keys on a keyboard you need for typing any Java program?
  2. How many lines does this statement print? System.out.println("To be or not to be\u000Athat is here the question");
  3. How can you represent your system's line terminator within a string literal using Unicode escapes (\uHHHH)?
  4. How many different identifiers of length two can you use in a Java program?
  5. Given that Character.MIN_VALUE is 0 and Character.MAX_VALUE is 65535, how many different Unicode characters can be represented by a char variable?
  6. How can you obtain the 5th Unicode code point from a (sufficiently long) String value?
  7. Given that String s has length 1, is the result of s.toUpperCase() always the same as String.valueOf(Character.toUpperCase(s.charAt(0)))?

Introduction

The use of "characters" in Java isn't quite as simple as the type char might suggest; several misconceptions prevail. Notice that the word "character" goes back to the Greek word "χαράζω" (i.e., to scratch, engrave), which may be the reason why so many scratch their heads over the resulting intricacies.

Several issues need to be covered, ranging from the representation of Java programs to the implementation of the data types char and java.lang.String, and the handling of character data during input and output.

"[Java] programs are written using the Unicode character set." (Language specification, § 3.1) This simple statement is followed by some small print, explaining that each Java SE platform relates to one of the evolving Unicode specifications, with SE 5.0 being based on Unicode 4.0. In contrast to earlier character set definitions, Unicode distinguishes between two things: on the one hand, the association of characters as abstract concepts (e.g., "Greek capital letter omega Ω") with a subset of the natural numbers, called code points; on the other hand, the representation of code points by values stored in units of a computer's memory. The Unicode standard defines seven of these character encoding schemes.

It would all be (relatively) simple if Unicode were the only standard in effect. Other character sets are in use, typically infested with vendor specific technicalities, and character data is bandied about without much consideration about what a sequence of storage units is intended to represent.

Another source of confusion arises from the limitations of our hardware. While high-resolution monitors let you represent any character in a wide range of glyphs with variations in font, style, size and colour, our keyboards are limited to a relatively small set of characters. This has given rise to the workaround of escape sequences, i.e., a convention by which a character can be represented by a sequence of keys.

Writing Java Programs

A Java program needs to be stored as a "text file" on your computer's file system, but this doesn't mean much except that there is a convention for representing line ends, and even this is cursed by the famous differences between all major OS families. The Java Language Specification is not concerned with the way this text is encoded, even though it says that lexical processing expects this text to contain Unicode characters. That's why a Java compiler features the standard option -encoding followed by the name of an encoding. As long as your program contains nothing but the 26 letters, the 10 digits, white space and the special characters for separators and operators, you may not have to worry much about encoding, provided that the Java compiler is set to accept your system's default encoding and the IDE or editor plays along. Check https://docs.oracle.com/javase/7/docs/technotes/guides/intl/encoding.doc.html for a list of supported encodings.

Several encodings map the aforementioned set of essential characters uniformly to the same set of code units of some 8-bit code. The character 'A', for instance, is encoded as 0x41 in US-ASCII, UTF-8 and in any of the codes ISO-8859-1 through ISO-8859-15, or windows-1250 through windows-1258. If you need to represent a Unicode code point beyond 0x7F you can evade all possible misinterpretations by supplying the character in the Unicode escape form defined by the Java language specification: characters '\' and 'u' must be followed by exactly four hexadecimal digits. Using this, the French version of "Hello world!" can be written as

package eu.javaspecialists.tjsn.examples.issue209;

public class AlloMonde {
  public static void main(String[] args) {
    System.out.println("All\u00F4 monde!");
  }
}
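The remark about encodings agreeing on the essential characters but diverging beyond 0x7F can be checked with a small sketch. The class name EncodingDemo and the helper hex are made up for this illustration; the charset constants come from java.nio.charset.StandardCharsets, which requires Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDemo {
  // Renders the bytes of s in the given charset as space-separated hex.
  static String hex(String s, Charset charset) {
    StringBuilder sb = new StringBuilder();
    for (byte b : s.getBytes(charset)) {
      if (sb.length() > 0) sb.append(' ');
      sb.append(String.format("%02X", b & 0xFF));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    Charset[] charsets = {
      StandardCharsets.US_ASCII, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8
    };
    for (Charset cs : charsets) {
      // The second string is the o with circumflex from the French greeting.
      System.out.println(cs + ": A -> " + hex("A", cs)
          + ", o-circumflex -> " + hex("\u00F4", cs));
    }
  }
}
```

The letter 'A' comes out as 41 in all three charsets, whereas the accented letter is a single byte in ISO-8859-1, two bytes in UTF-8, and not representable at all in US-ASCII, where getBytes substitutes a question mark.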

Since absolutely any character can be represented by a Unicode escape, you might write this very same program using nothing but Unicode escapes, as shown below, with line breaks added for readability:

\u0070\u0075\u0062\u006c\u0069\u0063\u0020\u0063\u006c\u0061\u0073
\u0073\u0020\u0041\u006c\u006c\u006f\u004d\u006f\u006e\u0064\u0065
\u0020\u007b\u000a\u0020\u0020\u0020\u0020\u0070\u0075\u0062\u006c
\u0069\u0063\u0020\u0073\u0074\u0061\u0074\u0069\u0063\u0020\u0076
\u006f\u0069\u0064\u0020\u006d\u0061\u0069\u006e\u0028\u0020\u0053
\u0074\u0072\u0069\u006e\u0067\u005b\u005d\u0020\u0061\u0072\u0067
\u0073\u0020\u0029\u007b\u000a\u0009\u0053\u0079\u0073\u0074\u0065
\u006d\u002e\u006f\u0075\u0074\u002e\u0070\u0072\u0069\u006e\u0074
\u006c\u006e\u0028\u0020\u0022\u0041\u006c\u006c\u00f4\u0020\u006d
\u006f\u006e\u0064\u0065\u0021\u0022\u0020\u0029\u003b\u000a\u0020
\u0020\u0020\u0020\u007d\u000a\u007d\u000a

So, the minimum number of keys you need for such an exercise is 18: the 16 hexadecimal digits plus '\' and 'u'. (On some keyboards you may need the shift key for '\'.)

The preceding tour de force contains several instances of the escape \u000a, which represents the line feed control character - the line separator for Unices. By definition, the Java compiler converts all escapes to Unicode characters before it combines them into a sequence of tokens to be parsed according to the grammar. Most of the time you don't have to worry much about this, but there's a notable exception: using \u000A or \u000D in a character literal or a string literal is not going to create one of these characters as a character value - it indicates a line end to the lexical parser, which is a violation of the rule that neither carriage return nor line feed may occur as themselves within a literal. These are the places where you have to use one of the escape sequences \n and \r. Heinz wrote about this almost 11 years ago in newsletter 50.

Attentive readers might now want to challenge my claim that all Java programs can be written using only 18 keys, which did not include 'n' and 'r'. But there are two ways to make do with these 18 characters. The first one uses an octal escape, i.e., \12 or \15. The other one is the long-winded representation of the two characters of the escape sequence by their Unicode escapes: \u005C\u006E and \u005C\u0072.
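The following sketch puts the three spellings of a line feed side by side. The class and constant names are invented for this example. The comments deliberately avoid spelling out a Unicode escape, because the compiler translates escapes inside comments as well.

```java
class LineFeedEscapes {
  // The ordinary escape sequence, as recommended for literals.
  static final String ESCAPE = "\n";

  // Octal escape for code point 10, the line feed.
  static final String OCTAL = "\12";

  // Unicode escapes for the backslash and the letter n. The compiler
  // translates them first and then sees the ordinary escape sequence.
  static final String UNICODE = "\u005C\u006E";

  public static void main(String[] args) {
    System.out.println(ESCAPE.equals(OCTAL));
    System.out.println(ESCAPE.equals(UNICODE));
    System.out.println("To be or not to be" + UNICODE + "that is here the question");
  }
}
```

All three constants should hold the same one-character string, so the last statement prints two lines. Putting the line feed's own Unicode escape directly into the literal, as in quiz question 2, would not compile.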

Another fancy feature of Java is based on the rule that identifiers may contain any character that is a "Java letter" or a "Java letter-or-digit". The language specification (cf. § 3.8) enumerates neither set explicitly; it delegates the decision to the java.lang.Character methods isJavaIdentifierStart and isJavaIdentifierPart, respectively. This lets you create an unbelievable number of identifiers even as short as only two characters. Investigating all char values yields 45951 and 46908 qualifying values respectively (these counts depend on the Unicode version supported by the Java release in use), and this would produce 2,155,469,506 identifiers of length two! (This figure already has two subtracted for the two keywords of length two, of course: do and if.)
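The count can be reproduced with a loop over all char values, sketched below. The class and method names are hypothetical. Since the result depends on the Unicode version behind the running Java release, the numbers printed on a current JDK will probably differ from the ones quoted above.

```java
class IdentifierCount {
  // Counts the char values accepted as the first character of an identifier.
  static int countStart() {
    int n = 0;
    for (int c = Character.MIN_VALUE; c <= Character.MAX_VALUE; c++) {
      if (Character.isJavaIdentifierStart((char) c)) n++;
    }
    return n;
  }

  // Counts the char values accepted after the first character.
  static int countPart() {
    int n = 0;
    for (int c = Character.MIN_VALUE; c <= Character.MAX_VALUE; c++) {
      if (Character.isJavaIdentifierPart((char) c)) n++;
    }
    return n;
  }

  public static void main(String[] args) {
    long start = countStart();
    long part = countPart();
    // Subtract the two keywords of length two: do and if.
    System.out.println(start + " * " + part + " - 2 = " + (start * part - 2));
  }
}
```

The loop variable is an int rather than a char, because a char counter would wrap around after 65535 and never terminate the loop.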

The decisions about which characters may start or be part of a Java identifier exhibit a good measure of laissez-faire. Along with the dollar sign you can use any other currency sign there is. (Isn't ¢lass a nice alternative to the ugly clazz?) More remarkable is the possibility of starting an identifier with characters that are classified as numeric, e.g., Ⅸ, the Roman numeral nine, a single character, is a valid identifier. Most astonishing is the option to use most control characters as part of an identifier, all the more so because they don't have printable representations at all. Here is one example, with a backspace following the initial letter 'A': A\u0008. Given a suitable editor, you can create a source file where the backspace is represented as a single byte, with the expected effect when the file is displayed on standard output:

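public class FancyName {
  public static void main( String[] args ){
    // The identifier is 'A' followed by a backspace, written here as a
    // Unicode escape. With a raw backspace byte in the file, a terminal
    // displays the declaration as: String  = "backspace";
    String A\u0008 = "backspace";
    System.out.println( A\u0008 );
  }
}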

Character Values

We may now try to answer the question how many character values can be stored in a variable of type char, which actually is an integral type. The extreme values Character.MIN_VALUE and Character.MAX_VALUE are 0 and 65535, respectively. These 65536 numeric values would be open to any interpretation, but the Java language specification says that these values are UTF-16 code units, values that are used in the UTF-16 encoding of Unicode texts. Any representation of Unicode must be capable of representing the full range of code points, its upper bound being 0x10FFFF. Thus, code points beyond 0xFFFF need to be represented by pairs of UTF-16 code units, and the values used with these so-called surrogate pairs are exempt from being used as code points themselves. In java.lang.Character we find the static methods isHighSurrogate and isLowSurrogate, simple tests that return true for the ranges '\uD800' through '\uDBFF' and '\uDC00' through '\uDFFF', respectively. Also, by definition, code units 0xFFFF and 0xFFFE do not represent Unicode characters. From this we can deduce that at most 65536 - (0xE000 - 0xD800) - 2 or 63486 Unicode code points can be represented as a char value.
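The arithmetic can be double-checked by letting the surrogate tests do the counting. This is only a sketch with made-up names (CharRange, countSurrogates, upperBound).

```java
class CharRange {
  // Counts the char values that are high or low surrogates.
  static int countSurrogates() {
    int n = 0;
    for (int c = Character.MIN_VALUE; c <= Character.MAX_VALUE; c++) {
      char ch = (char) c;
      if (Character.isHighSurrogate(ch) || Character.isLowSurrogate(ch)) n++;
    }
    return n;
  }

  // Upper bound for the number of code points a single char can represent.
  static int upperBound() {
    int total = Character.MAX_VALUE - Character.MIN_VALUE + 1; // 65536
    return total - countSurrogates() - 2; // minus 0xFFFE and 0xFFFF
  }

  public static void main(String[] args) {
    System.out.println("surrogates:  " + countSurrogates()); // 2048
    System.out.println("upper bound: " + upperBound());      // 63486
  }
}
```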

The actual number of Unicode characters that can be represented in a char variable is certainly lower, simply because there are gaps in and between the blocks set aside for the various alphabets and symbol sets.

It is evident that the full range of Unicode code points can only be stored in a variable of type int. This has not always been so: originally, Java was meant to implement Unicode characters where all code points could be represented by a 16-bit unsigned integer. Since that time, Unicode has outgrown this Basic Multilingual Plane (BMP), so that Java SE 5.0 had to make amends, adding character property methods to java.lang.Character, in parallel to existing ones with a char parameter, where the parameter is an int identifying a code point.
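A short sketch of the int-based methods added in Java SE 5.0 follows. The code point U+1D11E (MUSICAL SYMBOL G CLEF) is chosen here merely as a convenient supplementary character; the class name is hypothetical.

```java
class CodePointDemo {
  // U+1D11E MUSICAL SYMBOL G CLEF, a character outside the BMP.
  static final int G_CLEF = 0x1D11E;

  public static void main(String[] args) {
    // A supplementary code point needs two UTF-16 code units.
    System.out.println(Character.isSupplementaryCodePoint(G_CLEF)); // true
    System.out.println(Character.charCount(G_CLEF));                // 2

    // Conversion to the surrogate pair and back again.
    char[] units = Character.toChars(G_CLEF);
    System.out.printf("%04X %04X%n", (int) units[0], (int) units[1]); // D834 DD1E
    System.out.println(Character.toCodePoint(units[0], units[1]) == G_CLEF); // true

    // Property methods exist in parallel for char and for int code points.
    System.out.println(Character.isLetter('A') + " " + Character.isLetter(0x41));
  }
}
```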

Character Strings

When a character can be encoded with a single 16-bit value, a character string can be simply encoded as an array of characters. But the failure of char to cover all Unicode code points breaks the simplicity of this design. Accessing a string based on the progressive count of code points or Unicode characters isn't possible by mere index calculation any more, because code points are represented by one or two successive code units.

Given that we have a String value where surrogate pairs occur intermingled with individual code units identifying a code point, how do you obtain the number of Unicode characters in this string? How can you obtain the n-th Unicode character off the start of the string?

The answers to both questions are simple because there are String methods providing an out-of-the-box solution. First, the number of Unicode characters in a String is obtained like this:

public static int ucLength(String s) {
  return s.codePointCount(0, s.length());
}

Two method calls are sufficient for implementing the equivalent of method charAt, the first one for obtaining the offset of the n-th Unicode character in terms of code unit offsets, whereupon the second one extracts one or two code units for obtaining the integer code point.

public static int ucCharAt(String s, int index) {
  int iPos = s.offsetByCodePoints(0, index);
  return s.codePointAt(iPos);
}
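To see the two helpers at work, here is a usage sketch that repeats them in a wrapper class (the name CodePointAccess is an assumption) and applies them to a string containing one supplementary character.

```java
class CodePointAccess {
  // Number of Unicode code points in s, as shown above.
  static int ucLength(String s) {
    return s.codePointCount(0, s.length());
  }

  // The code point at the given code point index, as shown above.
  static int ucCharAt(String s, int index) {
    int iPos = s.offsetByCodePoints(0, index);
    return s.codePointAt(iPos);
  }

  public static void main(String[] args) {
    // 'a', then U+1D11E (two code units), then 'b', 'c', 'd'.
    String s = "a" + new String(Character.toChars(0x1D11E)) + "bcd";

    System.out.println(s.length());   // 6 code units
    System.out.println(ucLength(s));  // 5 code points

    // The 2nd code point is the supplementary character.
    System.out.println(Integer.toHexString(ucCharAt(s, 1))); // 1d11e

    // The 5th code point (index 4) is 'd', although s.charAt(4) is 'c'.
    System.out.println((char) ucCharAt(s, 4)); // d
    System.out.println(s.charAt(4));           // c
  }
}
```

Note that offsetByCodePoints has to scan the string from the given start, so repeated calls in a loop cost more than plain charAt indexing.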

A Capital Case

When the world was young, the Romans used to chisel their inscriptions using just 21 letters in a form called Roman square capitals. This very formal form of lettering was not convenient for everyday writing, where a form called cursiva antigua was used, as difficult to read for us now as it must have been then. Plautus, a Roman comedian, wrote about them: "a hen wrote these letters", which may very well be the origin of the term chicken scratch.

Additional letters, diacritics and ligatures morphing into proper letters are nowadays the constituents of the various alphabets used in western languages, and they come in upper case and lower case forms. Capitalization, i.e., the question when to write the initial letter of a word in upper case, is quite an issue in some languages, with German being a hot contender for the first place, with its baffling set of rules. Moreover, writing headings or emphasized words in all upper case is in widespread use.

As an aside, note that the custom of capitalizing words (as used in English texts) may have subtle pitfalls. (Compare, for instance, "March with a Pole" to "march with a pole", with two more possible forms.)

Java comes with the String methods toUpperCase and toLowerCase. Programmers might expect these methods to produce strings of equal length, and one to be the inverse of the other when initially applied to an all upper or lower case word. But this is not true. One famous case is the German lower case letter 'ß' ("sharp s"), which (officially) doesn't have an upper case form (yet). Executing these statements

Locale de_DE = new Locale( "de", "DE" );
String wort = "Straße";
System.out.println( "wort = " + wort );
String WORT = wort.toUpperCase( de_DE );
System.out.println( "WORT = " + WORT );

produces

wort = Straße
WORT = STRASSE

which is correct. Clearly, Character.toUpperCase(char) cannot work this small miracle. (The ugly combination STRAßE should be avoided.) More fun is to be expected in the near (?) future, when the LATIN CAPITAL LETTER SHARP S (U+1E9E) that was added to Unicode in 2008 will be adopted by trendy typesetters (or typing trendsetters), like this: STRAẞE.
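The difference between the String method and the char method (quiz question 7) can be made visible with a string of length one. The sketch below uses invented names (SharpS, viaString, viaChar) and writes the sharp s as a Unicode escape so that the source file's encoding does not matter.

```java
import java.util.Locale;

class SharpS {
  // Upper-cases the whole string, which may change its length.
  static String viaString(String s) {
    return s.toUpperCase(Locale.GERMANY);
  }

  // Upper-cases the first char only; the result is always one char.
  static String viaChar(String s) {
    return String.valueOf(Character.toUpperCase(s.charAt(0)));
  }

  public static void main(String[] args) {
    String s = "\u00DF"; // the German sharp s
    System.out.println(viaString(s));          // SS
    System.out.println(viaString(s).length()); // 2
    System.out.println(viaChar(s).equals(s));  // true: the char stays as it is
  }
}
```

So the answer to the quiz question is no: a one-to-many case mapping is only available at the String level.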

Care must be taken in other languages, too. There is, for instance, the bothersome Dutch digraph IJ and ij. There is no such letter in any of the ISO 8859 character encodings and keyboards come without it, and so you'll have to type "IJSSELMEER". Let's apply the Java standard sequence of statements for capitalizing a word to a string containing these letters:

Locale nl_NL = new Locale( "nl", "NL" );
String IJSSELMEER = "IJSSELMEER"; 
System.out.println( "IJSSELMEER = " + IJSSELMEER );
String IJsselmeer =
  IJSSELMEER.substring( 0, 1 ).toUpperCase( nl_NL ) +
  IJSSELMEER.substring( 1 ).toLowerCase( nl_NL );
System.out.println( "IJsselmeer = " + IJsselmeer );

This snippet prints

IJSSELMEER = IJSSELMEER
IJsselmeer = Ijsselmeer

which is considered wrong; "IJsselmeer" would be the correct form. It should be obvious that a very special case like this is beyond any basic character translation you can expect from a Java API.
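If an application needs the Dutch form, it has to bring its own rule. The following is only a minimal sketch of such a rule, not something the Java API provides: the class DutchCapitalizer and its method capitalize are hypothetical, and the rule is limited to a leading "ij" in words processed with a Dutch locale.

```java
import java.util.Locale;

class DutchCapitalizer {
  // Capitalizes a word; for Dutch, a leading "ij" is treated as one unit.
  static String capitalize(String word, Locale locale) {
    if (word.isEmpty()) {
      return word;
    }
    String lower = word.toLowerCase(locale);
    boolean dutchIJ = "nl".equals(locale.getLanguage()) && lower.startsWith("ij");
    // Otherwise take the first code point, which may span two code units.
    int head = dutchIJ ? 2 : lower.offsetByCodePoints(0, 1);
    return lower.substring(0, head).toUpperCase(locale) + lower.substring(head);
  }

  public static void main(String[] args) {
    Locale nl = Locale.forLanguageTag("nl-NL");
    System.out.println(capitalize("IJSSELMEER", nl));             // IJsselmeer
    System.out.println(capitalize("IJSSELMEER", Locale.ENGLISH)); // Ijsselmeer
    System.out.println(capitalize("AMSTERDAM", nl));              // Amsterdam
  }
}
```

A rule like this would also capitalize both letters in a word that merely happens to start with "ij", so whether that is always the desired behaviour has to be decided by the application.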

Kind regards

Wolfgang


To Be Continued ...

In our next part to be published in May 2013, we will look at combining diacritical marks, collating or sorting strings, supplementary characters, property files and show how to write text files with the correct encoding. This whole subject is surprisingly tricky to get right, considering how long humans have been engraving their initials on whatever surface they could.

 
