Abstract: Unicode is the most important computing industry standard for representation and handling of text, no matter which of the world's writing systems is used. This newsletter discusses some selected features of Unicode, and how they might be dealt with in Java.
Welcome to the 209th issue of The Java(tm) Specialists' Newsletter, written in Vienna by our friend Dr. Wolfgang Laun who works for Thales Austria GmbH. I went to the CONFESS Java conference in Vienna last week together with my 11 year old daughter as a travel companion. Imagine our surprise when we arrived on the 3rd of April to a snow covered Vienna! Wolfgang had warned me that it was c-o-l-d, but I thought he was exaggerating. Our first port of call was to buy some trekking boots for my girl to keep her feet warm. We are not used to sub-zero temperatures in Crete. On Thursday morning, Wolfgang took us on an interesting walk around Vienna, showing us the original area of the Roman fort, then the architectures through the various eras - Romanesque, Gothic, baroque, etc. We learned a lot about Vienna and saw things that most tourists would miss, such as jutting out stones in narrow passages to keep the wagon wheels away from the wall.
Internationalization is tricky, due to the inexact nature of human communication. I'll never forget the time I tried to transfer £502.46 to my printers. Unfortunately my European banking system discarded the "." and initiated a transfer of £50246! Here in Europe they use the comma as a decimal point and the dot as a thousand separator. Fortunately I was able to cancel the transfer before it had a chance to go very far. Similarly, exchanging text can be surprisingly tricky. Thanks Wolfgang for taking your time to write this article on Unicode for us. I certainly learned a lot from it.
Administrative: We have moved over to a new mailing list, powered by Infusionsoft. Most links in my newsletters will from now on start with "https://iw127.infusionsoft.com/". Don't be alarmed.
javaspecialists.teachable.com: Please visit our new self-study course catalog to see how you can upskill your Java knowledge.
The first Unicode standard was published in 1991, shortly
after the Java project was started. A 16-bit design was
considered sufficient to encompass the characters of all the
world's living languages. Unicode 2.0, which no longer
restricted codepoints to 16 bits, appeared in 1996, but
Java's first release had emerged the year before. Java had to
follow suit, but char
remained a 16-bit type.
This article reviews several topics related to character and
string handling in Java.
System.out.println("To be or not to be\u000Athat is
here the question");
\uHHHH
)?Character.MIN_VALUE
is 0 and
Character.MAX_VALUE
is 65535, how many different
Unicode characters can be represented by a char
variable?String s
has length 1, is the
result of s.toUpperCase()
always the same as
String.valueOf(Character.toUpperCase(s.charAt(0)))
?
The use of "characters" in Java isn't quite as simple as the
type char
might suggest; several misconceptions
prevail. Notice that the word "character" goes back to the
Greek word "χαράζω" (i.e., to scratch, engrave) which may be
the reason why so many scratch their head over the resulting
intricacies.
Several issues need to be covered, ranging from the
representation of Java programs to the implementation of the
data types char
and
java.lang.String
, and the handling of character
data during input and output.
"[Java] programs are written using the Unicode character set." (Language specification, § 3.1) This simple statement is followed by some small print, explaining that each Java SE platform relates to one of the evolving Unicode specifications, with SE 5.0 being based on Unicode 4.0. In contrast to earlier character set definitions, Unicode distinguishes between the association of characters as abstract concepts (e.g., "Greek capital letter omega Ω") to a subset of the natural numbers, called code point on the one hand, and the representation of code points by values stored in units of a computer's memory. The Unicode standard defines seven of these character encoding schemes.
It would all be (relatively) simple if Unicode were the only standard in effect. Other character sets are in use, typically infested with vendor specific technicalities, and character data is bandied about without much consideration about what a sequence of storage units is intended to represent.
Another source of confusion arises from the limitations of our hardware. While high-resolution monitors let you represent any character in a wide range of glyphs with variations in font, style, size and colour, our keyboards are limited to a relatively small set of characters. This has given rise to the workaround of escape sequences, i.e., a convention by which a character can be represented by a sequence of keys.
A Java program needs to be stored as a "text file" on your
computer's file system, but this doesn't mean much except
that there is a convention for representing line ends, and
even this is cursed by the famous differences between all
major OS families. The Java Language Specification is not
concerned with the way this text is encoded, even though it
says that lexical processing expects this text to contain
Unicode characters. That's why a Java compiler features the
standard option -encoding
encoding. As long as your program contains nothing but the 26 letters, the 10 digits, white space and the special characters for separators and operators, you may not have to worry much about encoding, provided that the Java compiler is set to accept your system's default encoding and the IDE or editor plays along. Check https://docs.oracle.com/javase/7/docs/technotes/guides/intl/encoding.doc.html
for a list of supported encodings.
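For example, if your source files are saved as UTF-8 but your platform default differs, you can tell the compiler so explicitly (the file name here is just a placeholder):

javac -encoding UTF-8 MyProgram.java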
Several encodings map the aforementioned set of essential characters uniformly to the same set of code units of some 8-bit code. The character 'A', for instance, is encoded as 0x41 in US-ASCII, UTF-8 and in any of the codes ISO-8859-1 through ISO-8859-15, or windows-1250 through windows-1258. If you need to represent a Unicode code point beyond 0x7F you can evade all possible misinterpretations by supplying the character in the Unicode escape form defined by the Java language specification: characters '\' and 'u' must be followed by exactly four hexadecimal digits. Using this, the French version of "Hello world!" can be written as
package eu.javaspecialists.tjsn.examples.issue209;

public class AlloMonde {
    public static void main(String[] args) {
        System.out.println("All\u00F4 monde!");
    }
}
Since absolutely any character can be represented by a Unicode escape, you might write this very same program using nothing but Unicode escapes, as shown below, with line breaks added for readability:
\u0070\u0075\u0062\u006c\u0069\u0063\u0020\u0063\u006c\u0061\u0073
\u0073\u0020\u0041\u006c\u006c\u006f\u004d\u006f\u006e\u0064\u0065
\u0020\u007b\u000a\u0020\u0020\u0020\u0020\u0070\u0075\u0062\u006c
\u0069\u0063\u0020\u0073\u0074\u0061\u0074\u0069\u0063\u0020\u0076
\u006f\u0069\u0064\u0020\u006d\u0061\u0069\u006e\u0028\u0020\u0053
\u0074\u0072\u0069\u006e\u0067\u005b\u005d\u0020\u0061\u0072\u0067
\u0073\u0020\u0029\u007b\u000a\u0009\u0053\u0079\u0073\u0074\u0065
\u006d\u002e\u006f\u0075\u0074\u002e\u0070\u0072\u0069\u006e\u0074
\u006c\u006e\u0028\u0020\u0022\u0041\u006c\u006c\u00f4\u0020\u006d
\u006f\u006e\u0064\u0065\u0021\u0022\u0020\u0029\u003b\u000a\u0020
\u0020\u0020\u0020\u007d\u000a\u007d\u000a
So, the minimum number of keys you need for such an exercise is 18: the 16 hexadecimal digits plus '\' and 'u'. (On some keyboards you may need the shift key for '\'.)
The preceding tour de force contains several instances of the
escape \u000a
, which represents the line feed
control character - the line separator for Unices. By
definition, the Java compiler converts all escapes to Unicode
characters before it combines them into a sequence of
tokens to be parsed according to the grammar. Most of the
time you don't have to worry much about this, but there's a
notable exception: using \u000A
or
\u000D
in a character literal or a string
literal is not going to create one of these characters
as a character value - it indicates a line end to the lexical
parser, which is a violation of the rule that neither
carriage return nor line feed may occur as themselves within
a literal. These are the places where you have to use one of
the escape sequences \n
and \r
.
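Here is a minimal sketch of the difference (the class name EscapeDemo is my own choosing):

public class EscapeDemo {
    public static void main(String[] args) {
        // The Unicode escape below is translated before lexing,
        // so this statement prints the single letter A.
        System.out.println("\u0041");
        // A line feed inside a literal must be written with the
        // escape sequence \n; writing the Unicode escape for LF
        // instead would split the literal across two source lines
        // and fail to compile.
        System.out.println("line one\nline two");
    }
}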
Heinz wrote about this almost 11 years ago in
newsletter 50.
Attentive readers might now want to challenge my claim that
all Java programs can be written using only 18 keys, which
did not include 'n' and 'r'. But there are two ways to make
do with these 18 characters. The first one uses an
octal escape
, i.e., \12
or
\15
. The other one is the long-winded
representation of the two characters of the escape sequence
by their Unicode escapes: \u005C\u006E
and
\u005C\u0072
.
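To convince yourself that these spellings all denote the same character, here is a little sketch (the class name is mine):

public class EscapeEquivalence {
    public static void main(String[] args) {
        // All three literals contain a single line feed character.
        System.out.println("\n".equals("\12"));           // true: octal escape for LF
        System.out.println("\n".equals("\u005C\u006E"));  // true: backslash and n as Unicode escapes
    }
}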
Another fancy feature of Java is based on the rule that
identifiers may contain any character that is a "Java letter"
or a "Java letter-or-digit". The language specification (cf.
§ 3.8) enumerates neither set explicitly; it delegates the
decision to the java.lang.Character
methods
isJavaIdentifierStart
and
isJavaIdentifierPart
, respectively. This lets
you create an unbelievable number of identifiers, even ones as short as two characters. Investigating all
char
values yields 45951 and 46908 qualifying
values respectively, and this would produce 2,155,469,506
identifiers of length two! (We have to subtract two for the
two keywords of length two, of course: do
and if
.)
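If you want to reproduce those counts yourself, a quick sketch like the following will do; the exact numbers depend on the Unicode data of the JDK you run it on, so newer releases may report slightly different values:

public class CountIdentifierChars {
    public static void main(String[] args) {
        int starts = 0, parts = 0;
        for (char c = 0; ; c++) {
            if (Character.isJavaIdentifierStart(c)) starts++;
            if (Character.isJavaIdentifierPart(c)) parts++;
            if (c == Character.MAX_VALUE) break;   // avoid wrap-around of char
        }
        System.out.println(starts + " start characters, " + parts + " part characters");
    }
}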
The decisions about which characters may start or be part of a Java identifier exhibit a good measure of laissez-faire. Along
with the dollar sign you can use any other currency sign
there is. (Isn't ¢lass
a nice alternative
to the ugly clazz
?) More remarkable is the
possibility of starting an identifier with characters that
are classified as numeric, e.g., Ⅸ, the Roman numeral
nine, a single character, is a valid identifier. Most
astonishing is the option to use most control characters as
part of an identifier, all the more so because they don't
have printable representations at all. Here is one example,
with a backspace following the initial letter 'A':
A\u0008
. Given a suitable editor, you can create
a source file where the backspace is represented as a single
byte, with the expected effect when the file is displayed on
standard output:
public class FancyName {
    public static void main( String[] args ){
        String  = "backspace";
        System.out.println();
    }
}
We may now try to answer the question of how many character
values can be stored in a variable of type char
,
which actually is an integral type. The extreme values
Character.MIN_VALUE
and
Character.MAX_VALUE
are 0 and 65535,
respectively. These 65536 numeric values would be open to
any interpretation, but the Java language specification says
that these values are UTF-16 code units, values that
are used in the UTF-16 encoding of Unicode texts. Any
representation of Unicode must be capable of representing the
full range of code points, its upper bound being 0x10FFFF.
Thus, code points beyond 0xFFFF need to be represented by
pairs of UTF-16 code units, and the values used with these
so-called surrogate pairs are exempt from being used
as code points themselves. In
java.lang.Character
we find the static methods
isHighSurrogate
and isLowSurrogate
,
simple tests that return true for the ranges
'\uD800'
through '\uDBFF'
and
'\uDC00'
through '\uDFFF'
,
respectively. Also, by definition, code units 0xFFFF and
0xFFFE do not represent Unicode characters. From this we can
deduce that at most 65536 - (0xE000 - 0xD800) - 2, or 63486,
Unicode code points can be represented as a char
value.
The actual number of Unicode characters that can be
represented in a char
variable is certainly
lower, simply because there are gaps in and between the
blocks set aside for the various alphabets and symbol sets.
It is evident that the full range of Unicode code points can
only be stored in a variable of type int
. This
has not always been so: originally, Java was meant to
implement Unicode characters where all code points could be
represented by a 16-bit unsigned integer. Since that time,
Unicode has outgrown this Basic Multilingual Plane (BMP), so
that Java SE 5.0 had to make amends, adding character
property methods to java.lang.Character
, in
parallel to existing ones with a char
parameter,
where the parameter is an int
identifying a code
point.
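As a small illustration of these int-based methods, consider a code point beyond the BMP, say U+1D11E, MUSICAL SYMBOL G CLEF. The sketch below (class name mine; Character.getName needs Java 7 or later) shows that it takes two char values:

public class SupplementaryDemo {
    public static void main(String[] args) {
        int gClef = 0x1D11E;                       // a code point outside the BMP
        char[] units = Character.toChars(gClef);   // its UTF-16 encoding
        System.out.println(units.length);                         // 2
        System.out.println(Character.isHighSurrogate(units[0]));  // true
        System.out.println(Character.isLowSurrogate(units[1]));   // true
        System.out.println(Character.getName(gClef));             // MUSICAL SYMBOL G CLEF
    }
}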
When a character can be encoded with a single 16-bit value, a
character string can be simply encoded as an array of
characters. But the failure of char
to cover
all Unicode code points breaks the simplicity of this design.
Accessing a string based on the progressive count of code
points or Unicode characters isn't possible by mere index
calculation any more, because code points are represented by
one or two successive code units.
Given that we have a String value where surrogate pairs occur intermingled with individual code units identifying a code point, how do you obtain the number of Unicode characters in this string? How can you obtain the n-th Unicode character off the start of the string?
The answers to both questions are simple because there are String methods providing an out-of-the-box solution. First, the number of Unicode characters in a String is obtained like this:
public static int ucLength(String s) {
    return s.codePointCount(0, s.length());
}
Two method calls are sufficient for implementing the
equivalent of method charAt
, the first one for
obtaining the offset of the n-th Unicode character in terms
of code unit offsets, whereupon the second one extracts one
or two code units for obtaining the integer code point.
public static int ucCharAt(String s, int index) {
    int iPos = s.offsetByCodePoints(0, index);
    return s.codePointAt(iPos);
}
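A brief usage sketch, assuming the two helper methods above are in scope (the sample string is my own, ending in the surrogate pair for U+1D11E):

String s = "G clef: \uD834\uDD1E";
System.out.println(s.length());                            // 10 code units
System.out.println(ucLength(s));                           // 9 Unicode characters
System.out.println(Integer.toHexString(ucCharAt(s, 8)));   // 1d11e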
When the world was young, the Romans used to chisel their inscriptions using just 21 letters in a form called Roman square capitals. This very formal form of lettering was not convenient for everyday writing, where a form called cursiva antigua was used, as difficult to read for us now as it must have been then. Plautus, a Roman comedian, wrote about them: "a hen wrote these letters", which may very well be the origin of the term chicken scratch.
Additional letters, diacritics and ligatures morphing into proper letters are nowadays the constituents of the various alphabets used in western languages, and they come in upper case and lower case forms. Capitalization, i.e., the question when to write the initial letter of a word in upper case, is quite an issue in some languages, with German being a hot contender for the first place, with its baffling set of rules. Moreover, writing headings or emphasized words in all upper case is in widespread use.
As an aside, note that the custom of capitalizing words (as used in English texts) may have subtle pitfalls. (Compare, for instance, "March with a Pole" to "march with a pole", with two more possible forms.)
Java comes with the String
methods
toUpperCase
and toLowerCase
.
Programmers might expect these methods to produce strings of
equal length, and one to be the inverse of the other when
initially applied to an all upper or lower case word. But
this is not true. One famous case is the German lower case
letter 'ß' ("sharp s"), which (officially) doesn't have an
upper case form (yet). Executing these statements
Locale de_DE = new Locale( "de", "DE" );
String wort = "Straße";
System.out.println( "wort = " + wort );
String WORT = wort.toUpperCase( de_DE );
System.out.println( "WORT = " + WORT );
produces
wort = Straße
WORT = STRASSE
which is correct. Clearly,
Character.toUpperCase(char)
cannot work this
small miracle. (The ugly combination STRAßE
should be avoided.) More fun is to be expected in the near
(?) future, when the LATIN CAPITAL LETTER SHARP S (U+1E9E)
that was added to Unicode in 2008 will be adopted by trendy
typesetters (or typing trendsetters), like this:
STRAẞE.
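Indeed, Character.toUpperCase works on one code point at a time and cannot map one character to two, as a quick check shows (a bare snippet in the style of the ones above):

System.out.println(Character.toUpperCase('ß'));               // ß - no single-character mapping exists
System.out.println("ß".toUpperCase(new Locale("de", "DE")));  // SS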
Care must be taken in other languages, too. There is, for
instance, the bothersome Dutch digraph IJ and ij.
There is no such letter in any of the ISO 8859 character
encodings and keyboards come without it, and so you'll have
to type "IJSSELMEER
". Let's apply the Java
standard sequence of statements for capitalizing a word to a
string containing these letters:
Locale nl_NL = new Locale( "nl", "NL" );
String IJSSELMEER = "IJSSELMEER";
System.out.println( "IJSSELMEER = " + IJSSELMEER );
String IJsselmeer = IJSSELMEER.substring( 0, 1 ).toUpperCase( nl_NL )
    + IJSSELMEER.substring( 1 ).toLowerCase( nl_NL );
System.out.println( "IJsselmeer = " + IJsselmeer );
This snippet prints
IJSSELMEER = IJSSELMEER
IJsselmeer = Ijsselmeer
which is considered wrong; "IJsselmeer" would be the correct form. It should be obvious that a very special case like this is beyond any basic character translation you can expect from a Java API.
Kind regards
Wolfgang
In our next part to be published in May 2013, we will look at combining diacritical marks, collating or sorting strings, supplementary characters, property files and show how to write text files with the correct encoding. This whole subject is surprisingly tricky to get right, considering how long humans have been engraving their initials on whatever surface they could.
We are always happy to receive comments from our readers. Feel free to send me a comment via email or discuss the newsletter in our JavaSpecialists Slack Channel (Get an invite here)
We deliver relevant courses, by top Java developers to produce more resourceful and efficient programmers within their organisations.