Abstract: We continue our discussion on Unicode by looking at how we can compare text that uses diacritical marks or special characters such as the German Umlaut.
Welcome to the 211th issue of The Java(tm) Specialists' Newsletter, written in Vienna by our friend Dr. Wolfgang Laun who works for Thales Austria GmbH. This is the second of two parts of a newsletter on the topic of Unicode. You can read the first part here. It was a bit difficult to send out the first part, since I discovered that my wonderful CRM providers were storing characters as ISO-8859-1, rather than a more sensible encoding such as UTF-8. It took me an inordinate amount of time to get most of the characters to display correctly, and then they still showed up wrong on some systems. Instead of Unicode Redux, it gave me Unicode Reflux. This edition will probably have worse issues.
Thank you also to those who wrote to congratulate us on the birth of the latest member of our family, Efigenia Gabriela Kabutz, born on the 13th of May 2013. Imagine what the world will look like when Efigenia is your age! I still remember seeing Skype-like devices on Star Trek when I was a kid and being sure that something like that would never exist. We had one of those rotating dials that used voltage pulses to dial the number. You could actually phone without dialing by tapping the off button quickly in a row for all the numbers. So our number 492019 was four taps, nine taps, two taps, ten taps, one tap and nine taps. It was the way you could use the phone if someone had locked the rotating dial. Oh and in our holiday house in Hermanus, which we ended up selling to F.W. de Klerk, we had a phone with a crank that would generate a signal for the exchange. We would then tell the exchange operator which number we wanted and they would connect us. I remember that. Maybe one day Efigenia will reminisce about how when she grew up, she used to plug her appliances into wall sockets for power!
javaspecialists.teachable.com: Please visit our new self-study course catalog to see how you can upskill your Java knowledge.
In the first half of this article, we showed how you can write complete Java programs with just 18 keys. We also explained how character values worked and how character strings were made up. We ended off showing how complicated upper case can be when you have different languages that might not have characters for an upper case letter.
Can you guess how the method words can be called to produce an output like the one shown below?
private static void words(String w1, String w2) {
  String letPat = "[^\\p{Cntrl}]+";
  assert w1.matches(letPat) && w2.matches(letPat);
  System.out.println(w1 + " - " + w2 + ": " + w1.equals(w2));
}
Genève - Genève: false
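One way to produce exactly that output - sketched here as an assumption, since the article leaves it as a small puzzle - is to pass the same word once with the precomposed 'è' (U+00E8) and once with a plain 'e' followed by the combining grave accent (U+0300):

// looks identical when rendered, but the char sequences differ
words("Genève", "Gene\u0300ve");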
The codepoints in the Unicode block combining diacritical marks might be called the dark horses in the assembly of letters. They are nothing on their own, but when following the right kind of letter, they unwaveringly exercise their influence, literally "crossing the t's and dotting the i's". They occur in non-Latin alphabets, and they add an almost exotic flavour to the Latin-derived alphabet, with, e.g., the diaeresis decorating vowels (mostly) and the háček adding body to consonants (mostly).
The combining marks can be used in fanciful ways, for instance: O͜O - two letters 'O' joined by a combining mark spanning both - which should display as intended if your browser and operating system are rendering the diacritical mark correctly. While there are numerous Unicode codepoints for precombined letters with some diacritic, it is also permitted to represent them by their basic letter followed by the combining diacritical mark, and some applications might prefer to do it that way. You can guess that this means trouble if your software has to compare words. Method equals in String is certainly not prepared to deal with such subtleties, unless the strings have been subjected to a process called normalization. This can be done by applying the static method normalize of class java.text.Normalizer. Here is a short demonstration.
import java.text.Normalizer;

public class Normalize {
  public boolean normeq(String w1, String w2) {
    if (w1.length() != w2.length()) {
      w1 = Normalizer.normalize(w1, Normalizer.Form.NFD);
      w2 = Normalizer.normalize(w2, Normalizer.Form.NFD);
    }
    return w1.equals(w2);
  }

  public void testEquals(String w1, String w2) {
    System.out.println(w1 + " equals " + w2 + " " + w1.equals(w2));
    System.out.println(w1 + " normeq " + w2 + " " + normeq(w1, w2));
  }
}
The enum constant Normalizer.Form.NFD selects the kind of normalization to apply; here it is just the decomposition step that separates precombined letters into a Latin letter and the diacritical mark. Let's try it out:
public class NormalizeTest {
  public static void main(String[] args) {
    Normalize norm = new Normalize();
    norm.testEquals("Genève", "Gene\u0300ve");
    norm.testEquals("ha\u0301ček", "hác\u030cek");
  }
}
We can see this output. Warning: you might need to use the correct font to view this properly.
Genève equals Genève false
Genève normeq Genève true
háček equals háček false
háček normeq háček false
Do you see what went wrong? The blunder is in method normeq: you can't assume that equal lengths indicate the same normalization state. In the second pair of words, one word was written with the first accented letter precombined and the second one decomposed, the other word vice versa: the string lengths are equal, the character arrays are not, yet the word is the same. There is no shortcut, but we can use this optimistic approach:
public boolean normeq(String w1, String w2) {
  if (w1.equals(w2)) {
    return true;
  } else {
    w1 = Normalizer.normalize(w1, Normalizer.Form.NFD);
    w2 = Normalizer.normalize(w2, Normalizer.Form.NFD);
    return w1.equals(w2);
  }
}
Class java.lang.String implements java.lang.Comparable, but its method compareTo is just a rudimentary effort, with a resulting collating sequence that isn't good for anything except for storing strings in an array where binary search is used. Consider, for instance, these four words, which are presented in the order Germans expect them in their dictionaries: "Abend", "aber", "morden", "Morgen". Applying Arrays.sort to this set yields "Abend", "Morgen", "aber", "morden", due to all upper case letters in the range 'A' to 'Z' preceding all lower case letters.
Treating an upper case and the corresponding lower case letter as (almost) equal is just one of the many deviations from the character order required in a useful collation algorithm. Also, note that there's a wide range of applied collations, varying by language and usage. German dictionaries, for instance, use a collation where vowels with a diaeresis are ranked immediately after the unadorned vowel, and the letter 'ß', originally resulting from a ligature of 'ſ' (long s) and 'z', is treated like 'ss'. But for lists of names, as in a telephone book, the German Standard establishes the equations 'ä' = 'ae', 'ö' = 'oe' and 'ü' = 'ue'. Book indices may require very detailed attention, e.g., when mathematical symbols have to be included.
The technical report Unicode Collation Algorithm (UCA) contains a highly detailed specification of a general collation algorithm, with all the bells and whistles required to cope with all nuances for ordering. For anyone planning a non-trivial application dealing with Unicode strings and requiring sorting and searching, this is a must-read, and it's highly informative for anybody with an interest in languages.
Even if not all intricacies outlined in the UCA report are implemented, a generally applicable collating algorithm must support the free definition of collating sequences, and it is evident that this requires more than just the possibility of defining an arbitrary ordering of the characters. The class RuleBasedCollator in java.text provides the most essential features for this. Here is a simple example for the use of RuleBasedCollator.
import java.text.*;
import java.util.*;

public class GermanSort implements Comparator<String> {
  private final RuleBasedCollator collator;

  public GermanSort() throws ParseException {
    collator = createCollator();
  }

  private RuleBasedCollator createCollator() throws ParseException {
    String german = "" +
        "= '-',''' " +
        "< A,a;ä,Ä< B,b< C,c< D,d< E,e< F,f< G,g< H,h< I,i< J,j" +
        "< K,k< L,l< M,m< N,n< O,o;Ö,ö< P,p< Q,q< R,r< S,s< T,t" +
        "< U,u;Ü,ü< V,v< W,w< X,x< Y,y< Z,z" +
        "& ss=ß";
    return new RuleBasedCollator(german);
  }

  public int compare(String s1, String s2) {
    return collator.compare(s1, s2);
  }

  public void sort(String[] strings) {
    Arrays.sort(strings, this);
  }
}
The string german contains the definition of the rules, ranking the 26 letters of the ISO basic Latin alphabet by using the primary relational operator '<'. A weaker ordering principle is indicated by a semicolon, which places an umlaut after its stem vowel, and even less significant is the case difference, indicated by a comma. The initial part defines the hyphen and the apostrophe as ignorable characters. The last relations reset the position to 's', and rank 'ß' as equal to 'ss'. (Note: The javadoc for this class is neither complete nor correct. Use the syntax illustrated in the preceding example for defining ignorables.)
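Here is a brief, hypothetical usage sketch; given the four example words from above, the German rules should produce the dictionary order shown in the comment:

import java.text.ParseException;
import java.util.Arrays;

public class GermanSortTest {
  public static void main(String[] args) throws ParseException {
    GermanSort sorter = new GermanSort();
    String[] words = {"Morgen", "aber", "morden", "Abend"};
    sorter.sort(words);
    // expected dictionary order: [Abend, aber, morden, Morgen]
    System.out.println(Arrays.toString(words));
  }
}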
There is, however, a much simpler way to obtain a Collator that is adequate for most collating tasks (or at least a good starting point): simply call method getInstance, preferably with a Locale parameter. This returns a prefabricated RuleBasedCollator object, according to the indicated locale. Make sure to select the locale not only according to language, since the country may affect the collating rules. Also, the Collator instances available in this way may not be up-to-date, as the following little story illustrates. There used to be a French collating rule requiring the words "cote", "côte", "coté" and "côté" to be in this order, which is in contrast to normal accent ordering, i.e., "cote", "coté", "côte" and "côté". Not too long ago, this fancy rule was retracted and now survives only in Canada. But, even with JDK 7, you may have to create a modified Collator by removing the trailing '@' from the string defining the sort rules:
Collator collator = Collator.getInstance(new Locale("fr", "FR"));
String rules = ((RuleBasedCollator) collator).getRules(); // '@' is last
rules = rules.substring(0, rules.length() - 1);
collator = new RuleBasedCollator(rules);
(Making the preceding code robust is left as an exercise to the reader.)
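One possible way to harden it - a sketch, not necessarily the intended solution - is to strip the '@' only when it really is the last character of the rules:

import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Locale;

public class FrenchCollator {
  static Collator frenchCollator() throws ParseException {
    Collator collator = Collator.getInstance(new Locale("fr", "FR"));
    if (collator instanceof RuleBasedCollator) {
      String rules = ((RuleBasedCollator) collator).getRules();
      if (rules.endsWith("@")) {
        // drop the trailing '@' that requests French accent ordering
        collator = new RuleBasedCollator(rules.substring(0, rules.length() - 1));
      }
    }
    return collator;
  }
}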
Comparing Unicode strings according to a rule based collation is bound to be a non-trivial process, since the collator rules must be taken into account. You can get an idea of what this means when you look at class CollationElementIterator. This iterator, obtainable for strings by calling the RuleBasedCollator method getCollationElementIterator, delivers sequences of integers that, when compared to each other, result in the correct relation according to the collator. These integers are quite artsy combinations of a character, or of a character and the next one; even two or more key integers may result from a single character. For a once-in-a-while invocation of a collator's compare method this isn't going to hurt, but sorting more than a fistful of strings is an entirely different matter.
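A small sketch of how such an iterator is obtained and consumed (the actual integer values depend on the collator, so they are not reproduced here):

import java.text.CollationElementIterator;
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.Locale;

public class CollationElements {
  public static void main(String[] args) {
    // getInstance returns a RuleBasedCollator for the standard locales
    RuleBasedCollator collator =
        (RuleBasedCollator) Collator.getInstance(Locale.GERMANY);
    CollationElementIterator it =
        collator.getCollationElementIterator("Straße");
    for (int e = it.next(); e != CollationElementIterator.NULLORDER; e = it.next()) {
      System.out.printf("%08x%n", e); // one collation element per line
    }
  }
}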
This is where class CollationKey comes to the rescue. Objects are created by calling the (rule based) collator method getCollationKey for a string. Each object represents a value equivalent to the string's unique position in the set of all strings sorted according to this collator.
Putting this all together, an efficient sort of a collection of strings should create a collection of collation keys and sort it. Conveniently enough, the CollationKey method getSourceString delivers the corresponding string from which the key was created. This is shown in the sort method given below.
public String[] sort(String[] strings) {
  CollationKey[] keys = new CollationKey[strings.length];
  for (int i = 0; i < strings.length; i++) {
    keys[i] = collator.getCollationKey(strings[i]);
  }
  Arrays.sort(keys);
  String[] sorted = new String[strings.length];
  for (int i = 0; i < sorted.length; i++) {
    sorted[i] = keys[i].getSourceString();
  }
  return sorted;
}
Supplementary characters that need to be expressed with surrogate pairs in UTF-16 are uncommon. However, it's important to know where they can turn up, as they may require precautions in your application code. At the very least, Java applications for mobile devices will have to be aware of the growing number of Emoji symbols. Games could be another popular reason for the need to include supplementary characters.
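If such characters can reach your code, remember that one supplementary character occupies two chars in a String, so you should iterate by code point rather than by char. A minimal sketch, using an Emoji code point purely as an example:

public class SupplementaryDemo {
  public static void main(String[] args) {
    // U+1F600 needs a surrogate pair in UTF-16
    String s = new StringBuilder().appendCodePoint(0x1F600).append('!').toString();
    System.out.println(s.length());                      // 3 chars
    System.out.println(s.codePointCount(0, s.length())); // 2 code points
    // iterate by code point, not by char
    for (int i = 0; i < s.length(); ) {
      int cp = s.codePointAt(i);
      System.out.printf("U+%04X%n", cp);
      i += Character.charCount(cp);
    }
  }
}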
The javadoc for java.util.Properties states that the load and store methods read and write a character stream encoded in ISO 8859-1. This is an 8-bit character set, containing a selection of letters with diacritical marks as used in several European languages in addition to the traditional US-ASCII characters. Any other character must be represented using the Unicode escape (\uHHHH).
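As a small illustration - the key and value are made up - a character outside ISO 8859-1 has to appear as a Unicode escape in the properties text, and Properties decodes it on loading:

import java.io.StringReader;
import java.util.Properties;

public class PropertiesEscapeDemo {
  public static void main(String[] args) throws Exception {
    // the trade mark sign U+2122 is not in ISO 8859-1, so it must be escaped
    String content = "product=JavaSpecialists\\u2122\n";
    Properties props = new Properties();
    props.load(new StringReader(content));
    System.out.println(props.getProperty("product")); // JavaSpecialists™
  }
}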
This is quite likely to trip you up when you trustingly edit your properties file with an editor that's been educated to understand UTF-8. Although all printable ISO 8859-1 characters with code units greater than 0x7F happen to map to Unicode code points that are numerically equal to these code units, their UTF-8 encoding requires two bytes. (The appearance of 'Â' or 'Ã' in front of some other character is the typical evidence of such a misunderstanding.) Moreover, it's easy to create a character not contained in the ISO 8859-1 set. On my Linux system, emacs lets me produce the trade mark sign (™) with a few keystrokes. For a remedy, the same javadoc explains that the tool native2ascii accompanying the JDK may be used to convert a file from any encoding to ISO 8859-1 with Unicode escapes.
The Properties methods loadFromXML(InputStream) and storeToXML(OutputStream, String, String) read and write XML data, which should indicate its encoding in the XML declaration. It may be more convenient to use these methods than the edit-and-convert rigmarole required for a simple character stream.
We call a file a "text file" if its data is meant to be a sequence of lines containing characters. While any programming language may have its individual concept of the set of characters it handles as a value of a character data type, and its own way of representing a character in memory, things aren't quite as simple as soon as you have to entrust your data to a file system. Other programs, on the same or on another system, should be able to read that data and interpret it so that they come up with the same set of characters. Standards institutes and vendors have created an overly rich set of encodings: prescriptions for mapping byte sequences to character sequences. On top of that, there are the various escape mechanisms which let you represent characters not contained in the basic set as sequences of characters from that set. The latter is an issue of interpretation according to various text formats, such as XML or HTML, and we'll skip it here.
Writing a sequence of characters and line terminators to a file should be a simple exercise, and the API of java.io does indeed provide all the essentials, but there are two things to consider: first, what should become of a "character" when it is stored on the medium or sent over the wire; second, how lines are separated.
If the set of characters in the text goes beyond what can be represented with one of the legacy encodings that use one 8-bit code unit per character, one of the Unicode encoding schemes UTF-8, UTF-16 or UTF-32 must be chosen, and it should be set explicitly, as it is risky to rely on the default stored in the system property file.encoding.
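For instance, a writer with an explicitly chosen encoding might look like this (file name and content are just for illustration):

import java.io.*;
import java.nio.charset.StandardCharsets;

public class WriteUtf8 {
  public static void main(String[] args) throws IOException {
    // state the encoding explicitly instead of relying on file.encoding
    try (Writer out = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream("greeting.txt"), StandardCharsets.UTF_8))) {
      out.write("Grüße aus Genève");
      out.write(System.lineSeparator());
    }
  }
}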
Which one should you choose, provided you have a choice at all? If size matters, consider that UTF-16 produces 2 bytes per character (for the BMP), whereas UTF-8 is a variable-width encoding, requiring 1, 2, 3 or more bytes per codepoint. Thus, if your text uses characters from US-ASCII only, the ratio between UTF-8 and UTF-16 will be 1:2; if you are writing an Arabic text, the ratio is bound to be 1:1; and for CJK it will be 3:2. Compressing the files narrows the distance considerably. Anyway, UTF-8 has become the dominant character encoding for the World-Wide Web, and it is increasingly used as the default in operating systems. Therefore it's hard to put up a case against using this very flexible encoding.
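You can check such ratios yourself; this little sketch (sample strings chosen arbitrarily) prints the encoded sizes:

import java.nio.charset.StandardCharsets;

public class EncodingSizes {
  public static void main(String[] args) {
    String[] samples = {"hello", "Grüße", "háček", "日本語"};
    for (String s : samples) {
      System.out.printf("%s: UTF-8 %d bytes, UTF-16 %d bytes%n", s,
          s.getBytes(StandardCharsets.UTF_8).length,
          s.getBytes(StandardCharsets.UTF_16BE).length); // BE avoids the BOM
    }
  }
}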
Delving into Unicode's mysteries is a highly rewarding adventure. We have seen that Java provides some support for text processing according to the Unicode standard, but you should always keep in mind that this support may not be sufficient for more sophisticated applications. This has been one of the two motives for writing this article. And what was the other one? Ah, yes, having fun!
Kind regards
Wolfgang
We are always happy to receive comments from our readers. Feel free to send me a comment via email or discuss the newsletter in our JavaSpecialists Slack Channel (Get an invite here)
We deliver relevant courses, by top Java developers, to produce more resourceful and efficient programmers within their organisations.