Abstract: Most of the time our code fails because of bugs in our code. Rarely, however, bugs in the JVM ecosystem cause our systems to fail. These are insanely difficult to diagnose in production systems. In this newsletter we look at two different such race conditions between our code and the JVM.
Welcome to the 234th edition of The Java(tm) Specialists' Newsletter, written during various flights at 30000 feet. Last weekend I was in Graz, Austria, in between two courses. On the Saturday I decided to take advantage of the stunning weather to head up the local Schoeckl mountain. Prior to starting, I asked my friend Holger whether a lot of people were walking up there. "Oh yes, half of Graz, don't worry." I did not want to go hiking in remote areas, lest something happen and I get into trouble. Holger kindly sent me a nice route I could follow on Endomondo. All keen I set out early on Saturday morning, hoping to still get some parking at the Schoecklkreuz.
Unfortunately I had forgotten to load the route into Endomondo prior to leaving the hotel, which I discovered when I parked my car. I thus had to follow the sign posts along the way. The first bit was OK, but then I got to a place where I wasn't quite sure. There were these red circles on the one sign, but I didn't notice them on the way up. The path looked OK, so I headed up. (I found out at the summit that red circle means - "Mittelschwierige Bergwege. Ausdauer und Trittsicherheit erforderlich"). A bit later, I saw some Austrians walk up the same route behind me, so I was fairly confident that I had made the right call. The path crossed a small road and then kept on going straight up through the forest. It did seem odd to me how many pine needles were lying on the path. Straight up did look like the shortest distance to get to the summit. So I trudged on. After what felt like a very long time, I checked my map and discovered that I had only gone up 1/3 of the way! And of course the Austrians who were behind me had decided to go along the small road instead of a goat path up the mountain. Down didn't sit well with me. First off there was my pride. Secondly walking down a treacherous path is more dangerous than up. So I kept on going. Fortunately when I checked the map a second time I was very close to the top. It was a very interesting little walk up the hill actually and I don't regret it, but it was a bit foolish. The pine needles did tell me that.
It all reminded me of the poem "The Road Not Taken" by Robert Frost, where he ends with: "Two roads diverged in a wood, and I - I took the one less traveled by, and that has made all the difference."
javaspecialists.teachable.com: Please visit our new self-study course catalog to see how you can upskill your Java knowledge.
In May this year, my esteemed friend and colleague Maurice Naftalin and I headed off to Israel to present my Extreme Java - Concurrency and Performance Course to a bunch of smart Intel engineers. As the author of the best-selling book on Generics and Collections, Maurice has a great command of the English language. This, coupled with his vast experience in software engineering, is extremely useful in helping answer some of the tougher questions. We usually tag-team each other, with me preferring to answer questions in my native language, Java. I was flattered that Intel would invite me to teach them. They also enjoyed the course a lot, with my favourite unsolicited comment being: "The three days that we've spent together were amazing. Java is our bread and butter and yet we've learnt many new things and techniques which I'm sure that will help us in the future." - Noam A.
During the course, we got to know all the students quite well, so when I received an email from Dudu Amar in August, telling me that String.compareTo() sometimes gave incorrect results, I knew that I should take his claims seriously. He had found an issue in production where someone was comparing Strings with .compareTo() to determine equality, rather than .equals(), and it would occasionally spit out 0 when they were actually different.
Strangely, the failures only happened on one of his machines. They had lots of other similar machines in production. We did a binary search through Java versions to try and figure out if this failure had been introduced at some specific time. After some testing, Dudu found that it did not happen in Java 1.7.0_25-64, but it did from 1.7.0_40-64 onwards. It also failed in all Java 8 versions he tried. Usually knowing the exact version that an error starts appearing helps narrow down which change might have caused the error. Java has special intrinsic functions for Strings, including compareTo(). Thus the code you see in the String.java file is not what will really be executed.
Sadly, it being Intel with a huge amount of hardware to play with, someone went and upgraded all the CPUs, with the result that Dudu's program no longer fails. We are thus looking for someone, anyone, who can make this program fail on their machine, so that we can keep on trying to figure out what caused it. Please contact me if this fails on your machine. We would like to try some additional tests to narrow down the exact cause.
import java.util.concurrent.*; public class PXMLTag { public final String m_type; private static String[] m_primitiveTypes = { "java.lang.String", "boolean", "int", "long", "float", "short"}; private static String[] m_allTypes = { "java.lang.String", "boolean", "int", "long", "float", "java.util.Vector", "java.util.Map", "short"}; public PXMLTag(String type) { this.m_type = type; } public boolean isPrimitiveType() { for (int i = 0; i < m_primitiveTypes.length; i++) { String type = m_type; String primitiveType = m_primitiveTypes[i]; if (type.compareTo(primitiveType) == 0 && notReallyPrimitiveType(type)) { System.out.println("This odd case"); System.out.println("type '" + type + "'"); System.out.println("currentPrimitiveType '" + primitiveType + "'"); System.out.println("They are equal? " + type.equals(primitiveType)); System.out.println("They are compared " + type.compareTo(primitiveType)); System.out.println("m_type '" + m_type + "'"); System.out.println("m_primitiveType[i] '" + m_primitiveTypes[i] + "'"); System.out.println("They are equal? " + m_type.equals(m_primitiveTypes[i])); System.out.println("They are compared " + m_type.compareTo(m_primitiveTypes[i])); return true; } } return false; } public static boolean notReallyPrimitiveType(String m_type) { return m_type.contains("Vector") || m_type.contains("Map"); } public static String getRandomType() { return m_allTypes[ ThreadLocalRandom.current().nextInt(m_allTypes.length)]; } public static void main(String[] args) { int threads = 1; if (args.length == 1) { threads = Integer.parseInt(args[0]); } for (int i = 0; i < threads; i++) { Thread t = createThread(); t.start(); } } private static Thread createThread() { return new Thread(new Runnable() { @Override public void run() { while (true) { PXMLTag tag = new PXMLTag(getRandomType()); if (tag.isPrimitiveType() && notReallyPrimitiveType(tag.m_type)) { System.out.println(tag.m_type + " not really primitive!"); System.exit(1); } try { Thread.sleep( ThreadLocalRandom.current().nextInt(100)); } catch (InterruptedException e) { return; } } } }); } }
The code failed even with just one thread, but it took longer than with 5 threads.
Interestingly, we can turn off intrinsics of compareTo() in
the String with the VM parameter
-XX:DisableIntrinsic=_compareTo
. (Thanks to
Aleksey Shipilev for that tip.) We tried that and the
program did not fail.
This was the first example of some very strange goings-on in Java code that could only be explained as a race-condition between our code and the JVM. I know that shouldn't happen. But it certainly seems to.
As I mentioned, I took Dudu's report seriously as he had attended my course and I knew that he was smart. I sometimes get slightly "less smart" requests from programmers who think it would be cool if I did their homework for them. My latest was a guy from South Africa asking if I wouldn't mind solving a programming task he had for a job interview! Sometimes when I feel playful I do a Socrates - ask questions that force them to think about the exercise themselves. Usually my questions take more effort to answer than the original homework assignment.
About a year ago someone sent me his homework assignment. He was studying computer science at some university in darkest Africa. His answer was so far off the mark that I gave him the following advice: "Programming is probably not the right career for you. If you struggled with that exercise, then it's not the right thing for you to be doing as a career. I think I would have been able to figure out that exercise when I was about 15 or 16 years old. Sorry for being harsh, but I don't want to get your hopes up."
I expected to never hear from him again. I knew it was a cruel thing to say to someone who had told me: "help me actualize my dreams in programming" I was rather surprised when I received an email from him about six months later. He told me that even though he could've easily been offended, he chose not to and took my advice. Instead of programming, he changed his focus and is now a database administrator and absolutely loves his work. I would be useless as a db admin and I admire him for being so mature to listen without getting angry. Well done mate!
Back to race conditions between our code and the JVM :-)
Another very smart ex-student of mine is Dr Wolfgang Laun. He sent me a code sample that he found on StackOverflow. This code started failing sporadically on the JVM from version 1.8.0_40 onwards. (Actually, isn't that a funny coincidence? Dudu's race condition happened after 1.7.0_40. Maybe 40 is to Java what is to us the number 13?) Before I show you the code (slightly modified from the original StackOverflow example), I would like to show just an excerpt that will make it clear that this simply should not fail:
double array[] = new double[1]; // some code unrelated to array[] array[0] = 1.0; // some more code unrelated to array[] if (array[0] != 1.0) { // could not possibly be true if (array[0] != 1.0) { // and we should definitely not get here either! } else { // I never saw the code get into this else statement } }
Seeing the code above, it should be clear that if we have just set the array[0] element to 1.0, that it would make no sense for it to be anything else but 1.0. In fact it might come back as 0.0!
import java.util.*; // based on https://stackoverflow.com/questions/32994608/java-8-odd-timing-memory-issue // java-8-odd-timing-memory-issue public class OnStackReplacementRaceCondition { private static volatile boolean running = true; public static void main(String... args) { new Timer(true).schedule(new TimerTask() { public void run() { running = false; } }, 1000); double array[] = new double[1]; Random r = new Random(); while (running) { double someArray[] = new double[1]; double someArray2[] = new double[2]; for (int i = 0; i < someArray2.length; i++) { someArray2[i] = r.nextDouble(); } // for whatever reason, using r.nextDouble() here doesn't // seem to show the problem, but the # you use doesn't seem // to matter either... someArray[0] = .45; array[0] = 1.0; // can use any double here instead of r.nextDouble() // or some double arithmetic instead of the new Double new Double(r.nextDouble()); double actual; if ((actual = array[0]) != 1.0) { System.err.println( "claims array[0] != 1.0....array[0] = " + array[0] + ", was " + actual); if (array[0] != 1.0) { System.err.println( "claims array[0] still != 1.0...array[0] = " + array[0]); } else { System.err.println( "claims array[0] now == 1.0...array[0] = " + array[0]); } System.exit(1); } else if (r.nextBoolean()) { array = new double[1]; } } System.out.println("All good"); } }
Output on all Java 8 versions from 1.8.0_40 to 1.8.0_72-ea, and also 1.9.0-ea was sometimes:
claims array[0] != 1.0....array[0] = 1.0, was 0.0 claims array[0] now == 1.0...array[0] = 1.0
The most likely reason for this bug is a race condition between on-stack-replacement (OSR) and our code. OSR does a hot-swap of our method that is currently at the top of the stack with faster, more optimized code. OSR is usually not the fastest machine code. To get that, we need to exit completely from the method and come back in again. For example, consider this class OptimizingWithOSR:
public class OptimizingWithOSR { public static void main(String... args) { long time = System.currentTimeMillis(); double d = testOnce(); System.out.println("d = " + d); time = System.currentTimeMillis() - time; System.out.println("time = " + time); } private static double testOnce() { double d = 0; for (int i = 0; i < 1_000_000_000; i++) { d += 0; } return d; } }
When we run this code with -XX:+PrintCompilation, we see that our testOnce() method is compiled, but not main(). This makes sense, since we hardly execute any code in main(). It all happens in testOnce(), but only once. Here is the output on my machine:
84 73 % 3 OptimizingWithOSR::testOnce @ 4 (22 bytes) 84 74 3 OptimizingWithOSR::testOnce (22 bytes) 84 75 % 4 OptimizingWithOSR::testOnce @ 4 (22 bytes) 86 73 % 3 OptimizingWithOSR::testOnce @ -2 (22 bytes)
The "%" means that it is doing on-stack-replacement. Now in terms of performance, with the default -XX:+UseOnStackReplacement my code runs in about 850ms. If I turn it off with -XX:-UseOnStackReplacement, it runs in 11 seconds and we do not see testOnce() in the compilation output.
It would be easy to jump to the conclusion that OSR is essential and that it would make all our code run gazillions of times faster. Nope. OSR will mostly make large monolithic code run a bit faster. Microbenchmarks, which are anyway unreliable at best, could also complete more quickly. But well-structured, well-factored code will not see much benefit. Thus if you are a good Java developer, you can safely turn it off in production without suffering any slowdown. I refactored the code a bit by breaking the long loop into three and then extracting each of the loops into separate methods:
public class OptimizingWithNormalHotspot { public static void main(String... args) { for (int i = 0; i < 10; i++) { test(); } } private static void test() { long time = System.currentTimeMillis(); double d = testManyTimes(); System.out.println("d = " + d); time = System.currentTimeMillis() - time; System.out.println("time = " + time); } private static double testManyTimes() { double d = 0; for (int i = 0; i < 1_000; i++) { d = unrolledTwo(d); } return d; } private static double unrolledTwo(double d) { for (int j = 0; j < 1_000; j++) { d = unrolledOne(d); } return d; } private static double unrolledOne(double d) { for (int k = 0; k < 1_000; k++) { d = updateD(d); } return d; } private static double updateD(double d) { d += 0; return d; } }
The code now runs exactly the same, whether I have OSR turned on or not.
Thus my recommendation would be to run production systems that are using Java 1.8.0_40 or later with OSR turned off with the switch -XX:-UseOnStackReplacement. Of course you need to test that you don't have any performance degradation, but I would doubt it. Please let me know if you do.
The other race condition I mentioned at the start of this newsletter is more nebulous. Please shout if you manage to reproduce it on your hardware.
I went to see the Danube yesterday with my friend Frans Mahrl who lives in "Orth an der Donau". The level is the lowest it has been since they started recording about 400 years ago. A bit more and we'll be able to wade across.
Kind regards from a sunny Crete
Heinz
We are always happy to receive comments from our readers. Feel free to send me a comment via email or discuss the newsletter in our JavaSpecialists Slack Channel (Get an invite here)
We deliver relevant courses, by top Java developers to produce more resourceful and efficient programmers within their organisations.