Abstract: The garbage collector can cause our program to slow down at inappropriate times. In this newsletter we look at how we can delay the GC so that it never runs during our measurements, thus producing more reliable results. Then, instead of actually collecting the dead objects, we simply shut the JVM down and start it again.
Welcome to the 191st issue of The Java(tm) Specialists' Newsletter, sent to you from the stunning Island of Crete. A Greek friend remarked on Sunday how surprised he was at my recent progress with the Greek language. Michaelis had been encouraging me for the last 4 years to knuckle down and put in a bit of an effort.
Learning a new language requires a good memory, which has never been my strong point. I discovered in grade 8 history that I had hardly any short-term memory at all. That, coupled with an illegible scrawl, was a serious handicap in school. As a result, my grades were never particularly good. I was also overconfident and seldom read questions properly before writing my answers. The only time I ever got a distinction in school was for my final exam. It is funny that the rest of my 12 years of school do not count for anything. Nobody was ever interested in how many PT classes I bunked or that I failed "hand writing" in grade 5; all they wanted to know was my final score.
Greek is not the first language I learned. In high school, I decided to learn an African language called isiXhosa, spoken by our former president Nelson Mandela. This was in the years of apartheid, when it was extremely unfashionable for a melanin deprived individual to learn an African language. In my class, I was the most fluent speaker. My secret? I would walk home from school and speak to anybody who cared to listen. I made a lot of mistakes and most of the time I spoke about the same things. "Where are you from? Where are you going? It is hot today". This loosened my tongue and gave me confidence. I do the same with Greek.
My second secret was that I wrote a computer program in GWBasic to help me remember the vocabulary. The program would show me a word and then I would have to write it in isiXhosa. If I made a mistake, I would get punished by having to type it three times. I wrote a similar Java program for Greek, which was challenging because of the strange character set. What surprises me is how this program flashes the words into my medium term memory, without first going into the short-term memory. Even though I don't actively remember the word, it is there in my brain and I can recall it, as long as I do not think too hard. Another surprise is that I am finding it as difficult (or as easy) to learn Greek as I found it to learn isiXhosa, when I was 13 years old. Some expats tell me that they are too old to learn. But it is just as hard for me to remember anything as when I was 13. It is a major effort for me to recall even a single word.
So, to conclude this very long opening: if you are finding it hard to learn new languages, computer or natural, you are in good company. It is hard for me too. The trick is to use the language; eventually something will stick :-) (Oh, and just for interest, "use" in Greek is xrhsimopoiw :-))
javaspecialists.teachable.com: Please visit our new self-study course catalog to see how you can upskill your Java knowledge.
A few weeks ago, our French Java Specialist Master Course instructor sent us an interesting puzzle from his brilliant The Coder's Breakfast blog. I love the name, as it evokes good French coffee and a croissant whilst puzzling over the latest Java quiz.
I will rephrase his question a bit, because I think something might have gotten lost in the translation from French into English. With minimal Java code changes, how can you make the Quiz44 class run at least 10x faster?
import java.util.Random;

public class Quiz44 {
  private static final int NB_VALUES = 10000000;
  private static final int MAX_VALUE = 1000;
  private static final int NB_RUNS = 10;

  public static void main(String[] args) {
    Integer[] boxedValues = new Integer[NB_VALUES];
    int[] values = initValues();
    System.out.println("Benchmarking...");
    for (int run = 1; run <= NB_RUNS; run++) {
      long t1 = System.currentTimeMillis();
      for (int i = 0; i < NB_VALUES; i++) {
        boxedValues[i] = values[i];
      }
      long t2 = System.currentTimeMillis();
      System.out.printf("Run %2d : %4dms%n", run, t2 - t1);
    }
  }

  /**
   * Nothing important here, just values init.
   */
  private static int[] initValues() {
    System.out.println("Generating values...");
    int[] values = new int[NB_VALUES];
    Random random = new Random();
    for (int i = 0; i < values.length; i++) {
      values[i] = random.nextInt(MAX_VALUE);
    }
    return values;
  }
}
When I run this on my MacBook Pro, with -Xmx500m, I see the following results:
Generating values...
Benchmarking...
Run 1 : 1657ms
Run 2 : 2879ms
Run 3 : 2265ms
Run 4 : 2217ms
Run 5 : 2211ms
Run 6 : 2216ms
Run 7 : 930ms
Run 8 : 2216ms
Run 9 : 2241ms
Run 10 : 2216ms
The average is 2105ms, with a variance of 254,000, caused mainly by the outlier in run 7.
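For those who want to check my arithmetic, here is a minimal sketch of how such figures can be calculated; the class name is my own, and I use the sample variance (dividing by n - 1), which appears to match the numbers quoted in this newsletter:

public class RunStats {
  public static void main(String[] args) {
    // The elapsed times of the ten runs from the benchmark above.
    long[] runs = {1657, 2879, 2265, 2217, 2211, 2216, 930, 2216, 2241, 2216};
    double sum = 0;
    for (long run : runs) sum += run;
    double average = sum / runs.length;
    double squares = 0;
    for (long run : runs) squares += (run - average) * (run - average);
    double variance = squares / (runs.length - 1); // sample variance
    System.out.printf("average = %.0fms, variance = %.0f%n", average, variance);
  }
}

For the run above, this prints an average of about 2105ms and a variance of roughly 254,000.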
There are some obvious changes to the code that would make it complete faster. For example, we could reduce the values of the NB_VALUES or MAX_VALUE fields, but I think Olivier would consider that cheating. However, there is also a non-obvious change we could make.
For example, we could add a System.gc() call before the first call to System.currentTimeMillis().
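To make the placement concrete, here is how I read that suggestion: the call goes inside the benchmark loop, just before we take the first timestamp of each run (the GC log further down, with one artificial Full GC per run, is consistent with this reading):

    for (int run = 1; run <= NB_RUNS; run++) {
      System.gc(); // clean up the previous run's garbage before the clock starts
      long t1 = System.currentTimeMillis();
      for (int i = 0; i < NB_VALUES; i++) {
        boxedValues[i] = values[i];
      }
      long t2 = System.currentTimeMillis();
      System.out.printf("Run %2d : %4dms%n", run, t2 - t1);
    }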
In that case, the test runs with the following values:
Generating values...
Benchmarking...
Run 1 : 1720ms
Run 2 : 2645ms
Run 3 : 2240ms
Run 4 : 927ms
Run 5 : 918ms
Run 6 : 930ms
Run 7 : 965ms
Run 8 : 914ms
Run 9 : 903ms
Run 10 : 913ms
This time the average is 1308ms, with a variance of 430,000. Not a bad speedup at all: it is now 38% faster than before. It should be obvious that a large proportion of the time is wasted collecting all these unnecessary objects. If we run the code again with -verbose:gc turned on, we can see the costs involved:
Generating values...
[GC 40783K->39473K(83008K), 0.0041315 secs]
[Full GC 39473K->39465K(83008K), 0.0696936 secs]
[GC 78528K(123980K), 0.0002081 secs]
[GC 79209K(123980K), 0.0002467 secs]
Benchmarking...
[Full GC 79197K->78518K(150000K), 0.0828044 secs]
[GC 78518K(150000K), 0.0001254 secs]
[GC 78995K(150000K), 0.0006032 secs]
[GC 95542K->103000K(150000K), 0.3685923 secs]
[GC 103000K(150000K), 0.0021857 secs]
[GC 120024K->129617K(150000K), 0.1627059 secs]
[GC 146641K->155227K(172464K), 0.1826291 secs]
[GC 172251K->180831K(198000K), 0.1499428 secs]
[GC 197855K->206365K(223536K), 0.1794985 secs]
[GC 223389K->231900K(249072K), 0.1751786 secs]
[GC 248924K->257435K(274800K), 0.1594760 secs]
[GC 274459K->282969K(300144K), 0.2015765 secs]
Run 1 : 1774ms
[Full GC 283309K->282255K(300144K), 0.8866413 secs]
[GC 315119K->330066K(506684K), 0.3946753 secs]
[GC 364178K->398754K(506684K), 0.3282639 secs]
[GC 398754K(506684K), 0.0043726 secs]
[GC 432866K->449971K(506684K), 0.3649566 secs]
[Full GC 484083K->282879K(506684K), 1.1812640 secs]
Run 2 : 2708ms
[Full GC 284935K->282881K(507776K), 1.0651874 secs]
[GC 316993K->345922K(507776K), 0.2532635 secs]
[GC 380034K->399611K(507776K), 0.2922708 secs]
[GC 400297K(507776K), 0.0042360 secs]
[GC 433723K->450807K(507776K), 0.3415709 secs]
[Full GC 484919K->282884K(507776K), 1.0057979 secs]
Run 3 : 2324ms
[Full GC 283571K->282884K(507776K), 0.8885789 secs]
[GC 316996K->347295K(507776K), 0.2598463 secs]
[GC 381407K->400631K(507776K), 0.3051485 secs]
[GC 401318K(507776K), 0.0042195 secs]
[GC 434743K->451850K(507776K), 0.3479674 secs]
Run 4 : 1024ms
[Full GC 485962K->282884K(507776K), 1.0040985 secs]
[GC 316996K->347295K(507776K), 0.2591832 secs]
[GC 381407K->400631K(507776K), 0.2888177 secs]
[GC 401318K(507776K), 0.0042504 secs]
[GC 434743K->451850K(507776K), 0.3348886 secs]
Run 5 : 994ms
[Full GC 485962K->282884K(507776K), 1.0128758 secs]
[GC 316996K->347295K(507776K), 0.2580010 secs]
[GC 381407K->400631K(507776K), 0.2884526 secs]
[GC 402692K(507776K), 0.0060617 secs]
[GC 434743K->451848K(507776K), 0.3290486 secs]
Run 6 : 1004ms
[Full GC 485960K->282884K(507776K), 1.0040235 secs]
[GC 316996K->347295K(507776K), 0.2596790 secs]
[GC 381407K->400631K(507776K), 0.2851338 secs]
[GC 401318K(507776K), 0.0042191 secs]
[GC 434743K->451840K(507776K), 0.3340752 secs]
Run 7 : 989ms
[Full GC 485952K->282884K(507776K), 1.0022637 secs]
[GC 316996K->347295K(507776K), 0.2612456 secs]
[GC 381407K->400631K(507776K), 0.2933666 secs]
[GC 401318K(507776K), 0.0043201 secs]
[GC 434743K->451842K(507776K), 0.3280430 secs]
Run 8 : 997ms
[Full GC 485954K->282884K(507776K), 1.0126833 secs]
[GC 316996K->347295K(507776K), 0.2569432 secs]
[GC 381407K->400631K(507776K), 0.2866691 secs]
[GC 401318K(507776K), 0.0042206 secs]
[GC 434743K->451842K(507776K), 0.3418867 secs]
Run 9 : 1000ms
[Full GC 485954K->282884K(507776K), 1.0074827 secs]
[GC 316996K->347295K(507776K), 0.2594386 secs]
[GC 381407K->400631K(507776K), 0.2966164 secs]
[GC 401318K(507776K), 0.0042397 secs]
[GC 434743K->451840K(507776K), 0.3391696 secs]
Run 10 : 1009ms
[Full GC 485952K->400K(507776K), 0.3041337 secs]
The artificial GCs that we introduced with System.gc() are the Full GC entries that appear just before each run's timing starts. Since they happen before we take the first timestamp, their pauses are not counted in the run time, which is what makes our test appear faster. If we exclude them and add up the remaining GC pauses within each run, we can work out how much of each run is spent collecting garbage. For run 1, for example, the collections inside the timed interval add up to roughly 1583ms of the 1774ms total, leaving only 191ms for our actual program:
Run   GC (ms)   Program (ms)   Total (ms)
  1      1583            191         1774
  2      2274            434         2708
  3      1897            427         2324
  4       917            107         1024
  5       887            107          994
  6       882            122         1004
  7       883            106          989
  8       887            110          997
  9       890            110         1000
 10       899            101         1009
We can get even better figures from our program if, instead of elapsed time (System.currentTimeMillis()), we also measure user CPU time (ThreadMXBean.getCurrentThreadUserTime()). Here are the run times, with the elapsed time in the first column and the user CPU time in the second:
Generating values...
Benchmarking...
Run 1 : 1667ms 106ms
Run 2 : 2576ms 114ms
Run 3 : 2199ms 109ms
Run 4 : 897ms 109ms
Run 5 : 899ms 109ms
Run 6 : 899ms 109ms
Run 7 : 903ms 109ms
Run 8 : 889ms 109ms
Run 9 : 892ms 109ms
Run 10 : 896ms 108ms
Our average user CPU time is now 109ms with a variance of 4.
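In case you would like to reproduce that second column, here is a minimal, self-contained sketch of the measurement technique; the class name and the summing loop are my own, standing in for the boxing loop of Quiz44:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class UserTimeDemo {
  public static void main(String[] args) {
    ThreadMXBean mx = ManagementFactory.getThreadMXBean();
    long t1 = System.currentTimeMillis();
    long cpu1 = mx.getCurrentThreadUserTime(); // user CPU time in nanoseconds
    long sum = 0;
    for (int i = 0; i < 10000000; i++) {
      sum += i; // stand-in workload for the boxing loop in Quiz44
    }
    long t2 = System.currentTimeMillis();
    long cpu2 = mx.getCurrentThreadUserTime();
    System.out.printf("elapsed %dms, user CPU %dms (sum=%d)%n",
        t2 - t1, (cpu2 - cpu1) / 1000000, sum);
  }
}

Note that getCurrentThreadUserTime() returns nanoseconds, hence the division, and it returns -1 on platforms that do not support thread CPU time measurement.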
Let's get back to the problem. How do we make the test run 10 times faster with minimal code changes? My first approach was to simply delay the cost of GC until after the test has run through. Since I have 8GB of memory in my laptop, I can beef up the initial and maximum size of the heap to a huge amount. I also set the NewSize to a crazy amount, so that the GC never needs to run whilst the test is running.
$ java -showversion -Xmx10g -Xms10g -XX:NewSize=9g Quiz44
java version "1.6.0_24"
Java(TM) SE Runtime Environment (build 1.6.0_24-b07-334-10M3326)
Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02-334, mixed mode)
Generating values...
Benchmarking...
Run 1 : 198ms
Run 2 : 203ms
Run 3 : 179ms
Run 4 : 192ms
Run 5 : 179ms
Run 6 : 191ms
Run 7 : 183ms
Run 8 : 192ms
Run 9 : 180ms
Run 10 : 197ms
This gave me an average of 189ms with a slightly high variance of 75. Since my initial results had an average of 2105, I achieved Olivier's target of being at least 10x faster. In case you think what I've done is really stupid, I know of companies who use this trick to avoid all GC pauses. They try to never construct objects. They make the young generation very large. After running for several hours, they quit the application and start again. This gives them very predictable run times. So it's not as harebrained an idea as it appears at first.
In our case, things start to go wrong when we increase the number of runs. For example, let's change NB_RUNS to 100:
java version "1.6.0_24" Java(TM) SE Runtime Environment (build 1.6.0_24-b07-334-10M3326) Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02-334, mixed mode) Generating values... Benchmarking... Run 1 : 188ms Run 2 : 214ms Run 3 : 234ms Run 4 : 191ms Run 5 : 203ms Run 6 : 193ms Run 7 : 207ms Run 8 : 188ms Run 9 : 208ms Run 10 : 183ms Run 11 : 199ms Run 12 : 183ms Run 13 : 200ms Run 14 : 186ms Run 15 : 202ms Run 16 : 192ms Run 17 : 216ms Run 18 : 191ms Run 19 : 211ms Run 20 : 188ms Run 21 : 210ms Run 22 : 190ms Run 23 : 210ms Run 24 : 187ms Run 25 : 211ms Run 26 : 186ms Run 27 : 208ms Run 28 : 188ms Run 29 : 205ms Run 30 : 188ms Run 31 : 216ms Run 32 : 189ms Run 33 : 210ms Run 34 : 186ms Run 35 : 215ms Run 36 : 190ms Run 37 : 222ms Run 38 : 191ms Run 39 : 209ms Run 40 : 186ms Run 41 : 1068ms Run 42 : 7281ms Run 43 : 8705ms Run 44 : 8583ms Run 45 : 7675ms Run 46 : 6346ms Run 47 : 1625ms Run 48 : 896ms Run 49 : 834ms Run 50 : 948ms Run 51 : 554ms Run 52 : 901ms Run 53 : 904ms Run 54 : 4253ms Run 55 : 815ms Run 56 : 111ms Run 57 : 127ms Run 58 : 143ms Run 59 : 159ms Run 60 : 117ms Run 61 : 138ms Run 62 : 136ms Run 63 : 140ms Run 64 : 128ms Run 65 : 131ms Run 66 : 136ms Run 67 : 147ms Run 68 : 122ms Run 69 : 127ms Run 70 : 135ms Run 71 : 135ms Run 72 : 135ms Run 73 : 134ms Run 74 : 126ms Run 75 : 125ms Run 76 : 217ms Run 77 : 765ms Run 78 : 2973ms Run 79 : 2459ms Run 80 : 2180ms Run 81 : 4733ms Run 82 : 7484ms Run 83 : 3177ms Run 84 : 6209ms Run 85 : 5129ms Run 86 : 1051ms Run 87 : 5177ms Run 88 : 5515ms Run 89 : 6217ms Run 90 : 8865ms Run 91 : 7981ms Run 92 : 3578ms Run 93 : 3472ms Run 94 : 3645ms Run 95 : 4006ms Run 96 : 3933ms Run 97 : 4211ms Run 98 : 3127ms Run 99 : 3714ms Run 100 : 3811ms
This gives us an average of 1656 and variance of 6,000,000. There are also times when the system is almost unusable because of the time wasted in GC.
One of our Java Specialist Club members, Marek Bickos, quickly discovered a flag that made this code run much faster. Simply invoking the code with -XX:+AggressiveOpts caused no GC at all:
Generating values...
Benchmarking...
Run 1 : 56ms
Run 2 : 45ms
Run 3 : 39ms
Run 4 : 50ms
Run 5 : 66ms
Run 6 : 39ms
Run 7 : 39ms
Run 8 : 40ms
Run 9 : 46ms
Run 10 : 46ms
A bit of experimentation explained why this was so much faster. In modern JVMs, the autoboxing cache size is configurable. It used to be that all values from -128 to 127 were cached, but nowadays we can specify a different upper bound. When we run the code with -XX:+AggressiveOpts, it simply increases this value. We can try it out with this little program:
public class FindAutoboxingUpperBound {
  public static void main(String[] args) {
    int i = 0;
    while (isSame(i, i)) {
      i++;
    }
    System.out.println("Upper bound is " + (i - 1));
    i = 0;
    while (isSame(i, i)) {
      i--;
    }
    System.out.println("Lower bound is " + (i + 1));
  }

  private static boolean isSame(Integer i0, Integer i1) {
    // Reference comparison: only true while both arguments are boxed
    // to the same cached Integer instance.
    return i0 == i1;
  }
}
If we run this without any flags, it comes back as:
Upper bound is 127
Lower bound is -128
With -XX:+AggressiveOpts, we get these results:
Upper bound is 20000
Lower bound is -128
It is thus pretty obvious why it is so much faster. Since our values all lie between 0 and 999, and the cache now extends to 20000, every boxing operation finds its Integer in the cache instead of constructing a new object.
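The boxing in our benchmark loop compiles down to a call to Integer.valueOf(int), which is where the cache is consulted. This little class (my own illustration, not part of the quiz) shows the effect: with default settings only 5 comes back as cached, whereas with -XX:+AggressiveOpts the value 500 does too:

public class CacheEffect {
  public static void main(String[] args) {
    int[] samples = {5, 500, 50000};
    for (int value : samples) {
      Integer boxed = value;                  // javac emits Integer.valueOf(value)
      Integer again = Integer.valueOf(value); // the same call, written explicitly
      System.out.println(value + " served from cache: " + (boxed == again));
    }
  }
}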
If we increase the MAX_VALUE to 100000, and run the program with -XX:+AggressiveOpts and -Xmx500m, we get these results:
Generating values...
Benchmarking...
Run 1 : 1622ms
Run 2 : 1351ms
Run 3 : 2675ms
Run 4 : 2037ms
Run 5 : 2322ms
Run 6 : 2325ms
Run 7 : 2195ms
Run 8 : 2394ms
Run 9 : 2264ms
Run 10 : 2109ms
However, with my flags of "-Xmx10g -Xms10g -XX:NewSize=9g", it is still as fast as before.
The question is thus, which is the better solution? I think it depends very much on how this is going to be used in the real world. My solution would work with any objects that are being made, not just Integers. However, we also saw the serious downside to my approach. Once we do fill up our young space, the JVM will get very slow, at least for a while.
The solution Olivier was looking for was to set the maximum number of values in the Integer cache using the system property java.lang.Integer.IntegerCache.high, for example:
java -Djava.lang.Integer.IntegerCache.high=1000 Quiz44
Generating values...
Benchmarking...
Run 1 : 40ms
Run 2 : 54ms
Run 3 : 48ms
Run 4 : 49ms
Run 5 : 66ms
Run 6 : 48ms
Run 7 : 48ms
Run 8 : 49ms
Run 9 : 48ms
Run 10 : 66ms
The -XX:AutoBoxCacheMax=1000 flag would do the same thing. Whilst it is possible to specify the maximum Integer in the cache, no doubt to satisfy some silly microbenchmark to prove that Java is at least as fast as Scala, it is currently not possible to set the minimum Integer.
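For completeness, the flag version of that invocation would look like this; I have not repeated the output, but it should be in the same region as the run above:

java -XX:AutoBoxCacheMax=1000 Quiz44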
Heinz
We are always happy to receive comments from our readers. Feel free to send me a comment via email or discuss the newsletter in our JavaSpecialists Slack Channel (Get an invite here)
We deliver relevant courses, by top Java developers to produce more resourceful and efficient programmers within their organisations.