Abstract: Java 8 HashMap has been optimized to protect against denial-of-service attacks that flood it with many distinct keys sharing identical hash codes. Unfortunately, performance might still degrade if you use your own classes as keys. In this newsletter we show a tool that you can use to inspect your HashMap and view the key distribution within the buckets.
Welcome to the 235th edition of The Java(tm) Specialists' Newsletter, sent on the 29th February. I looked back through 16 years of writing newsletters, and this is the first time I'm sending it out on a leap day. For fun, you should ask some non-programmers to tell you how the leap-year rule works and see if anyone knows :-) (Hint: it's not year%4 == 0)
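For the curious, here is the full Gregorian rule, sketched as a small Java class (the class and method names are my own, purely for illustration):

```java
public class LeapYear {
  // A year is a leap year if it is divisible by 4, except for
  // century years, unless those are also divisible by 400.
  static boolean isLeapYear(int year) {
    return year % 4 == 0 && (year % 100 != 0 || year % 400 == 0);
  }

  public static void main(String... args) {
    System.out.println(isLeapYear(2016)); // true
    System.out.println(isLeapYear(1900)); // false - century year
    System.out.println(isLeapYear(2000)); // true - divisible by 400
  }
}
```

(Since Java 8 you can also simply call java.time.Year.isLeap(year).)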
Whilst spending a dark Christmas in England, I had the opportunity to read How to Fail at Almost Everything and Still Win Big: Kind of the Story of My Life [ISBN 1591846919] , an autobiography by our favourite Dilbert cartoonist. The book changed my life. As soon as I got back to Crete, I set up a salad bar in my fridge. So far this year I've lost 6 kg despite three business trips. I did my fastest 3 km run ever yesterday. You might not realize this, but our jobs as Java programmers are high risk. Cardiovascular diseases are the leading cause of death globally. Adams' book helped me get my life in order in a matter of weeks.
javaspecialists.teachable.com: Please visit our new self-study course catalog to see how you can upskill your Java knowledge.
In the beginning, we had Hashtable. It was synchronized. It had a threading bug that was only removed in Java 7 (finding it is left as an exercise for the reader). It did not support compound operations. Hashtable used the traditional approach of a sparse array of buckets, assigning each entry to the bucket at the offset given by the remainder of the key's hashCode() divided by the table length.
When HashMap was added in Java 2, it used a similar way of assigning entries to buckets using the remainder operator (%). In Java 4, Joshua Bloch introduced a new approach using bit masking (&). It is becoming common knowledge that divide is one of the most expensive arithmetic operations that our CPU can do. Think back to junior school, when your sadistic maths teacher would punish you with long division: "Jimmy, what is 151234 / 124? You have 5 minutes!" Besides satisfying a deep-seated desire to inflict pain on others, my teachers also kept on repeating the same mantra: "You will not always have a calculator with you!" Um. iPhone anyone? Think back on those dark days when you were forced to do nasty divisions and you will know just how our CPU feels when we throw % at it.
Bit masking is far more pleasant. It's like asking us to do a long division - by a number that's a power of 10. So what is 123211237126 / 10000? Sweet, it's 12321123.7126 - just move the decimal by as many places as you have zeros. It is thus self-explanatory that & would be an improvement over %.
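To make the contrast concrete, here is a sketch of the two indexing schemes (class and method names are mine); note that masking is only correct when the table length is a power of two:

```java
public class BucketIndex {
  // Hashtable-style: strip the sign bit, then take the remainder.
  // Works for any table length, but needs an expensive divide.
  static int indexWithRemainder(int hash, int length) {
    return (hash & 0x7FFFFFFF) % length;
  }

  // HashMap-style: keep only the low bits. Only correct when
  // length is a power of two, but costs just one cheap AND.
  static int indexWithMask(int hash, int length) {
    return hash & (length - 1);
  }

  public static void main(String... args) {
    int hash = "hello".hashCode();
    System.out.println(indexWithRemainder(hash, 101));
    System.out.println(indexWithMask(hash, 128));
    // The mask ignores the high bits completely, so hash codes
    // that differ only up there always land in the same bucket:
    System.out.println(
        indexWithMask(0x10000, 16) == indexWithMask(0x20000, 16)); // true
  }
}
```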
There is just one catch. Bit masking is far more sensitive to hash codes that differ mainly in their higher bits, since the mask simply throws those bits away. Bloch solved it in the first version of Java 1.4 by applying a second hash to the result of our hash function. His hash didn't lose bits, so if you had written a "perfect hash function" (every distinct object also has a unique hash code value), then you would still have a perfect hash function after the second hash. Here is how version 1 of the hash() looked (required some serious code archeology on my side to dig that out):
int hash(Object key) {
  int h = key == null ? 0 : key.hashCode();
  h += ~(h << 9);
  h ^= (h >>> 14);
  h += (h << 4);
  h ^= (h >>> 10);
  return h;
}
When I look at code like that I go "Yep, should probably work". I have no idea why it would, but it looks complicated enough that the bits should be moved along down without losing any. It turned out, however, that for certain keys, there were quite a few collisions in the table buckets, even with a perfect hash function. The HashMap at the time used a linked list of entries when we had collisions. If you were unlucky, you could end up with your HashMap or Hashtable degrading to O(n).
A few releases later, they improved this hash() inside the HashMap to:
int hash(Object key) {
  int h = key == null ? 0 : key.hashCode();
  h ^= (h >>> 20) ^ (h >>> 12);
  return h ^ (h >>> 7) ^ (h >>> 4);
}
Again, all I could say was: "Looks better ... I guess."
Turns out, in all the tests I did, the newer function was indeed better and performance improved.
For a while, some nasty denial-of-service attacks had been going around, where you would send a server lots of distinct String objects that all had exactly the same hash code. For extra effect, you would ensure that the hashCode() of each String was equal to 0, so that String's caching of the hash value never took effect and it was recalculated on every call.
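Both ingredients of that attack are easy to demonstrate. "Aa" and "BB" are the classic example of distinct strings with identical hash codes, and any string whose hash code is 0 defeats String's caching, because 0 doubles as the "not yet computed" sentinel (class name below is mine):

```java
public class StringClashDemo {
  public static void main(String... args) {
    // Two distinct strings, identical hash codes:
    // 'A'*31 + 'a' = 65*31 + 97 = 2112 = 66*31 + 66 = 'B'*31 + 'B'
    System.out.println("Aa".hashCode()); // 2112
    System.out.println("BB".hashCode()); // 2112
    // Longer colliding strings can be built by concatenating
    // colliding blocks:
    System.out.println("AaAa".hashCode() == "BBBB".hashCode()); // true
    // A hash code of 0 is never cached by String:
    System.out.println("".hashCode()); // 0
  }
}
```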
To address this issue, HashMap was modified in Java 8 to store nodes inside a binary tree if the number of clashes exceeded 11 (Thanks to Olaf Fricke for pointing this out). This would slash the complexity to O(log n). The denial-of-service attacks stopped. Life was good. So good in fact, that it was decided to replace that rather complicated "Bloch approved" hash() shown above with a simpler, faster hash():
int hash(Object key) {
  int h;
  return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
"Um", I thought, "that's just gotta be worse for the key distribution!"
The question was - how could I test this? I filled a hash map with entries and then tried to view this with a debugger. Unfortunately IntelliJ IDEA did not let me peep into the gory details of HashMap, but instead presented me with a beautifully sanitised view where no clashes occurred.
Since I was really interested in seeing how many clashes I could have per bucket, even with a perfect hash code, I wrote this little utility to make sense of what is happening in the buckets:
import java.lang.reflect.*;
import java.util.*;

public class MapClashInspector {
  private interface MapProcessor {
    void beginBucket();
    void process(Map.Entry<?, ?> node);
    void endBucket(Map<Integer, Integer> count);
  }

  /**
   * Returns a map showing as key the number of clashes and
   * as value the number of entries with identical hash codes.
   * With a "perfect" hash function, the map will contain only
   * one entry with 1 as a key and the number of entries in the
   * map as a value.
   */
  public static Map<Integer, Integer> getHashClashDistribution(
      Map<?, ?> map)
      throws NoSuchFieldException, IllegalAccessException {
    return getBucketDistribution(map, new MapProcessor() {
      private final Map<Integer, Integer> numberOfClashes =
          new HashMap<Integer, Integer>();
      public void beginBucket() {
        numberOfClashes.clear();
      }
      public void process(Map.Entry<?, ?> node) {
        increment(numberOfClashes, node.getKey().hashCode());
      }
      public void endBucket(Map<Integer, Integer> count) {
        for (Integer val : numberOfClashes.values()) {
          increment(count, val);
        }
      }
    });
  }

  /**
   * Returns a map showing as key the number of clashes and
   * as value the number of buckets with this number of clashes.
   * In a "perfect" distribution, we would have
   * 1->numberOfEntriesInMap. The worst possible distribution
   * is numberOfEntriesInMap->1, where all the entries go into a
   * single bucket. It also shows the number of empty buckets.
   * The Java 8 HashMap copes well with clashes, but earlier
   * versions would become very slow due to O(n) lookup.
   */
  public static Map<Integer, Integer> getBucketClashDistribution(
      Map<?, ?> map)
      throws NoSuchFieldException, IllegalAccessException {
    return getBucketDistribution(map, new MapProcessor() {
      private int size;
      public void beginBucket() {
        size = 0;
      }
      public void process(Map.Entry<?, ?> node) {
        size++;
      }
      public void endBucket(Map<Integer, Integer> count) {
        increment(count, size);
      }
    });
  }

  /**
   * Increment the value if already exists; otherwise set to 1.
   */
  private static void increment(
      Map<Integer, Integer> map, int size) {
    Integer counter = map.get(size);
    if (counter == null) {
      map.put(size, 1);
    } else {
      map.put(size, counter + 1);
    }
  }

  private static Map<Integer, Integer> getBucketDistribution(
      Map<?, ?> map, MapProcessor processor)
      // Since Java 7, we can throw ReflectiveOperationException
      throws NoSuchFieldException, IllegalAccessException {
    Map.Entry<?, ?>[] table = getTable(map);
    Field nextNodeField = getNextField(table);
    Map<Integer, Integer> numberPerBucket =
        new TreeMap<Integer, Integer>();
    for (Map.Entry<?, ?> node : table) {
      processor.beginBucket();
      while (node != null) {
        processor.process(node);
        node = (Map.Entry<?, ?>) nextNodeField.get(node);
      }
      processor.endBucket(numberPerBucket);
    }
    return numberPerBucket;
  }

  private static Map.Entry<?, ?>[] getTable(Map<?, ?> map)
      throws NoSuchFieldException, IllegalAccessException {
    Field tableField = map.getClass().getDeclaredField("table");
    tableField.setAccessible(true);
    return (Map.Entry<?, ?>[]) tableField.get(map);
  }

  private static Field getNextField(Object table)
      throws NoSuchFieldException {
    Class<?> nodeType = table.getClass().getComponentType();
    Field nextNodeField = nodeType.getDeclaredField("next");
    nextNodeField.setAccessible(true);
    return nextNodeField;
  }
}
A few words on the constraints in the code above. I wanted it to run on Java 6, so that we can compare the evolution of java.util.HashMap through generations of Java. Thus I don't use the lovely new Streams API to process the table, and I even give the diamond operator <> a miss.
When we run this on different versions of Java, we see how the distributions have changed. Prior to Java 8, a perfect hash function meant a pretty good chance that the entries would be held in separate buckets. Now we need to be more careful. For example, consider the famous Effective Java approach, also used in Objects.hash(), where we incrementally multiply the current value by the prime number 31 before adding the next value. This works brilliantly with Strings, but not so well in other scenarios. The reason is that most characters we use in Java fall within a range of about 26, thanks to the Latin alphabet, so a multiplier of 31 is big enough to keep the contributions of successive characters apart. Our pixel coordinates, on the other hand, span values far larger than 31, so clashes are inevitable.
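A quick sketch (class and method names are mine; the hash has the same shape as the Effective Java recipe for two int fields) shows why 31 is too small for pixel coordinates: whenever x increases by 1 and y decreases by 31, the hash code is unchanged.

```java
public class EffectiveJavaClash {
  // Same shape as the Effective Java / Objects.hash() recipe
  // for a class with two int fields:
  static int hash(int x, int y) {
    return x * 31 + y;
  }

  public static void main(String... args) {
    // Distinct pixels, identical hash codes:
    System.out.println(hash(0, 31) == hash(1, 0)); // true
    System.out.println(hash(3, 100) == hash(5, 38)); // true, both 193
    // With y ranging over 0..767, up to 25 distinct pixels
    // (767 / 31 + 1) share one hash code - which matches the
    // "25: 24000" line in the distributions shown further down.
  }
}
```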
In order to test different approaches, I wrote a Pixel class, containing fields "x" and "y". Now please don't complain about these variable names :-) They are the x and y coordinates of a pixel. It would be hard to come up with better or more descriptive names. I also used a bunch of hash code functions that we can select for each run.
Also, you will notice that my Pixel class implements Comparable. Since Java 8 it is VERY IMPORTANT to always make your custom keys for HashMap implement Comparable. HashMap now puts your entries into a tree to avoid degradation to O(n) when the entries map to the same bucket. It somehow needs to know the relative order of the keys. If they are strings, then comparing them is trivial. Otherwise, if the key class implements Comparable it uses the compareTo() function. If not, it will eventually use System.identityHashCode() to determine the sort order. Good luck then.
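To see the tree bins in action, here is a minimal sketch (class and method names are my own) using a worst-case key with a constant hash code. Because the key implements Comparable, Java 8 can still arrange the single overflowing bucket as a tree, so this completes quickly; with the pre-Java 8 linked-list buckets, the same loop would crawl along at O(n) per insert.

```java
import java.util.*;

public class TreeBinDemo {
  // A deliberately terrible key: constant hash code, but
  // Comparable, so the Java 8 HashMap can build a tree bin.
  static final class Key implements Comparable<Key> {
    final int id;
    Key(int id) { this.id = id; }
    public int hashCode() { return 0; } // every key clashes
    public boolean equals(Object o) {
      return o instanceof Key && ((Key) o).id == id;
    }
    public int compareTo(Key o) { return Integer.compare(id, o.id); }
  }

  static Map<Key, Integer> fill(int n) {
    Map<Key, Integer> map = new HashMap<>();
    for (int i = 0; i < n; i++) {
      map.put(new Key(i), i);
    }
    return map;
  }

  public static void main(String... args) {
    // All entries land in a single bucket, yet on Java 8 this
    // finishes quickly thanks to the O(log n) tree bin.
    Map<Key, Integer> map = fill(100_000);
    System.out.println(map.size()); // 100000
    System.out.println(map.get(new Key(42))); // 42
  }
}
```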
Here is our Pixel class, with a bunch of hashing algorithms embedded inside:
public class Pixel implements Comparable<Pixel> {
  public enum HASH_ALGO {
    EFFECTIVE_JAVA {
      public int calc(Pixel pixel) {
        return pixel.x * 31 + pixel.y; // also in Objects.hash()
      }
    },
    BAD_BIT_SHIFT_XOR {
      public int calc(Pixel pixel) {
        // precedence rules mean that this evaluates to
        // pixel.x << (16 + pixel.y)
        return pixel.x << 16 + pixel.y;
      }
    },
    BIT_SHIFT_XOR {
      public int calc(Pixel pixel) {
        return pixel.x << 16 ^ pixel.y; // perfect hash
      }
    },
    MULTIPLY_LARGE_PRIME {
      public int calc(Pixel pixel) {
        return pixel.x * 65521 + pixel.y; // perfect hash
      }
    },
    MULTIPLY_JUST_ENOUGH {
      public int calc(Pixel pixel) {
        return pixel.x * 768 + pixel.y; // perfect hash
      }
    },
    APPENDING_STRINGS {
      public int calc(Pixel pixel) {
        // no idea if perfect hash, also creates lots of garbage
        return (pixel.x + "-" + pixel.y).hashCode();
      }
    },
    TERRIBLE {
      public int calc(Pixel pixel) {
        // always the same value, just bad bad bad
        return 0;
      }
    };

    public abstract int calc(Pixel pixel);
  }

  private static HASH_ALGO algo = HASH_ALGO.EFFECTIVE_JAVA;

  public static void setAlgo(HASH_ALGO algo) {
    Pixel.algo = algo;
  }

  private final int x;
  private final int y;

  public Pixel(int x, int y) {
    this.x = x;
    this.y = y;
  }

  public boolean equals(Object o) {
    if (this == o) return true;
    if (o == null || getClass() != o.getClass()) return false;
    Pixel pixel = (Pixel) o;
    return x == pixel.x && y == pixel.y;
  }

  public int hashCode() {
    return algo.calc(this);
  }

  public int compareTo(Pixel o) {
    int result = compare(x, o.x);
    if (result == 0) {
      result = compare(y, o.y);
    }
    return result;
  }

  private int compare(int x1, int x2) {
    return (x1 < x2) ? -1 : ((x1 == x2) ? 0 : 1);
  }
}
I don't want to be drawn into producing yet another broken microbenchmark, so instead I am just looking at the complexity of the various hashing approaches. Here is a class that prints out the various hashing distributions. For each hashing algorithm, it first prints out how many pixels have the same hashCode() value, and then the clash distribution in the buckets. If all the hashCode() values are distinct, then this is a "perfect hash function" and I print out "Perfect!!!". Similarly, if each bucket holds at most one entry, I also print out "Perfect!!!". My test assumes a 1024x768 size display.
import java.util.*;

public class MapBucketFillTest {
  public static final int WIDTH = 1024;
  public static final int HEIGHT = 768;

  public static void main(String... args) throws Exception {
    for (Pixel.HASH_ALGO algo : Pixel.HASH_ALGO.values()) {
      testWith(algo);
    }
  }

  private static void testWith(Pixel.HASH_ALGO algo)
      throws NoSuchFieldException, IllegalAccessException {
    System.out.println("Testing with " + algo);
    Pixel.setAlgo(algo);
    Map<Pixel, Integer> pixels = new HashMap<Pixel, Integer>();
    for (int x = 0; x < WIDTH; x++) {
      for (int y = 0; y < HEIGHT; y++) {
        pixels.put(new Pixel(x, y), 33);
      }
    }
    System.out.println("Hash clash print: ");
    Map<Integer, Integer> hashClashes =
        MapClashInspector.getHashClashDistribution(pixels);
    printClashes(hashClashes);
    System.out.println("Bucket entry clash print: ");
    Map<Integer, Integer> bucketClashes =
        MapClashInspector.getBucketClashDistribution(pixels);
    printClashes(bucketClashes);
    System.out.println();
  }

  private static void printClashes(
      Map<Integer, Integer> clashes) {
    if (isPerfect(clashes)) {
      System.out.println("Perfect!!!");
    }
    for (Map.Entry<Integer, Integer> e : clashes.entrySet()) {
      System.out.println(e.getKey() + ": " + e.getValue());
    }
  }

  private static boolean isPerfect(
      Map<Integer, Integer> clashes) {
    Integer n = clashes.get(1);
    return n != null && n == WIDTH * HEIGHT;
  }
}
Here are the results for Java 6 and 7. I've added some comments in bold.
java version "1.6.0_65"
Java(TM) SE Runtime Environment (build 1.6.0_65-b14-462-11M4609)
Java HotSpot(TM) 64-Bit Server VM, mixed mode

Testing with EFFECTIVE_JAVA
Hash clash print:
1: 62    2: 62    3: 62    4: 62    5: 62    6: 62
7: 62    8: 62    9: 62    10: 62   11: 62   12: 62
13: 62   14: 62   15: 62   16: 62   17: 62   18: 62
19: 62   20: 62   21: 62   22: 62   23: 62
24: 7055
25: 24000    We see that a LOT of keys have the same hash code
Bucket entry clash print:
0: 1016095
1: 62    2: 62    3: 62    4: 62    5: 62    6: 62
7: 62    8: 62    9: 62    10: 62   11: 62   12: 62
13: 62   14: 62   15: 62   16: 62   17: 62   18: 62
19: 62   20: 62   21: 62   22: 62   23: 62
24: 7055
25: 24000

Testing with BAD_BIT_SHIFT_XOR
Hash clash print:
24: 6144    This one is much worse than "Effective Java"
48: 2944    72: 1536    96: 736     120: 352    144: 168
168: 144    192: 70     216: 34     240: 23     264: 5
288: 2      312: 1      336: 1      360: 32     384: 16
408: 8      432: 4      456: 2      480: 1      504: 1
744: 16     768: 8      792: 4      816: 2      840: 1
864: 1      1512: 8     1536: 4     1560: 2     1584: 1
1608: 1     3048: 4     3072: 2     3096: 1     3120: 1
6120: 2     6144: 1     6168: 1     12264: 1    12288: 1
24552: 1    25080: 1
Bucket entry clash print:
0: 1036324
24: 6110    48: 2936    72: 1533    96: 738     120: 351
144: 172    168: 144    192: 74     216: 34     240: 23
264: 5      288: 2      312: 1      336: 1      360: 32
384: 16     408: 8      432: 4      456: 2      480: 1
504: 1      744: 16     768: 8      792: 4      816: 2
840: 1      864: 1      1512: 8     1536: 4     1560: 2
1584: 1     1608: 1     3048: 4     3072: 2     3096: 1
3120: 1     6120: 2     6144: 1     6168: 1     12264: 1
12288: 1    24552: 1    25080: 1

Testing with BIT_SHIFT_XOR
Hash clash print:
Perfect!!!
1: 786432    Fantastic
Bucket entry clash print:
Perfect!!!
0: 262144
1: 786432

Testing with MULTIPLY_LARGE_PRIME
Hash clash print:
Perfect!!!
1: 786432    Fantastic
Bucket entry clash print:
0: 436615
1: 453689    Pretty good, but not perfect
2: 142073
3: 16199

Testing with MULTIPLY_JUST_ENOUGH
Hash clash print:
Perfect!!!
1: 786432
Bucket entry clash print:
Perfect!!!
0: 262144
1: 786432    Another perfect distribution

Testing with APPENDING_STRINGS
Hash clash print:
Perfect!!!
1: 786432    Perfect, but only for that screen size.
Bucket entry clash print:
0: 494197
1: 373778    Interestingly, this is not a perfect bucket fill!
2: 138513
3: 34219
4: 6610
5: 1060
6: 168
7: 25
8: 6

Testing with TERRIBLE
This didn't finish, because it degraded to O(n) or worse
Let's have a look at Java 8. Again I will mark comments in bold. In all cases, the hash clash distributions will be identical to Java 6 and 7, since the Pixel hash functions do not depend on the Java version. The difference is in the "Bucket entry clash" distributions. For brevity, I've left out most of the "Hash clash print" sections from the output.
Testing with EFFECTIVE_JAVA
Bucket entry clash print:
0: 1016095    Same as Java 7
1: 62    2: 62    3: 62    4: 62    5: 62    6: 62
7: 62    8: 62    9: 62    10: 62   11: 62   12: 62
13: 62   14: 62   15: 62   16: 62   17: 62   18: 62
19: 62   20: 62   21: 62   22: 62   23: 62
24: 7055
25: 24000

Testing with BAD_BIT_SHIFT_XOR
Bucket entry clash print:
Hard to say. Probably worse than Java 7, BUT since the keys
are stored in trees our complexity is still manageable.
0: 1038431
24: 4504    48: 2781    72: 918     96: 777     120: 278
144: 258    168: 95     192: 109    216: 37     240: 115
264: 25     288: 52     312: 14     336: 23     360: 8
384: 10     408: 3      432: 5      456: 33     480: 2
504: 16     528: 1      552: 9      600: 4      648: 2
696: 1      744: 1      864: 16     912: 8      960: 4
1008: 2     1056: 1     1104: 1     1656: 8     1704: 4
1752: 2     1800: 1     1848: 1     3216: 4     3264: 2
3312: 1     3360: 1     6312: 2     6360: 1     6408: 1
12480: 1    12528: 1    24792: 1    25080: 1

Testing with BIT_SHIFT_XOR
Hash clash print:
Perfect!!!
1: 786432
Bucket entry clash print:
0: 1032192
48: 16384    Much worse. Tests seem to indicate 3x slower.

Testing with MULTIPLY_LARGE_PRIME
Hash clash print:
Perfect!!!
1: 786432
Bucket entry clash print:
0: 794368
1: 11097
2: 30199
3: 144588    Quite a bit worse than Java 7.
4: 61031
5: 6709
6: 584

Testing with MULTIPLY_JUST_ENOUGH
Hash clash print:
Perfect!!!
1: 786432
Bucket entry clash print:
Perfect!!!
0: 262144
1: 786432    Same as Java 7.

Testing with APPENDING_STRINGS
Hash clash print:
Perfect!!!
1: 786432
Bucket entry clash print:
0: 498369
1: 365636    Similar to Java 7.
2: 141387
3: 35688
4: 6572
5: 876
6: 46
7: 2

Testing with TERRIBLE
Hash clash print:
786432: 1
Bucket entry clash print:
0: 1048575
786432: 1    Actually completed! It didn't in Java 7.
I have a few conclusions from this inspection of the new Java 8 HashMap:
I hope you enjoyed this newsletter and that it was useful to you.
Kind regards from a windy Crete
Heinz
We are always happy to receive comments from our readers. Feel free to send me a comment via email or discuss the newsletter in our JavaSpecialists Slack Channel (Get an invite here)
We deliver relevant courses, by top Java developers to produce more resourceful and efficient programmers within their organisations.