Running on Java 24-ea+24-2960 (Preview)
Home of The JavaSpecialists' Newsletter

235Checking HashMaps with MapClashInspector

Author: Dr. Heinz M. KabutzDate: 2016-02-29Java Version: 5Category: Performance
 

Abstract: Java 8 HashMap has been optimized to avoid denial of service attacks with many distinct keys containing identical hash codes. Unfortunately performance might degrade if you use your own keys. In this newsletter we show a tool that you can use to inspect your HashMap and view the key distribution within the buckets.

 

Welcome to the 235th edition of The Java(tm) Specialists' Newsletter, sent on the 29th February. I looked back through 16 years of writing newsletters, and this is the first time I'm sending it out on a leap day. For fun, you should ask some non-programmers to tell you how the leap-year rule works and see if anyone knows :-) (Hint: it's not year%4 == 0)

Whilst spending a dark Christmas in England, I had the opportunity to read How to Fail at Almost Everything and Still Win Big: Kind of the Story of My Life [ISBN 1591846919] , an autobiography by our favourite Dilbert cartoonist. The book changed my life. As soon as I got back to Crete, I set up a salad bar in my fridge. So far this year I've lost 6 kg despite three business trips. I did my fastest 3 km run ever yesterday. You might not realize this, but our jobs as Java programmers are high risk. Cardiovascular diseases are the leading cause of death globally. Adams' book helped me get my life in order in a matter of weeks.

javaspecialists.teachable.com: Please visit our new self-study course catalog to see how you can upskill your Java knowledge.

Checking HashMaps with MapClashInspector

In the beginning, we had Hashtable. It was synchronized. It had a threading bug that was only removed in Java 7 (exercise for the reader is to find it). It did not support compound operations. Hashtable used the traditional approach of a sparse array containing the table entries, each offset by the remainder of the key's hashCode() function.

When HashMap was added in Java 2, it used a similar way of assigning entries to buckets using the remainder operator (%). In Java 4, Joshua Bloch introduced a new approach using bit masking (&). It is becoming common knowledge that divide is one of the most expensive arithmetic operation that our CPU can do. Think back to junior school, when your sadistic maths teacher would punish you with long division: "Jimmy, what is 151234 / 124? You have 5 minutes!" Besides satisfying a deep-seated desire to inflict pain on others, my teachers also kept on repeating the same mantra: "You will not always have a calculator with you!" Um. iPhone anyone? Think back on those dark days where you were forced to do nasty divisions and you will know just how our CPU feels when we throw % at it.

Bit masking is far more pleasant. It's like asking us to do a long division - by a number that's a power of 10. So what is 123211237126 / 10000? Sweet, it's 12321123.7126 - just move the decimal by as many places as you have zeros. It is thus self-explanatory that & would be an improvement over %.

There is just one catch. Bit masking is far more sensitive to hash codes that have mostly the higher bits set. Bloch solved it in the first version of Java 1.4 by doing a second hash on our hash function. His hash didn't lose bits, so if you had written a "perfect hash function" (every distinct object also has a unique hash code value), then you would still have a perfect hash function after the second hash. Here is how version 1 of the hash() looked (required some serious code archeology on my side to dig that out):

int hash(Object key) {
  int h = key == null ? 0 : key.hashCode();
  h += ~(h << 9);
  h ^= (h >>> 14);
  h += (h << 4);
  h ^= (h >>> 10);
  return h;
}

When I look at code like that I go "Yep, should probably work". I have no idea why it would, but it looks complicated enough that the bits should be moved along down without losing any. It turned out, however, that for certain keys, there were quite a few collisions in the table buckets, even with a perfect hash function. The HashMap at the time used a linked list of entries when we had collisions. If you were unlucky, you could end up with your HashMap or Hashtable degrading to O(n).

A few releases later, they improved this hash() inside the HashMap to:

int hash(Object key) {
  int h = key == null ? 0 : key.hashCode();
  h ^= (h >>> 20) ^ (h >>> 12);
  return h ^ (h >>> 7) ^ (h >>> 4);
}

Again, all I could say was: "Looks better ... I guess."

Turns out, in all the tests I did, the newer function was indeed better and performance improved.

Tree of Entries Inside HashMap

For a while now already, there were some nasty denial-of-service attacks going around where you would send lots of String objects to a server that would always result in the same identical hash code. For extra effect, you would ensure that the hashCode() of the String would equal 0, thus the cached value did not work properly and it would recalculate every time.

To address this issue, HashMap was modified in Java 8 to store nodes inside a binary tree if the number of clashes exceeded 11 (Thanks to Olaf Fricke for pointing this out). This would slash the complexity to O(log n). The denial-of-service attacks stopped. Life was good. So good in fact, that it was decided to replace that rather complicated "Bloch approved" hash() shown above with a simpler, faster hash():

int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

"Um", I thought, "that's just gotta be worse for the key distribution!"

The question was - how could I test this? I filled a hash map with entries and then tried to view this with a debugger. Unfortunately IntelliJ IDEA did not let me peep into the gory details of HashMap, but instead presented me with a beautifully sanitised view where no clashes occurred.

Since I was really interested in seeing how many clashes I could have per bucket, even with a perfect hash code, I wrote this little utility to make sense of what is happening in the buckets:

import java.lang.reflect.*;
import java.util.*;

public class MapClashInspector {
  private interface MapProcessor {
    void beginBucket();

    void process(Map.Entry<?, ?> node);

    void endBucket(Map<Integer, Integer> count);
  }

  /**
   * Returns a map showing as key the number of clashes and
   * as value the number of entries with identical hash codes.
   * With a "perfect" hash function, the map will contain only
   * one entry with 1 as a key and the number of entries in the
   * map as a value.
   */
  public static Map<Integer, Integer> getHashClashDistribution(
      Map<?, ?> map)
      throws NoSuchFieldException, IllegalAccessException {
    return getBucketDistribution(map, new MapProcessor() {
      private final Map<Integer, Integer> numberOfClashes =
          new HashMap<Integer, Integer>();

      public void beginBucket() {
        numberOfClashes.clear();
      }

      public void process(Map.Entry<?, ?> node) {
        increment(numberOfClashes, node.getKey().hashCode());
      }

      public void endBucket(Map<Integer, Integer> count) {
        for (Integer val : numberOfClashes.values()) {
          increment(count, val);
        }
      }
    });
  }

  /**
   * Returns a map showing as key the number of clashes and
   * as value the number of buckets with this number of clashes.
   * In a "perfect" distribution, we would have
   * 1->numberOfEntriesInMap.  The worst possible distribution
   * is numberOfEntriesInMap->1, where all the entries go into a
   * single bucket.  It also shows the number of empty buckets.
   * The Java 8 HashMap copes well with clashes, but earlier
   * versions would become very slow due to O(n) lookup.
   */
  public static Map<Integer, Integer> getBucketClashDistribution(
      Map<?, ?> map)
      throws NoSuchFieldException, IllegalAccessException {
    return getBucketDistribution(map, new MapProcessor() {
      private int size;

      public void beginBucket() {
        size = 0;
      }

      public void process(Map.Entry<?, ?> node) {
        size++;
      }

      public void endBucket(Map<Integer, Integer> count) {
        increment(count, size);
      }
    });
  }

  /**
   * Increment the value if already exists; otherwise set to 1.
   */
  private static void increment(
      Map<Integer, Integer> map, int size) {
    Integer counter = map.get(size);
    if (counter == null) {
      map.put(size, 1);
    } else {
      map.put(size, counter + 1);
    }
  }

  private static Map<Integer, Integer> getBucketDistribution(
      Map<?, ?> map, MapProcessor processor)
    // Since Java 7, we can throw ReflectiveOperationException
      throws NoSuchFieldException, IllegalAccessException {
    Map.Entry<?, ?>[] table = getTable(map);
    Field nextNodeField = getNextField(table);
    Map<Integer, Integer> numberPerBucket =
        new TreeMap<Integer, Integer>();
    for (Map.Entry<?, ?> node : table) {
      processor.beginBucket();
      while (node != null) {
        processor.process(node);
        node = (Map.Entry<?, ?>) nextNodeField.get(node);
      }
      processor.endBucket(numberPerBucket);
    }
    return numberPerBucket;
  }

  private static Map.Entry<?, ?>[] getTable(Map<?, ?> map)
      throws NoSuchFieldException, IllegalAccessException {
    Field tableField = map.getClass().getDeclaredField("table");
    tableField.setAccessible(true);
    return (Map.Entry<?, ?>[]) tableField.get(map);
  }

  private static Field getNextField(Object table)
      throws NoSuchFieldException {
    Class<?> nodeType = table.getClass().getComponentType();
    Field nextNodeField = nodeType.getDeclaredField("next");
    nextNodeField.setAccessible(true);
    return nextNodeField;
  }
}

A few constraints about the code above. I wanted to support Java 6, so that we can compare the evolution of java.util.HashMap through generations of Java. Thus I don't use the lovely new Streams API to process the table and I even give the diamond operator <> a miss.

When we run this on different versions of Java, we see how the distributions have changed. Prior to Java 8, a perfect hash function would mean a pretty good chance that the entries would be held in separate buckets. Now, we need to be more careful. For example, we have the famous Effective Java approach that is used in Objects.hash(), where we incrementally multiply the current value by the prime number 31, before adding the new value. This works brilliantly with Strings, but not so well in other scenarios. The reason is that most characters that we use in Java fall within a range of 26, thanks to the Latin alphabet. Thus multiplying by 31 is big enough to reduce the chance of clashes.

In order to test different approaches, I wrote a Pixel class, containing fields "x" and "y". Now please don't complain about these variable names :-) They are the x and y coordinates of a pixel. It would be hard to come up with better or more descriptive names. I also used a bunch of hash code functions that we can select for each run.

Also, you will notice that my Pixel class implements Comparable. Since Java 8 it is VERY IMPORTANT to always make your custom keys for HashMap implement Comparable. HashMap now puts your entries into a tree to avoid degradation to O(n) when the entries map to the same bucket. It somehow needs to know the relative order of the keys. If they are strings, then comparing them is trivial. Otherwise, if the key class implements Comparable it uses the compareTo() function. If not, it will eventually use System.identityHashCode() to determine the sort order. Good luck then.

Here is our Pixel class, with a bunch of hashing algorithms embedded inside:

public class Pixel implements Comparable<Pixel> {
  public enum HASH_ALGO {
    EFFECTIVE_JAVA {
      public int calc(Pixel pixel) {
        return pixel.x * 31 + pixel.y; // also in Objects.hash()
      }
    },
    BAD_BIT_SHIFT_XOR {
      public int calc(Pixel pixel) {
        // precedence rules mean that this evaluates to
        // pixel.x << (16 + pixel.y).
        return pixel.x << 16 + pixel.y;
      }
    },
    BIT_SHIFT_XOR {
      public int calc(Pixel pixel) {
        return pixel.x << 16 ^ pixel.y; // perfect hash
      }
    },
    MULTIPLY_LARGE_PRIME {
      public int calc(Pixel pixel) {
        return pixel.x * 65521 + pixel.y; // perfect hash
      }
    },
    MULTIPLY_JUST_ENOUGH {
      public int calc(Pixel pixel) {
        return pixel.x * 768 + pixel.y; // perfect hash
      }
    },
    APPENDING_STRINGS {
      public int calc(Pixel pixel) {
        // no idea if perfect hash, also creates lots of garbage
        return (pixel.x + "-" + pixel.y).hashCode();
      }
    },
    TERRIBLE {
      public int calc(Pixel pixel) {
        // always the same value, just bad bad bad
        return 0;
      }
    };

    public abstract int calc(Pixel pixel);
  }

  private static HASH_ALGO algo = HASH_ALGO.EFFECTIVE_JAVA;

  public static void setAlgo(HASH_ALGO algo) {
    Pixel.algo = algo;
  }

  private final int x;
  private final int y;

  public Pixel(int x, int y) {
    this.x = x;
    this.y = y;
  }

  public boolean equals(Object o) {
    if (this == o) return true;
    if (o == null || getClass() != o.getClass()) return false;

    Pixel pixel = (Pixel) o;

    return x == pixel.x && y == pixel.y;
  }

  public int hashCode() {
    return algo.calc(this);
  }

  public int compareTo(Pixel o) {
    int result = compare(x, o.x);
    if (result == 0) {
      result = compare(y, o.y);
    }
    return result;
  }

  private int compare(int x1, int x2) {
    return (x1 < x2) ? -1 : ((x1 == x2) ? 0 : 1);
  }
}

I don't want to be drawn into producing yet another broken microbenchmark, so instead I am just looking at the complexity of the various hashing approaches. Here is a class that prints out the various hashing distributions. For each hashing algorithm, it first prints out how many pixels have the same hashCode() value, and then the clash distribution in the buckets. If all the hashCode() values are distinct, then this is a "perfect hash function" and I print out "Perfect!!!". Similarly, if each bucket holds at most one entry, I also print out "Perfect!!!". My test assumes a 1024x768 size display.

import java.util.*;

public class MapBucketFillTest {
  public static final int WIDTH = 1024;
  public static final int HEIGHT = 768;

  public static void main(String... args) throws Exception {
    for (Pixel.HASH_ALGO algo : Pixel.HASH_ALGO.values()) {
      testWith(algo);
    }
  }

  private static void testWith(Pixel.HASH_ALGO algo)
      throws NoSuchFieldException, IllegalAccessException {
    System.out.println("Testing with " + algo);
    Pixel.setAlgo(algo);
    Map<Pixel, Integer> pixels = new HashMap<Pixel, Integer>();
    for (int x = 0; x < WIDTH; x++) {
      for (int y = 0; y < HEIGHT; y++) {
        pixels.put(new Pixel(x, y), 33);
      }
    }

    System.out.println("Hash clash print: ");
    Map<Integer, Integer> hashClashes =
        MapClashInspector.getHashClashDistribution(pixels);
    printClashes(hashClashes);

    System.out.println("Bucket entry clash print: ");
    Map<Integer, Integer> bucketClashes =
        MapClashInspector.getBucketClashDistribution(pixels);
    printClashes(bucketClashes);

    System.out.println();
  }

  private static void printClashes(
      Map<Integer, Integer> clashes) {
    if (isPerfect(clashes)) {
      System.out.println("Perfect!!!");
    }
    for (Map.Entry<Integer, Integer> e : clashes.entrySet()) {
      System.out.println(e.getKey() + ": " + e.getValue());
    }
  }

  private static boolean isPerfect(
      Map<Integer, Integer> clashes) {
    Integer n = clashes.get(1);
    return n != null && n == WIDTH * HEIGHT;
  }
}

Here are the results for Java 6 and 7. I've added some comments in bold.

java version "1.6.0_65"
Java(TM) SE Runtime Environment (build 1.6.0_65-b14-462-11M4609)
Java HotSpot(TM) 64-Bit Server VM, mixed mode
Testing with EFFECTIVE_JAVA
Hash clash print:
1: 62
2: 62
3: 62
4: 62
5: 62
6: 62
7: 62
8: 62
9: 62
10: 62
11: 62
12: 62
13: 62
14: 62
15: 62
16: 62
17: 62
18: 62
19: 62
20: 62
21: 62
22: 62
23: 62
24: 7055
25: 24000 We see that a LOT of keys have the same hash code
Bucket entry clash print:
0: 1016095
1: 62
2: 62
3: 62
4: 62
5: 62
6: 62
7: 62
8: 62
9: 62
10: 62
11: 62
12: 62
13: 62
14: 62
15: 62
16: 62
17: 62
18: 62
19: 62
20: 62
21: 62
22: 62
23: 62
24: 7055
25: 24000

Testing with BAD_BIT_SHIFT_XOR
Hash clash print:
24: 6144 This one is much worse than "Effective Java"
48: 2944
72: 1536
96: 736
120: 352
144: 168
168: 144
192: 70
216: 34
240: 23
264: 5
288: 2
312: 1
336: 1
360: 32
384: 16
408: 8
432: 4
456: 2
480: 1
504: 1
744: 16
768: 8
792: 4
816: 2
840: 1
864: 1
1512: 8
1536: 4
1560: 2
1584: 1
1608: 1
3048: 4
3072: 2
3096: 1
3120: 1
6120: 2
6144: 1
6168: 1
12264: 1
12288: 1
24552: 1
25080: 1
Bucket entry clash print:
0: 1036324
24: 6110
48: 2936
72: 1533
96: 738
120: 351
144: 172
168: 144
192: 74
216: 34
240: 23
264: 5
288: 2
312: 1
336: 1
360: 32
384: 16
408: 8
432: 4
456: 2
480: 1
504: 1
744: 16
768: 8
792: 4
816: 2
840: 1
864: 1
1512: 8
1536: 4
1560: 2
1584: 1
1608: 1
3048: 4
3072: 2
3096: 1
3120: 1
6120: 2
6144: 1
6168: 1
12264: 1
12288: 1
24552: 1
25080: 1

Testing with BIT_SHIFT_XOR
Hash clash print:
Perfect!!!
1: 786432 Fantastic
Bucket entry clash print:
Perfect!!!
0: 262144
1: 786432

Testing with MULTIPLY_LARGE_PRIME
Hash clash print:
Perfect!!!
1: 786432 Fantastic
Bucket entry clash print:
0: 436615
1: 453689 Pretty good, but not perfect
2: 142073
3: 16199

Testing with MULTIPLY_JUST_ENOUGH
Hash clash print:
Perfect!!!
1: 786432
Bucket entry clash print:
Perfect!!!
0: 262144
1: 786432 Another perfect distribution

Testing with APPENDING_STRINGS
Hash clash print:
Perfect!!!
1: 786432 Perfect, but only for that screen size.
Bucket entry clash print:
0: 494197
1: 373778 Interestingly, this is not a perfect bucket fill!
2: 138513
3: 34219
4: 6610
5: 1060
6: 168
7: 25
8: 6

Testing with TERRIBLE
This didn't finish, because it degraded to O(n) or worse

Let's have a look at Java 8. Again I will mark comments in bold. In all cases, the Hash clash will be identical, since the hash functions are the same for the various Java versions. The difference will be with the "Bucket entry clash". For brevity, I've left out the "Hash clash print" in the output.

Testing with EFFECTIVE_JAVA
Bucket entry clash print:
0: 1016095  Same as Java 7
1: 62
2: 62
3: 62
4: 62
5: 62
6: 62
7: 62
8: 62
9: 62
10: 62
11: 62
12: 62
13: 62
14: 62
15: 62
16: 62
17: 62
18: 62
19: 62
20: 62
21: 62
22: 62
23: 62
24: 7055
25: 24000

Testing with BAD_BIT_SHIFT_XOR
Bucket entry clash print: Hard to say.  Probably worse than
  Java 7, BUT since the keys are stored in trees our complexity
  is still manageable.
0: 1038431
24: 4504
48: 2781
72: 918
96: 777
120: 278
144: 258
168: 95
192: 109
216: 37
240: 115
264: 25
288: 52
312: 14
336: 23
360: 8
384: 10
408: 3
432: 5
456: 33
480: 2
504: 16
528: 1
552: 9
600: 4
648: 2
696: 1
744: 1
864: 16
912: 8
960: 4
1008: 2
1056: 1
1104: 1
1656: 8
1704: 4
1752: 2
1800: 1
1848: 1
3216: 4
3264: 2
3312: 1
3360: 1
6312: 2
6360: 1
6408: 1
12480: 1
12528: 1
24792: 1
25080: 1

Testing with BIT_SHIFT_XOR
Hash clash print:
Perfect!!!
1: 786432
Bucket entry clash print:
0: 1032192
48: 16384 Much worse.  Tests seem to indicate 3x slower.

Testing with MULTIPLY_LARGE_PRIME
Hash clash print:
Perfect!!!
1: 786432
Bucket entry clash print:
0: 794368
1: 11097
2: 30199
3: 144588 Quite a bit worse than Java 7.
4: 61031
5: 6709
6: 584

Testing with MULTIPLY_JUST_ENOUGH
Hash clash print:
Perfect!!!
1: 786432
Bucket entry clash print:
Perfect!!!
0: 262144
1: 786432 Same as Java 7.

Testing with APPENDING_STRINGS
Hash clash print:
Perfect!!!
1: 786432
Bucket entry clash print:
0: 498369
1: 365636 Similar to Java 7.
2: 141387
3: 35688
4: 6572
5: 876
6: 46
7: 2

Testing with TERRIBLE
Hash clash print:
786432: 1
Bucket entry clash print:
0: 1048575
786432: 1 Actually completed!  It didn't in Java 7.

I have a few conclusions from this inspection of the new Java 8 HashMap:

  1. Java 8 better than Java 7 when we have lots of keys with identical hash codes or that happen to be in the same bucket.
  2. Map keys should implement Comparable, otherwise the "TERRIBLE" algorithm above would also not complete my test in Java 8.
  3. Apparently the majority of HashMaps have Strings for keys, for which the Java 8 hash() function is probably good enough.
  4. Creating a perfect hash function for your class is not enough. You also need to check how well it copes with the Java 8 hash() function.

I hope you enjoyed this newsletter and that it was useful to you.

Kind regards from a windy Crete

Heinz

 

Comments

We are always happy to receive comments from our readers. Feel free to send me a comment via email or discuss the newsletter in our JavaSpecialists Slack Channel (Get an invite here)

When you load these comments, you'll be connected to Disqus. Privacy Statement.

Related Articles

Browse the Newsletter Archive

About the Author

Heinz Kabutz Java Conference Speaker

Java Champion, author of the Javaspecialists Newsletter, conference speaking regular... About Heinz

Superpack '23

Superpack '23 Our entire Java Specialists Training in one huge bundle more...

Free Java Book

Dynamic Proxies in Java Book
Java Training

We deliver relevant courses, by top Java developers to produce more resourceful and efficient programmers within their organisations.

Java Consulting

We can help make your Java application run faster and trouble-shoot concurrency and performance bugs...