Abstract: In Java 2, the computational time complexity of String#hashCode() became linear, because it used every character of the string to calculate the hash. Java 3 cached the hash code, so that it did not need to be recalculated on every call to hashCode(), except when the hash code was zero. Java 13 finally fixes that inefficiency with an additional boolean field.
Welcome to the 277th edition of The Java(tm) Specialists' Newsletter, written on my couch, safely tucked away on the Island of Crete. Like most of the world, we are in lockdown and can only leave the house for absolute necessities. Fortunately exercise is considered important in Greece (at the moment), so I'm still doing my daily runs, but far away from crowds. How are you doing? Please let me know. I promise I will personally respond to each and every email that I get.
javaspecialists.teachable.com: Please visit our new self-study course catalog to see how you can upskill your Java knowledge.
In the olden days, when movies were still black and white, and we shook hands without a care in the world, Java was invented with a constant-time hashCode() implementation. Yes, it did not matter how long the string was: the hash was always calculated in at most 16 steps. Since hashCode() used only a sample of the characters in the string, similar strings had a good chance of having a clashing hashCode().
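For the curious, the original method looked roughly like this (a sketch reconstructed from the old JDK sources, so the exact constants may differ; offset, value and count were String's internal fields at the time):

    public int hashCode() {
        int h = 0;
        int off = offset;
        char val[] = value;
        int len = count;

        if (len < 16) {
            // short strings: hash every character
            for (int i = len; i > 0; i--) {
                h = (h * 37) + val[off++];
            }
        } else {
            // long strings: sample only roughly every eighth character
            int skip = len / 8;
            for (int i = len; i > 0; i -= skip, off += skip) {
                h = (h * 39) + val[off];
            }
        }
        return h;
    }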
In Java 2, they changed this to use every character, and so instead of constant time, the calculation became linear. The longer the string, the longer it took to calculate the hash code. If we calculated the hash code of the same string many times, we had to iterate over all the characters and regenerate the hash code each time.
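As a sketch (again with the internal String fields of that era), the Java 2 method was essentially the same polynomial hash that we still use today, but recalculated on every call:

    public int hashCode() {
        int h = 0;
        int off = offset;
        char val[] = value;
        int len = count;

        // every call walks over every single character
        for (int i = 0; i < len; i++)
            h = 31*h + val[off++];
        return h;
    }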
In Java 3, they optimized this by caching the hash code the first time that we called hashCode(). Since java.lang.String does not use any synchronization, this causes a benign data race: two threads calling hashCode() at the same time could overwrite each other's calculation result. The race is benign because the calculation is idempotent. No matter in which order the two threads write, the results are exactly the same, and since the hash field is an int, and reads and writes of 32-bit values are guaranteed to be atomic, String is still thread-safe despite the race.
Here is the hashCode() method from Java 3:
    public int hashCode() {
        int h = hash;
        if (h == 0) {
            int off = offset;
            char val[] = value;
            int len = count;

            for (int i = 0; i < len; i++)
                h = 31*h + val[off++];
            hash = h;
        }
        return h;
    }
The only time the cached hash code did not help us was when the hash evaluated to 0. In that case, we had to recalculate the hash code every time, regardless of how often hashCode() was called.
The algorithm used to calculate the hash is 31*h + c, where h is the running total and c the current character. A useful consequence is that hash(s + t) == hash(s) * 31^length(t) + hash(t). Thus, if a string has a hash code of 0 and we append the string to itself, the result also has a hash code of 0, since 0 * 31^length(s) + 0 == 0. For example:
    public class StringHashZero {
        public static void main(String... args) {
            String s = "ARcZguv";
            System.out.println(s.hashCode());
            System.out.println((s + s).hashCode());
            System.out.println((s + s + s).hashCode());
            System.out.println((s + s + s + s).hashCode());
        }
    }
The output is four 0's:
    0
    0
    0
    0
It should be obvious that we can easily generate arbitrarily long strings with a hash code of 0.
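Here is a minimal sketch that builds zero-hash strings of ever-increasing length by repeating the "ARcZguv" base string, using String.repeat(int), which has been available since Java 11:

    public class ArbitraryLengthZeroHash {
        private static final String ZERO_HASH = "ARcZguv"; // hashCode() == 0

        public static void main(String... args) {
            // lengths 7, 70, 700, ..., 700_000 - every one of them hashes to 0
            for (int chunks = 1; chunks <= 100_000; chunks *= 10) {
                String s = ZERO_HASH.repeat(chunks);
                System.out.println(s.length() + " chars -> hashCode() = "
                        + s.hashCode());
            }
        }
    }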
However, the way that java.lang.String determined whether we had already calculated the hash code was to check whether the cached value was non-zero. If it was zero, the hash was assumed not to have been calculated yet. Thus, for strings with a hash code of 0, the caching does not work and the time complexity stays linear, regardless of how often we call hashCode().
In Java 13, they improved the calculation by adding an additional boolean field hashIsZero, defaulting to false. This makes the cached hash code effective even when the hash value turns out to be zero.
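The resulting method in Java 13 looks roughly like this (lightly paraphrased from the OpenJDK sources; StringLatin1 and StringUTF16 are the internal helpers that iterate over the characters of the compact string):

    public int hashCode() {
        int h = hash;
        if (h == 0 && !hashIsZero) {
            h = isLatin1() ? StringLatin1.hashCode(value)
                           : StringUTF16.hashCode(value);
            if (h == 0) {
                // remember that the hash really is zero, so that we
                // never have to walk over the characters again
                hashIsZero = true;
            } else {
                hash = h;
            }
        }
        return h;
    }

Note that for a given String instance only one of the two fields is ever written: either hash (when the result is non-zero) or hashIsZero (when it is zero). That keeps the data race benign without any explicit synchronization.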
To demonstrate the difference, here is a little benchmark that creates strings of various lengths, from 0 up to 700_000 characters, all with a hash code of zero.
    package eu.javaspecialists.tjsn.examples.issue277;

    import org.openjdk.jmh.annotations.*;

    import java.util.concurrent.*;

    @Fork(value = 3)
    @Warmup(iterations = 5, time = 3)
    @Measurement(iterations = 10, time = 3)
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @State(Scope.Benchmark)
    public class HashCodeZeroBenchmark {
        private static final String ZERO_HASH = "ARcZguv";

        @Param({"-1", "0", "1", "2", "3", "4", "5"})
        private int powerOfTen;
        private String s;

        @Setup
        public void setup() {
            int chunks = (int) Math.pow(10, powerOfTen);
            StringBuilder sb = new StringBuilder(
                    ZERO_HASH.length() * chunks
            );
            for (int i = 0; i < chunks; i++) {
                sb.append(ZERO_HASH);
            }
            s = sb.toString();
        }

        @Benchmark
        public int hashCodeString() {
            return s.hashCode();
        }
    }
I ran this on my MacBook Pro 2.9 GHz 6-Core Intel Core i9 with JMH 1.23. In this newsletter I will show results for Zulu OpenJDK Java 11.0.6 and Oracle OpenJDK 13.0.2. The change occurred in Java 13, so any version after that will also have the performance boost. Versions from Java 3 through 12 will have the slower, linear calculation. Also, I show results to 2 significant digits. The actual numbers are not as important as the computational time complexity.
Results for Zulu OpenJDK 11.0.6:
String length | ns/op |
---|---|
0 | 2.4 |
7 | 6.8 |
70 | 53 |
700 | 510 |
7_000 | 5_200 |
70_000 | 52_000 |
700_000 | 520_000 |
Results for Oracle OpenJDK 13.0.2:
String length | ns/op |
---|---|
0 | 2.1 |
7 | 2.1 |
70 | 2.1 |
700 | 2.1 |
7_000 | 2.2 |
70_000 | 2.2 |
700_000 | 2.2 |
As we can see, with Java 13 the time to call hashCode() repeatedly on the same string no longer depends on the length of the string, even when the hash code is zero.
Kind regards from Crete
Heinz
P.S. Please let me know how you are doing :-) Stay safe.
We are always happy to receive comments from our readers. Feel free to send me a comment via email or discuss the newsletter in our JavaSpecialists Slack Channel (Get an invite here)
We deliver relevant courses, by top Java developers to produce more resourceful and efficient programmers within their organisations.