Running on Java 22-ea+27-2262 (Preview)
Home of The JavaSpecialists' Newsletter

277Strings with Zero HashCode

Author: Dr Heinz M. KabutzDate: 2020-03-30Java Version: 13Category: Performance
 

Abstract: In Java 2, the computational time complexity of String#hashCode() became linear, as it used every character to calculate the hash. Java 3 cached the hash code, so that it would not need to be recalculated every time we called hashCode(), except in the case that the hash code was zero. Java 13 finally fixes that inefficiency with an additional boolean field.

 

Welcome to the 277th edition of The Java(tm) Specialists' Newsletter, written on my couch, safely tucked away on the Island of Crete. Like most of the world, we are in lock down and can only leave the house for absolute necessities. Fortunately exercise is considered important in Greece (at the moment), so I'm still doing my daily runs, but far away from crowds. How are you doing? Please let me know. I promise I will personally respond to each and every email that I get.

javaspecialists.teachable.com: Please visit our new self-study course catalog to see how you can upskill your Java knowledge.

Strings with Zero HashCode

In the olden days, when movies were still black and white, and we shook hands without a care in the world, Java was invented with a constant time hashCode() implementation. Yes, it did not matter how long the string was, it was always calculated in at most 16 steps. Since hashCode() was only using a sample of the characters in the string, similar strings had a good chance of having a clashing hashCode().

In Java 2, they changed this to use every character, and so instead of constant time, it became linear. The longer the string, the longer it took to calculate the hash code. If we calculated the hash code of one string many times, we would each time have to iterate over the characters and generate the hash code.

In Java 3, they optimized this by caching the hash code the first time that we called hashCode(). Since java.lang.String does not have any synchronization, this causes a benign data race. Two threads calling hashCode() at the same time would potentially overwrite each other's calculation result. It is benign, because the result is idempotent. No matter in which order the two threads write, since the results are exactly the same, and the hash field is an int, and writing and reading 32-bit values is guaranteed to be atomic, String is still thread-safe despite the race.

Here is the hashCode() method from Java 3:

public int hashCode() {
  int h = hash;
  if (h == 0) {
    int off = offset;
    char val[] = value;
    int len = count;

    for (int i = 0; i < len; i++)
      h = 31*h + val[off++];
    hash = h;
  }
  return h;
}

The only time where the cached hash code did not help us was when the hashCode() evaluated to 0. In that case, we would always need to recalculate the hash code, regardless of how often it was called.

The algorithm used to calculate the hash is 31*h + c, where h is the running total and c the current character. Thus, if a string has a hash code of 0, and we append the string to itself, it will also have a hash code of 0. For example:

public class StringHashZero {
  public static void main(String... args) {
    String s = "ARcZguv";
    System.out.println(s.hashCode());
    System.out.println((s + s).hashCode());
    System.out.println((s + s + s).hashCode());
    System.out.println((s + s + s + s).hashCode());
  }
}

The output is four 0's:

0  	
0  	
0  	
0  	

It should be obvious that we can easily generate arbitrarily long strings that have the hash code of 0.

However, the way that java.lang.String determined whether we had already calculated the hash code, was to see whether the cached value was not equal to zero. If it was zero, it was assumed to have not been calculated. Thus if we have strings with hash code of 0, the caching does not work and the time complexity stays linear, regardless of how often we call hashCode().

In Java 13, they improved the calculation by adding an additional boolean hashIsZero, defaulting to false. This makes cached hash codes also effective when the hash code value turns out to be zero.

To demonstrate the difference, here is a little benchmark that creates strings of various sizes from 0 up to 700_000, always with hash code of zero.

package eu.javaspecialists.tjsn.examples.issue277;

import org.openjdk.jmh.annotations.*;

import java.util.concurrent.*;

@Fork(value = 3)
@Warmup(iterations = 5, time = 3)
@Measurement(iterations = 10, time = 3)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class HashCodeZeroBenchmark {
  private static final String ZERO_HASH = "ARcZguv";
  @Param({"-1", "0", "1", "2", "3", "4", "5"})
  private int powerOfTen;
  private String s;

  @Setup
  public void setup() {
    int chunks = (int) Math.pow(10, powerOfTen);
    StringBuilder sb = new StringBuilder(
        ZERO_HASH.length() * chunks
    );
    for (int i = 0; i < chunks; i++) {
      sb.append(ZERO_HASH);
    }
    s = sb.toString();
  }

  @Benchmark
  public int hashCodeString() {
    return s.hashCode();
  }
}

I ran this on my MacBook Pro 2.9 GHz 6-Core Intel Core i9 with JMH 1.23. In this newsletter I will show results for Zulu OpenJDK Java 11.0.6 and Oracle OpenJDK 13.0.2. The change occurred in Java 13, so any version after that will also have the performance boost. Versions from Java 3 through 12 will have the slower, linear calculation. Also, I show results to 2 significant digits. The actual numbers are not as important as the computational time complexity.

Results for Zulu OpenJDK 11.0.6:

String length ns/op
0 2.4
7 6.8
70 53
700 510
7_000 5_200
70_000 52_000
700_000 520_000

Results for Oracle OpenJDK 13.0.2:

String length ns/op
0 2.1
7 2.1
70 2.1
700 2.1
7_000 2.2
70_000 2.2
700_000 2.2

As we can see, the time to call hashCode() repeatedly on the same string is no longer dependent on the length of the string.

Kind regards from Crete

Heinz

P.S. Please let me know how you are doing :-) Stay safe.

 

Comments

We are always happy to receive comments from our readers. Feel free to send me a comment via email or discuss the newsletter in our JavaSpecialists Slack Channel (Get an invite here)

When you load these comments, you'll be connected to Disqus. Privacy Statement.

Related Articles

Browse the Newsletter Archive

About the Author

Heinz Kabutz Java Conference Speaker

Java Champion, author of the Javaspecialists Newsletter, conference speaking regular... About Heinz

Superpack '23

Superpack '23 Our entire Java Specialists Training in one huge bundle more...

Free Java Book

Dynamic Proxies in Java Book
Java Training

We deliver relevant courses, by top Java developers to produce more resourceful and efficient programmers within their organisations.

Java Consulting

We can help make your Java application run faster and trouble-shoot concurrency and performance bugs...