Abstract: Heavily contended concurrent constructs may benefit from a small pause after a CAS failure, but the same pause can hurt performance when we do more work in between each update attempt. In this follow-up newsletter, we show how adding CPU consumption between updates changes the performance characteristics.
I do not usually send a follow-up newsletter, but this one is too important to wait. In my last newsletter, I showed how parking on CAS failure could improve performance in heavily contended CAS code. I also pointed out, twice, that one would need to carefully tune and choose the delay according to the specific performance characteristics of one's system. But the message still did not come through, so I decided to send this follow-up to my previous newsletter.
Martin Thompson kindly pointed out that instead of using the park function, we really should use the "x86 PAUSE instruction inside spin loops to avoid speculative execution that cause latency in such tests and reduce the waste of CPU resource when spinning." Unfortunately this instruction is not directly available from Java. Maybe in future we will have a Thread.pause(). Thread.yield() is not the same, as its behaviour is OS dependent. Here is the C++ code from HotSpot's jvm.cpp:
JVM_ENTRY(void, JVM_Yield(JNIEnv *env, jclass threadClass))
  JVMWrapper("JVM_Yield");
  if (os::dont_yield()) return;
#ifndef USDT2
  HS_DTRACE_PROBE0(hotspot, thread__yield);
#else /* USDT2 */
  HOTSPOT_THREAD_YIELD();
#endif /* USDT2 */
  // When ConvertYieldToSleep is off (default), this matches the
  // classic VM use of yield. Critical for similar threading behaviour
  if (ConvertYieldToSleep) {
    os::sleep(thread, MinSleepInterval, false);
  } else {
    os::yield();
  }
JVM_END
The flag -XX:+DontYieldALot effectively turns Thread.yield() into a no-op on all the platforms I looked at.
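As an aside, Java 9 later added exactly such a hint with Thread.onSpinWait() (JEP 285), which the JIT can compile down to the x86 PAUSE instruction. Here is a minimal sketch of a CAS retry loop using it; the class name and iteration counts are mine, purely for illustration:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class SpinWaitDemo {
    static final AtomicInteger counter = new AtomicInteger();

    // CAS retry loop that hints to the CPU that we are busy-waiting.
    // On x86, HotSpot can emit the PAUSE instruction for this hint.
    static int incrementWithSpinHint() {
        int current;
        do {
            current = counter.get();
            Thread.onSpinWait(); // spin-wait hint, Java 9+
        } while (!counter.compareAndSet(current, current + 1));
        return current + 1;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < 1000; j++) incrementWithSpinHint();
            });
            threads[i].start();
        }
        for (Thread t : threads) t.join();
        System.out.println(counter.get()); // 4000
    }
}
```

Unlike Thread.yield(), this hint never gives up the time slice, so its cost is predictable across operating systems.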
In order to demonstrate the danger of having a delay that is too long in a less contended lock, I modified the benchmark to add some busy CPU work in between each of the update methods, using the JMH BlackHole.consumeCPU() method. Here is my new benchmark class. The larger the TOKENS system property, the more CPU work is done:
package eu.javaspecialists.microbenchmarks.numberrange;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.logic.*;

import static java.util.concurrent.TimeUnit.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(NANOSECONDS)
@Warmup(iterations = 5, time = 1, timeUnit = SECONDS)
@Measurement(iterations = 10, time = 1, timeUnit = SECONDS)
@Fork(5)
@State(Scope.Benchmark)
@Threads(8)
public class Microbenchmark0 {
  public static final int TOKENS = Integer.getInteger("TOKENS");

  private final NumberRangeSynchronized nrSync =
      new NumberRangeSynchronized();
  private final NumberRangeAtomic nrAtomic =
      new NumberRangeAtomic();
  private final NumberRangeAtomicWithPark nrAtomicPark =
      new NumberRangeAtomicWithPark();

  @GenerateMicroBenchmark
  public void test_0_1_synchronized() {
    nrSync.setUpper(100);
    BlackHole.consumeCPU(TOKENS);
    nrSync.setLower(10);
    BlackHole.consumeCPU(TOKENS);
    nrSync.setLower(50);
    BlackHole.consumeCPU(TOKENS);
    nrSync.setUpper(200);
    BlackHole.consumeCPU(TOKENS);
    nrSync.setLower(30);
    BlackHole.consumeCPU(TOKENS);
  }

  @GenerateMicroBenchmark
  public void test_0_2_atomic() {
    nrAtomic.setUpper(100);
    BlackHole.consumeCPU(TOKENS);
    nrAtomic.setLower(10);
    BlackHole.consumeCPU(TOKENS);
    nrAtomic.setLower(50);
    BlackHole.consumeCPU(TOKENS);
    nrAtomic.setUpper(200);
    BlackHole.consumeCPU(TOKENS);
    nrAtomic.setLower(30);
    BlackHole.consumeCPU(TOKENS);
  }

  @GenerateMicroBenchmark
  public void test_0_3_atomic_park() {
    nrAtomicPark.setUpper(100);
    BlackHole.consumeCPU(TOKENS);
    nrAtomicPark.setLower(10);
    BlackHole.consumeCPU(TOKENS);
    nrAtomicPark.setLower(50);
    BlackHole.consumeCPU(TOKENS);
    nrAtomicPark.setUpper(200);
    BlackHole.consumeCPU(TOKENS);
    nrAtomicPark.setLower(30);
    BlackHole.consumeCPU(TOKENS);
  }
}
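Note that Integer.getInteger("TOKENS") reads a system property, so the token count is passed on the command line as -DTOKENS=n, not via an environment variable. Also beware that it returns null when the property is absent, so unboxing into the int field would throw a NullPointerException. A tiny standalone demonstration (the class name is mine):

```java
public class TokensDemo {
    public static void main(String[] args) {
        // Normally set on the command line with -DTOKENS=40;
        // here we set the system property programmatically instead.
        System.setProperty("TOKENS", "40");
        int tokens = Integer.getInteger("TOKENS");
        System.out.println(tokens); // 40
    }
}
```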
In order to show the data without too much clutter, I have left out the error values, but all the raw data is available on request. These results use a backoff delay of 180 microseconds:
-DTOKENS=       0    10    20    40    80   160    320    640   1280   2560
synchronized 1623  2626  3444  5278  5924  6182   5429   8146  15067  29407
atomic       4229  3412  3509  3295  2733  2053   3663   7160  13947  27854
atomic_park   485   844  1498  3507  6504  9917  14317  18814  27844  38242
The atomic parking version is faster when the update method is very heavily contended. However, as the contention lessens, the parking starts to dominate the cost. In some cases the parking version is about 4x slower than the plain CAS version.
I also ran the same experiment with fewer than 8 threads. Here are the results with 4 threads on my i7 machine:
-DTOKENS=       0    10    20    40    80   160   320   640   1280   2560
synchronized  804  1227  1737  2593  2706  2333  3517  6775  13233  26155
atomic       1604  1515  1484  1190  1013  1749  3314  6490  12878  25539
atomic_park   282   476   753  1857  2260  2060  5457  7307  13847  27159
And here are the results with 2 threads on my i7 machine:
-DTOKENS=       0    10    20    40    80   160   320   640   1280   2560
synchronized  390   605   555  1027  1011  1797  3346  6479  12617  24929
atomic        615   583   692   472   870  1651  3162  6268  12353  24670
atomic_park   126   197   338   697  1093  2202  3531  7303  12723  25001
Running the same experiment with 8 threads on my 2-4-1 server (total of 8 cores) gave the following results:
-DTOKENS=       0    10    20    40    80   160   320    640   1280   2560
synchronized 6346  8019  9027  7760  7202  8517  6562  12632  24941  49250
atomic       1430  1129  1147  1237  1589  2956  5758  11238  22410  44647
atomic_park   795  2007  3760  6229  7140  7015  7565  13423  25214  49557
Thus the technique of pausing for a short while after a CAS failure only helps when the update code is very heavily contended; otherwise it is counterproductive.
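One way to get the best of both worlds is to spin first and park only after repeated CAS failures, so that lightly contended updates never pay the parking cost. Here is a hypothetical sketch of the idea; the class, the failure threshold of 10 and the 1-10 microsecond backoff window are illustrative assumptions, not tuned values:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.LockSupport;

public class AdaptiveBackoff {
    private final AtomicLong value = new AtomicLong();

    // Raise the value monotonically; spin on CAS failure and only
    // start parking once contention is demonstrably heavy.
    public void setIfGreater(long newValue) {
        int failures = 0;
        long current;
        while ((current = value.get()) < newValue
                && !value.compareAndSet(current, newValue)) {
            if (++failures > 10) {
                // randomized, bounded backoff under heavy contention
                LockSupport.parkNanos(
                    ThreadLocalRandom.current().nextLong(1_000, 10_000));
            }
        }
    }

    public long get() {
        return value.get();
    }

    public static void main(String[] args) {
        AdaptiveBackoff b = new AdaptiveBackoff();
        b.setIfGreater(100);
        b.setIfGreater(50); // ignored, not greater than current value
        System.out.println(b.get()); // 100
    }
}
```

The randomized delay also helps to avoid the lock-step retries that a fixed backoff can produce. As always, the threshold and window would need to be tuned against measurements on your own system.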
As with most benchmarks like this, we see many interesting effects caused by cache misses. Please don't add parks to all your classes and expect your system to magically become 10x faster. As always, we want to tune our identified and proven bottlenecks. If you'd like to know more about how to do that, please have a look at our Java Specialist Master Course.
Kind regards from Crete
Heinz
We are always happy to receive comments from our readers. Feel free to send me a comment via email or discuss the newsletter in our JavaSpecialists Slack Channel (Get an invite here)
We deliver relevant courses, taught by top Java developers, to produce more resourceful and efficient programmers within their organisations.