Abstract: Java's Thread.yield() traditionally would cause a voluntary context switch. However, with virtual threads, it causes the thread to be unmounted from its carrier thread. This can greatly increase the user and system time for ALL the carrier threads. And there is no way to turn it off. Keep an eye out for yield in your thread dumps.
Welcome to the 320th edition of The Java(tm) Specialists' Newsletter. Today is 24 years since I sent out my first Java newsletter and some of my original readers are still on our list. I often meet Java programmers at conferences who have been reading my prose for decades. It's truly quite humbling. I believe that ours is the oldest Java newsletter in the world. The second oldest is Jack Shirazi's Java Performance Tuning. He started three weeks after me on the 20th December 2000.
Thread.yield() has never been well defined. Here is the Javadoc for yield() in Java 24:
    /**
     * A hint to the scheduler that the current thread is willing to yield
     * its current use of a processor. The scheduler is free to ignore this
     * hint.
     *
     * Yield is a heuristic attempt to improve relative progression
     * between threads that would otherwise over-utilise a CPU. Its use
     * should be combined with detailed profiling and benchmarking to
     * ensure that it actually has the desired effect.
     *
     * It is rarely appropriate to use this method. It may be useful
     * for debugging or testing purposes, where it may help to reproduce
     * bugs due to race conditions. It may also be useful when designing
     * concurrency control constructs such as the ones in the
     * {@link java.util.concurrent.locks} package.
     */
Java 1.0 made slightly stronger promises:
    /**
     * Causes the currently executing Thread object to yield.
     * If there are other runnable Threads they will be
     * scheduled next.
     */
Most of the time, a Thread.yield() will cause a voluntary context switch. The thread will thus give up the rest of its time quantum and allow another thread to be scheduled. Unnecessary yields can negatively impact our performance by driving up system time (that's usually the red on the CPU graphs).
Since Thread.yield() can hurt performance this much, the JVM has had a flag to turn it off: -XX:+DontYieldALot turns the yield into a no-op.
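Whether a particular JVM still knows this flag can be checked at runtime. Here is a minimal sketch, assuming a HotSpot-based JVM that exposes the com.sun.management.HotSpotDiagnosticMXBean. The class name FlagCheck and the helper flagValue() are my own names for illustration. On a JVM without the flag, I would expect the lookup to report it as not available, but please verify that on your own version:

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;
import java.util.Optional;

public class FlagCheck {
  /**
   * Returns the current value of a HotSpot flag, or empty if this
   * JVM does not know the flag (or is not a HotSpot JVM).
   */
  public static Optional<String> flagValue(String name) {
    try {
      HotSpotDiagnosticMXBean bean = ManagementFactory
          .getPlatformMXBean(HotSpotDiagnosticMXBean.class);
      if (bean == null) return Optional.empty();
      return Optional.of(bean.getVMOption(name).getValue());
    } catch (IllegalArgumentException e) {
      // thrown when the named VM option does not exist
      return Optional.empty();
    }
  }

  public static void main(String... args) {
    System.out.println("DontYieldALot = " +
        flagValue("DontYieldALot")
            .orElse("not available in this JVM"));
  }
}
```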
The ThreadMXBean allows us to measure the CPU and user time of a thread. The difference between these values is the system time. Threads that yield a lot would typically have high system time. For example:
    import java.lang.management.*;
    import java.util.concurrent.*;
    import java.util.concurrent.atomic.*;

    public class YieldALot {
      public static void main(String... args) {
        var timer = Executors.newSingleThreadScheduledExecutor();
        var running = new AtomicBoolean(true);
        timer.schedule(() -> running.set(false), 1, TimeUnit.SECONDS);
        var tmb = ManagementFactory.getThreadMXBean();
        var cpu = tmb.getCurrentThreadCpuTime();
        var usr = tmb.getCurrentThreadUserTime();
        var counter = 0;
        while (running.get() && ++counter > 0) Thread.yield();
        cpu = tmb.getCurrentThreadCpuTime() - cpu;
        usr = tmb.getCurrentThreadUserTime() - usr;
        System.out.printf("CPU time = %,d%n", cpu);
        System.out.printf("User time = %,d%n", usr);
        System.out.printf("System time = %,d%n", cpu - usr);
        System.out.printf("counter = %,d%n", counter);
        timer.shutdown();
      }
    }
Running this with Java 21 on my MacBook Pro M2 Max, I get the following output, with over 30% of cpu time spent in system time doing voluntary context switches:
    CPU time = 986,448,000
    User time = 679,551,000
    System time = 306,897,000
    counter = 7,284,820
If we run the exact same code with the -XX:+DontYieldALot JVM flag, we see our system time going down to about 0.6% of CPU time and our counter incrementing 35x more:
    CPU time = 1,001,147,000
    User time = 995,077,000
    System time = 6,070,000
    counter = 251,681,803
This is a crazy difference, which is also why we have this flag. Perhaps a coder used Thread.yield() in a vain attempt to solve a starvation issue in their code, and thus greatly reduced throughput. We can easily spot the yield() in our thread dumps, plus the system time would be a clue that we have an issue.
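To double-check the percentages for these two runs, we can redo the arithmetic from the printed figures. This is just a small sketch; the class SystemTimeShare and its method systemPercent() are names I made up for illustration, and the numbers are copied from the two outputs above:

```java
public class SystemTimeShare {
  /** System time as a percentage of total CPU time. */
  public static double systemPercent(long cpuNanos, long userNanos) {
    return 100.0 * (cpuNanos - userNanos) / cpuNanos;
  }

  public static void main(String... args) {
    // Figures copied from the run with yielding
    System.out.printf("yielding:      %.1f%% system time%n",
        systemPercent(986_448_000L, 679_551_000L));
    // Figures copied from the run with -XX:+DontYieldALot
    System.out.printf("DontYieldALot: %.1f%% system time%n",
        systemPercent(1_001_147_000L, 995_077_000L));
    // How many more loop iterations we managed without yielding
    System.out.printf("counter ratio: %.1fx%n",
        251_681_803.0 / 7_284_820);
  }
}
```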
You might be interested to hear that the DontYieldALot flag was removed in Java 24.

Yielding works a bit differently with virtual threads. The state of a mounted, executing virtual thread is RUNNING. If we call Thread.yield(), it goes into the YIELDING state. From there it would typically go into the YIELDED state, although it is possible that the thread goes back into RUNNING if the yield failed. In the YIELDED state, the virtual thread is unmounted. The next time a carrier thread becomes available, our virtual thread will go from YIELDED back to RUNNING.
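The transitions just described can be written down as a tiny state machine. Please note that this is only a simplified model for illustration: the class YieldStateModel and its enum are hypothetical and are not how the JDK represents virtual thread states internally.

```java
import java.util.EnumSet;
import java.util.Set;

public class YieldStateModel {
  /** Simplified model; not the JDK's internal representation. */
  public enum State {
    RUNNING, YIELDING, YIELDED;

    /** States reachable from this one, per the description above. */
    public Set<State> next() {
      switch (this) {
        case RUNNING:   // Thread.yield() was called
          return EnumSet.of(YIELDING);
        case YIELDING:  // unmounted, or the yield failed
          return EnumSet.of(YIELDED, RUNNING);
        case YIELDED:   // a carrier thread became available
          return EnumSet.of(RUNNING);
        default:
          throw new AssertionError(this);
      }
    }
  }

  public static boolean canMove(State from, State to) {
    return from.next().contains(to);
  }

  public static void main(String... args) {
    for (State from : State.values())
      System.out.println(from + " -> " + from.next());
  }
}
```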
A virtual thread that is yielding would thus flip between different carrier threads, as this code illustrates:
    import java.util.concurrent.*;

    public class YieldingInVirtualThreads {
      public static void main(String... args)
          throws InterruptedException {
        var threads = new ConcurrentSkipListSet<>();
        Thread.startVirtualThread(() -> {
          for (int i = 0; i < 1_000_000; i++) {
            threads.add("" + Thread.currentThread());
            Thread.yield();
          }
        }).join();
        threads.forEach(System.out::println);
      }
    }
My laptop has 12 cores, thus the JVM will automatically create 12 carrier threads. We see that our one virtual thread flips between all the carrier threads:
    VirtualThread[#27]/runnable@ForkJoinPool-1-worker-1
    VirtualThread[#27]/runnable@ForkJoinPool-1-worker-10
    VirtualThread[#27]/runnable@ForkJoinPool-1-worker-11
    VirtualThread[#27]/runnable@ForkJoinPool-1-worker-12
    VirtualThread[#27]/runnable@ForkJoinPool-1-worker-2
    VirtualThread[#27]/runnable@ForkJoinPool-1-worker-3
    VirtualThread[#27]/runnable@ForkJoinPool-1-worker-4
    VirtualThread[#27]/runnable@ForkJoinPool-1-worker-5
    VirtualThread[#27]/runnable@ForkJoinPool-1-worker-6
    VirtualThread[#27]/runnable@ForkJoinPool-1-worker-7
    VirtualThread[#27]/runnable@ForkJoinPool-1-worker-8
    VirtualThread[#27]/runnable@ForkJoinPool-1-worker-9
I have taught my Mastering Virtual Threads in Java Course live a number of times. I noticed during the last run that a single yielding virtual thread actually used far more CPU time than I had expected. For example, when I run my code above with the unix utility "time", I get the following results:
    real 0m2.479s
    user 0m5.065s
    sys 0m8.777s
Yikes! The program runs for 2.5 seconds of elapsed (real) time, and during that time it burns up almost 14 seconds of CPU time. With 12 cores, that is almost half of the available CPU cycles, spent managing the yielding of that one virtual thread.
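The "almost half" comes from dividing the CPU seconds by the core-seconds that were available during the run. A quick sketch of that calculation, assuming the 12 cores of my laptop; the class CpuBudget and the method utilisation() are names I invented for this example:

```java
public class CpuBudget {
  /**
   * Fraction of the machine's CPU capacity consumed during the
   * run: (user + sys) / (real * cores).
   */
  public static double utilisation(double realSecs, double userSecs,
                                   double sysSecs, int cores) {
    return (userSecs + sysSecs) / (realSecs * cores);
  }

  public static void main(String... args) {
    // Figures copied from the "time" output above
    double real = 2.479, user = 5.065, sys = 8.777;
    int cores = 12; // assumption: the 12-core laptop used here
    System.out.printf("CPU seconds used  = %.3f%n", user + sys);
    System.out.printf("core-seconds had  = %.3f%n", real * cores);
    System.out.printf("utilisation       = %.3f%n",
        utilisation(real, user, sys, cores));
  }
}
```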
In order to measure this more rigorously, I wrote this TestBench class. It automatically finds the carrier threads and measures the CPU and user time once a second. The TestBench can be run with platform or virtual threads and we can turn yielding on and off. Each runExperiment() lasts about 10.3 seconds. Here is the code:
    import java.lang.management.*;
    import java.util.*;
    import java.util.concurrent.*;
    import java.util.concurrent.atomic.*;

    public class TestBench {
      private static final String BENCH_THREAD_NAME = "TestBench-";
      private final Thread.Builder builder;
      private final int numberOfThreads;
      private final boolean yielding;
      private final LongAdder progress = new LongAdder();
      private final LongAdder allEvenNumbers = new LongAdder();
      private final AtomicBoolean running = new AtomicBoolean(true);

      public TestBench(Thread.Builder builder,
                       int numberOfThreads,
                       boolean yielding) {
        this.builder = builder;
        this.numberOfThreads = numberOfThreads;
        this.yielding = yielding;
      }

      /**
       * We attempt to count all the even positive numbers.
       * However, since we increment by 2, we overflow over
       * Integer.MAX_VALUE and thus this loop will never end.
       */
      private void countEvenNumbers() {
        var evenNumbers = 0L;
        for (int i = 0; i < Integer.MAX_VALUE && running.get();
             i += 2) {
          evenNumbers++;
          if (yielding) Thread.yield();
          if (i % 100_000 == 99_998) progress.increment();
        }
        allEvenNumbers.add(evenNumbers);
      }

      /**
       * We launch all our test threads,
       */
      public void runExperiment() {
        System.out.println("Starting experiment with builder " +
            builder.getClass().getSimpleName() + ", " +
            numberOfThreads + " thread" +
            (numberOfThreads == 1 ? "" : "s") + " and " +
            (yielding ? "" : "no ") + "yielding");
        var timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleAtFixedRate(this::printCPUTimes,
            1, 1, TimeUnit.SECONDS);
        timer.schedule(() -> {
          running.set(false);
          timer.shutdownNow();
        }, 10300, TimeUnit.MILLISECONDS);

        var threads = new Thread[numberOfThreads];
        for (int i = 0; i < threads.length; i++) {
          threads[i] = builder.name(BENCH_THREAD_NAME + i)
              .start(this::countEvenNumbers);
          System.out.println("Launched thread " + threads[i]);
        }

        try {
          builder.start(() ->
              System.out.println("Launched all threads")
          ).join();
          for (var thread : threads) thread.join();
        } catch (InterruptedException e) {
          throw new CancellationException("interrupted");
        }
        System.out.printf("allEvenNumbers = %,d%n",
            allEvenNumbers.longValue());
      }

      private static final ThreadMXBean tmb =
          ManagementFactory.getThreadMXBean();

      private final String carrierThreadPattern =
          discoverCarrierThreadPattern();

      private String discoverCarrierThreadPattern() {
        var threadString = CompletableFuture.supplyAsync(
                () -> Thread.currentThread().toString(),
                Executors.newVirtualThreadPerTaskExecutor())
            .join();
        var lastAtIndex = threadString.lastIndexOf('@');
        if (lastAtIndex < 0) throw new IllegalStateException();
        var lastMinusIndex = threadString.lastIndexOf('-');
        if (lastMinusIndex < 0) throw new IllegalStateException();
        return threadString.substring(
            lastAtIndex + 1, lastMinusIndex + 1);
      }

      private final Map<Long, Time> history =
          new ConcurrentHashMap<>();

      private record Time(long cpu, long usr) {
      }

      private static final Time ZERO_TIME = new Time(0, 0);

      private void printCPUTimes() {
        // We are going to search for platform threads that are
        // either carrier threads or raw TestBench-* threads.
        long totalCpuTime = 0, totalUserTime = 0, totalSysTime = 0;
        for (var tid : tmb.getAllThreadIds()) {
          var info = tmb.getThreadInfo(tid);
          if (isWatched(info)) {
            var prev = history.getOrDefault(tid, ZERO_TIME);
            var curr = new Time(
                tmb.getThreadCpuTime(tid),
                tmb.getThreadUserTime(tid)
            );
            history.put(tid, curr);
            var cpuTime = toMs(curr.cpu() - prev.cpu());
            if (cpuTime > 0) {
              var userTime = toMs(curr.usr() - prev.usr());
              var sysTime = cpuTime - userTime;
              totalCpuTime += cpuTime;
              totalUserTime += userTime;
              totalSysTime += sysTime;
              System.out.printf("%s\tcpu=%d\tusr=%d\tsys=%d%n",
                  info.getThreadName(), cpuTime, userTime,
                  sysTime);
            }
          }
        }
        System.out.printf(
            "Total\tcpu=%d\tusr=%d\tsys=%d\tprogress=%d%n",
            totalCpuTime, totalUserTime, totalSysTime,
            progress.longValue());
        System.out.println();
      }

      private boolean isWatched(ThreadInfo info) {
        var threadName = info.getThreadName();
        return threadName.startsWith(carrierThreadPattern) ||
            threadName.startsWith(BENCH_THREAD_NAME);
      }

      private long toMs(long nanos) {
        return nanos / 1_000_000;
      }
    }
Here is an example of running it for a single platform thread that does not yield. It manages to find 9,393,103,267 even numbers, which is the measure of work it managed to do in the 10.3 seconds. Running with "time", we see that real time is 10.7 seconds, user cpu time is 11.1 seconds and system cpu time is 0.2 seconds. Thus the cpu time is close to the real time, which is what we would like to see for a single-threaded test.
    public class SinglePlatformThreadsNoYielding {
      public static void main(String... args) {
        var bench = new TestBench(Thread.ofPlatform(), 1, false);
        bench.runExperiment();
      }
    }
    // allEvenNumbers = 9,393,103,267
    // real 0m10.717s
    // user 0m11.092s
    // sys 0m0.184s
In our next test, we have as many platform threads as we have hardware threads or cores, in my case 12. We were able to do 9x more work with the 12 cores, which is an acceptable speedup. After all, the ghost of Amdahl is demanding his pound of bits. This time, we see a total CPU time of 115 seconds, which is almost 11x the real time. Again, all within an acceptable range:
    public class SeveralPlatformThreadsNoYielding {
      public static void main(String... args) {
        var bench = new TestBench(Thread.ofPlatform(),
            Runtime.getRuntime().availableProcessors(), false);
        bench.runExperiment();
      }
    }
    // allEvenNumbers = 85,245,904,577
    // real 0m10.721s
    // user 1m54.402s
    // sys 0m0.608s
Running these experiments with virtual threads gives similar results. However, note that when we run CPU based code on the carrier threads, they cannot be used for anything else. We thus recommend regularly making thread dumps and searching for "Carrying". Ideally the carrier threads shouldn't be stuck doing grunt CPU work. Rather outsource that work to the common ForkJoinPool.
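Here is a minimal sketch of what such outsourcing could look like, assuming Java 21 or later. The class OffloadCpuWork and its methods are hypothetical names for illustration. Without an explicit executor, CompletableFuture.supplyAsync() normally runs the task in the common ForkJoinPool, and the virtual thread simply waits for the result:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicLong;

public class OffloadCpuWork {
  /** CPU-bound work: counts the even numbers in [0, limit). */
  static long countEvens(int limit) {
    long evens = 0;
    for (int i = 0; i < limit; i += 2) evens++;
    return evens;
  }

  /** Starts a virtual thread that hands the counting to a pool. */
  public static long countFromVirtualThread(int limit) {
    var result = new AtomicLong();
    var vthread = Thread.startVirtualThread(() -> {
      // The grunt work runs in the pool. join() blocks only the
      // virtual thread, which can then be unmounted.
      var future =
          CompletableFuture.supplyAsync(() -> countEvens(limit));
      result.set(future.join());
    });
    try {
      vthread.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted", e);
    }
    return result.get();
  }

  public static void main(String... args) {
    System.out.printf("evens = %,d%n",
        countFromVirtualThread(1_000_000));
  }
}
```

The limit is kept well below Integer.MAX_VALUE on purpose, so that this loop terminates, unlike the one in the TestBench.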
It gets interesting when we yield a lot. For example, here is a single platform thread, yielding in every loop cycle. It is again spending a large share of its CPU time in system time (about 28%, going by the "time" figures), and does 127x less work than the non-yielding single-threaded experiment.
    public class SinglePlatformThreadYielding {
      public static void main(String... args) {
        // -XX:+DontYieldALot support removed in Java 24
        var bench = new TestBench(Thread.ofPlatform(), 1, true);
        bench.runExperiment();
      }
    }
    // allEvenNumbers = 73,903,324
    // real 0m10.692s
    // user 0m8.164s
    // sys 0m3.235s
It gets remarkably worse with a single virtual thread, so much so that I'm tempted to log a bug report. It is almost 1200x slower than the non-yielding single virtual thread experiment. Also, even though it is a single-threaded experiment, it uses a whopping 22.6 seconds of user time and 57 seconds of system time. The system time is 71% of the total cpu time.
    public class SingleVirtualThreadYielding {
      public static void main(String... args) {
        var bench = new TestBench(Thread.ofVirtual(), 1, true);
        bench.runExperiment();
      }
    }
    // allEvenNumbers = 7,959,055
    // real 0m10.677s
    // user 0m22.577s
    // sys 0m56.556s
When we have several yielding platform threads, our system time starts using up 86% of our total cpu time. We are just thrashing.
    public class SeveralPlatformThreadsYielding {
      public static void main(String... args) {
        // -XX:+DontYieldALot support removed in Java 24
        var bench = new TestBench(Thread.ofPlatform(),
            Runtime.getRuntime().availableProcessors(), true);
        bench.runExperiment();
      }
    }
    // allEvenNumbers = 92,293,566
    // real 0m10.680s
    // user 0m16.067s
    // sys 1m40.791s
The same experiment with virtual threads is about 2x faster, but still 500x slower than the non-yielding version. Since all the state changes are happening in the managed threads and it seems that threads are using up their time quantums, there is very little system time being used.
    public class SeveralVirtualThreadsYielding {
      public static void main(String... args) {
        var bench = new TestBench(Thread.ofVirtual(),
            Runtime.getRuntime().availableProcessors(), true);
        bench.runExperiment();
      }
    }
    // allEvenNumbers = 168,633,713
    // real 0m10.713s
    // user 1m55.022s
    // sys 0m0.686s
It would be interesting to dig into why the single yielding virtual thread uses so much user and system time. It may well be a performance bug that needs to be addressed. Perhaps it has something to do with the stealing in the ForkJoinPool? It looks like a fun performance project to look into, unless someone beats me to it.
For now, be very careful if you have any code that you might be calling that invokes Thread.yield(). We should probably make regular thread dumps and grep for yield, just in case this shows up somewhere.
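If grepping a thread dump is not convenient, a rough in-process check is also possible. The sketch below assumes that a yielding platform thread shows a java.lang.Thread frame whose method name starts with "yield" (the exact frame names are a JDK implementation detail). YieldSpotter and its methods are hypothetical names. Note that Thread.getAllStackTraces() only covers platform threads. For virtual threads, I would rather take a dump with jcmd's Thread.dump_to_file command and search that.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class YieldSpotter {
  /** True if any frame is a java.lang.Thread yield call. */
  public static boolean isYielding(StackTraceElement[] stack) {
    for (StackTraceElement frame : stack) {
      if (frame.getClassName().equals("java.lang.Thread")
          && frame.getMethodName().startsWith("yield"))
        return true;
    }
    return false;
  }

  /** Names of live platform threads currently seen in yield. */
  public static Set<String> threadsInYield() {
    Set<String> names = new TreeSet<>();
    for (Map.Entry<Thread, StackTraceElement[]> entry :
        Thread.getAllStackTraces().entrySet()) {
      if (isYielding(entry.getValue()))
        names.add(entry.getKey().getName());
    }
    return names;
  }

  public static void main(String... args) {
    // A daemon thread that does nothing but yield
    Thread spinner = new Thread(() -> {
      while (!Thread.currentThread().isInterrupted())
        Thread.yield();
    }, "yield-spinner");
    spinner.setDaemon(true);
    spinner.start();
    // A single sample can miss the thread, so we try a few times
    for (int i = 0; i < 1_000; i++) {
      Set<String> found = threadsInYield();
      if (!found.isEmpty()) {
        System.out.println("Threads in yield: " + found);
        break;
      }
    }
    spinner.interrupt();
  }
}
```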
Kind regards
Heinz