Abstract: When are Strings deduplicated? How can we find out whether they are of any benefit in our application? How much does it cost in terms of CPU? We try to show you how we can get this from the deduplication statistics provided by the JVM.
Welcome to the 270th edition of The Java(tm) Specialists' Newsletter, sent to you from beautiful Crete. A very special welcome to my friends in Suriname, bringing the countries that read this newsletter to at least 149.
This week at jPrime: "Your newsletter is the most enjoyable thing for me to do on the toilet." I kid you not. Only complaint was that the default email format that I've been using is not responsive, so the fonts were impossible to read on his phone. The web version works fine. OK, ok, way too much information, I know. And the "most enjoyable thing"? Um, alright then. Well, here goes. I've tried to make the email more responsive and readable on cell phones. Let's see if it works.
javaspecialists.teachable.com: Please visit our new self-study course catalog to see how you can upskill your Java knowledge.
After my talk at jPrime 2019, which you can watch here, someone asked me how one could tell how effective string deduplication would be in their application. To me the best way of measuring is to look at the response rates and memory consumption. In the end, what does it matter if we have saved memory, but our performance is worse? Memory is cheap. On the other hand, less memory consumption leads to less reachable objects, and can thus give us better speed.
After wondering how I could get a peek into the deduplication
data, I remembered that there was a flag
-XX:+PrintStringDeduplicationStatistics
in Java
8. This changed to -Xlog:stringdedup*=debug
with Java 9 unified logging. Each time that deduplication is
triggered, it outputs a ream of useful information such as:
Concurrent String Deduplication (90.162s) Concurrent String Deduplication 72.0B->24.0B(48.0B) avg 59.8% (90.162s, 90.162s) 0.020ms Last Exec: 0.020ms, Idle: 16740.738ms, Blocked: 0/0.000ms Inspected: 3 Skipped: 0( 0.0%) Hashed: 0( 0.0%) Known: 0( 0.0%) New: 3(100.0%) 72.0B Deduplicated: 2( 66.7%) 48.0B( 66.7%) Young: 2(100.0%) 48.0B(100.0%) Old: 0( 0.0%) 0.0B( 0.0%) Total Exec: 3/1.014ms, Idle: 3/89992.409ms, Blocked: 0/0.000ms Inspected: 8457 Skipped: 0( 0.0%) Hashed: 610( 7.2%) Known: 5640( 66.7%) New: 2817( 33.3%) 86.7K Deduplicated: 1810( 64.3%) 51.8K( 59.8%) Young: 1810(100.0%) 51.8K(100.0%) Old: 0( 0.0%) 0.0B( 0.0%) Table Memory Usage: 187.7K Size: 4096, Min: 1024, Max: 16777216 Entries: 6644, Load: 162.2%, Cached: 0, Added: 6645, Removed: 1 Resize Count: 2, Shrink Threshold: 2730(66.7%), Grow Threshold: 8192(200.0%) Rehash Count: 0, Rehash Threshold: 120, Hash Seed: 0x0 Age Threshold: 3 Queue Dropped: 0
Every time a young GC is triggered (also called scavenge or partial GC), the age of our survivors increases. Once they reach the tenuring threshold, they are moved from the young to the old regions. By default, strings are deduplicated once they reach age 3, thus at their third young GC. We don't want to do it earlier, as deduplication costs precious CPU cycles. If the string won't survive into old age, there is also no need to try save memory. If for some reason the string is prematurely promoted into old, then it is deduplicated early as well.
To try this all out, I wrote my DeduplicationExplorer. To
get the most benefit, you'll have to run it yourself. You
connect to it via telnet localhost 8080
and then
send it text and commands. All the text is stored in an
ArrayList.
Here are the commands that we support:
print
: shows the strings in the list, together
with the value array referencesclear
: clears the strings from the listygc
: causes a young GCfgc
: causes a full GCclose
: closes the connectionHere is the code for DeduplicationExplorer. I recommend running it yourself:
import java.io.*; import java.lang.management.*; import java.lang.reflect.*; import java.net.*; import java.util.*;
// Java8: // -XX:+UseG1GC // -XX:+UseStringDeduplication // -XX:+PrintStringDeduplicationStatistics // -verbose:gc // Java11: // -XX:+UseStringDeduplication // -Xlog:stringdedup*=debug // -verbose:gc public class DeduplicationExplorer { public static void main(String... args) throws IOException { List<String> lines = new ArrayList<>(); Socket socket = new ServerSocket(8080).accept(); PrintStream out = new PrintStream( socket.getOutputStream(), true); out.println("Commands: clear, print, ygc, fgc, close"); BufferedReader in = new BufferedReader( new InputStreamReader( socket.getInputStream())); String line; while ((line = in.readLine()) != null) { System.out.println(line); switch (line) { case "clear": lines.clear(); break; case "print": print(lines); break; case "ygc": youngGC(); break; // young GC case "fgc": System.gc(); break; // full GC case "close": return; default: lines.add(line); } } }
private static void youngGC() { long collectionCount = YOUNG_GC.getCollectionCount(); do { // array is too big to be eliminated with escape analysis byte[] bytes = new byte[1024]; } while (YOUNG_GC.getCollectionCount() == collectionCount); }
private static void print(List<String> lines) { System.out.println("lines:"); lines.forEach(DeduplicationExplorer::print); }
private static void print(String line) { try { System.out.printf("\t\"%s\" - %s%n", line, VALUE.get(line)); } catch (IllegalAccessException e) { throw new IllegalStateException(e); } }
private static final Field VALUE;
static { try { VALUE = String.class.getDeclaredField("value"); VALUE.setAccessible(true); } catch (NoSuchFieldException e) { throw new Error(e); } }
private static final GarbageCollectorMXBean YOUNG_GC;
static { YOUNG_GC = ManagementFactory.getGarbageCollectorMXBeans() .stream() .filter(pool -> pool.getName().equals( "G1 Young Generation")) .findFirst() .orElseThrow(() -> new Error("Could not find G1 Young " + "Generation Garbage Collector MXBean")); } }
For example, I started the Oracle OpenJDK 12.0.1 with the following flags:
-XX:+UseStringDeduplication -Xlog:stringdedup*=debug -verbose:gc
I then connected with telnet localhost 8080
and
sent it:
hello hello hello print
On the program output, we saw this:
hello hello hello print lines: "hello" - [B@5d099f62 "hello" - [B@37f8bb67 "hello" - [B@49c2faae
As you see, the three strings each have their own byte[]. The hexadecimal number is the identity hash code, a random number that is fairly unique.
Next we send the command fgc
- Full GC. In our
DeduplicationExplorer, we see the GC event and the
deduplication statistics.
If we now send the print
command, we will see
that all the strings point to the same byte[]
.
If we clear the ArrayList with command clear
,
and then issue a full GC with fgc
, then the
shared byte[] will be collected. If we send "hello" three
more times and call fgc
and print
we will see something like:
fgc *snip* GC output and deduplication statistics *snip* print lines: "hello" - [B@2e0fa5d3 "hello" - [B@2e0fa5d3 "hello" - [B@2e0fa5d3
If the string happened to already exist in the JVM as a
constant, then that will be the basis of our shared byte[].
For example, let's clear
and then send the
string "main" three times, followed by print
.
We see this:
clear main main main print lines: "main" - [B@5010be6 "main" - [B@685f4c2e "main" - [B@7daf6ecc
If we now issue a fgc
and a print
we see:
fgc *snip* GC output and deduplication statistics *snip* print lines: "main" - [B@1f7e245f "main" - [B@1f7e245f "main" - [B@1f7e245f
Note that the byte[]
is none of those we had
read from the BufferedReader. And if we clear
and send "main" again and then issue a fgc
and a print
, we will see again
print lines: "main" - [B@1f7e245f
Since we have a method called "main" in our JVM, all our strings are sharing its byte[].
So far we have only considered full GC. However, I also have
support for triggering a young GC with the command
ygc
. I'm pretty proud of that code, as I woke
up at 5am and wrote it in my mind in my half sleep before
dozing off again. Usually code that I write whilst dreaming
does not work, but this does :-) I first find the
garbage collector MXBean for the G1 Young Generation.
I then loop whilst the collection count is the same and
allocate 1k byte arrays. Escape analysis won't remove the
heap allocation since the arrays are larger than 64. Here
is my dreamy code again:
private static void youngGC() { long collectionCount = YOUNG_GC.getCollectionCount(); do { // array is too big to be eliminated with escape analysis byte[] bytes = new byte[1024]; } while (YOUNG_GC.getCollectionCount() == collectionCount); }
private static final GarbageCollectorMXBean YOUNG_GC;
static { YOUNG_GC = ManagementFactory.getGarbageCollectorMXBeans() .stream() .filter(pool -> pool.getName().equals( "G1 Young Generation")) .findFirst() .orElseThrow(() -> new Error("Could not find G1 Young " + "Generation Garbage Collector MXBean")); }
We now issue the following commands:
hello hello hello ygc hello ygc hello ygc print
Each time that we send the ygc
we see a
"Pause Young (Normal)" appear on the console. After the
third ygc
, we also see the deduplication
statistics appear. The print
now looks like
this:
print lines: "hello" - [B@5d099f62 "hello" - [B@5d099f62 "hello" - [B@5d099f62 "hello" - [B@37f8bb67 "hello" - [B@49c2faae
Thus the first three "hello" strings have been deduplicated,
but not hte other two. If we issue another ygc
and print
, this now becomes:
print lines: "hello" - [B@5d099f62 "hello" - [B@5d099f62 "hello" - [B@5d099f62 "hello" - [B@5d099f62 "hello" - [B@49c2faae
And with the next ygc
, the print
yields:
print lines: "hello" - [B@5d099f62 "hello" - [B@5d099f62 "hello" - [B@5d099f62 "hello" - [B@5d099f62 "hello" - [B@5d099f62
It all works how it is supposed to.
-XX:+AlwaysTenure
A rather weird flag in OpenJDK is -XX:+AlwaysTenure
.
It effectively does away with the survivor spaces and promotes
all survivors from young to old. Whilst you would not want
to use this in production, it is fun to turn it on for our
experiment. If we now send the following:
hello hello hello ygc print
We will see similar output to what we saw with the call to System.gc():
hello hello hello ygc [154.162s][info][gc] GC(0) Pause Young (Normal) *snip* GC output and deduplication statistics *snip* print lines: "hello" - [B@5d099f62 "hello" - [B@5d099f62 "hello" - [B@5d099f62
Thank you for your support and for reading this newsletter.
Kind regards from Crete
Heinz
We are always happy to receive comments from our readers. Feel free to send me a comment via email or discuss the newsletter in our JavaSpecialists Slack Channel (Get an invite here)
We deliver relevant courses, by top Java developers to produce more resourceful and efficient programmers within their organisations.