Mar 23, 2015

Java 8 New Feature: String Deduplication

String objects consume a large amount of memory even in a small application, and Java has received several String-related enhancements over the years. Java 7 Update 6 changed String.substring() to copy the characters it needs instead of sharing the parent string's char[], which costs some extra copying but removes a well-known source of memory leaks. Java 8 Update 20 attacks memory use from another angle: it sacrifices a little CPU efficiency in order to reduce the memory footprint. In the long run this should reduce the strain on the CPU too, because a smaller heap means less work for the garbage collector. In other words, most real-world programs should run faster.
Java programs create a huge number of character arrays. Most of them belong to String objects: each String holds a reference to a char[] containing its characters. In practice, many Strings are duplicates of one another, and the duplicates consume a large amount of heap space.
The developers of Java became aware of the problem a long time ago and provided the String.intern() method. Its disadvantage is that you have to work out which strings should be interned, which generally requires a heap analysis tool that can look up duplicate strings.
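To make the idea concrete, here is a minimal sketch of what String.intern() does. The class name InternDemo and the helper sameObject are made up for illustration; only String.intern() itself is part of the JDK.

```java
public class InternDemo {
    /** Returns true when both references point at the same String object. */
    static boolean sameObject(String a, String b) {
        return a == b;
    }

    public static void main(String[] args) {
        // new String(...) forces two distinct objects with equal contents,
        // much like values parsed from a file or a network request.
        String first = new String("London");
        String second = new String("London");

        System.out.println(first.equals(second));      // compares contents
        System.out.println(sameObject(first, second)); // compares object identity

        // intern() returns the canonical instance from the JVM string pool,
        // so both calls hand back the same object.
        String a = first.intern();
        String b = second.intern();
        System.out.println(sameObject(a, b));
    }
}
```

Once every holder keeps the interned reference, the two original copies become garbage and can be collected.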
String deduplication works like an automatic String.intern() performed as part of garbage collection. Since Java 8 Update 20, the garbage collector can recognize duplicate Strings and make them share a single char[]. Keep the following things in mind:

  • You need to use the G1 garbage collector and turn the feature on: -XX:+UseG1GC -XX:+UseStringDeduplication. The feature is implemented as an optional step of the G1 collector and is not available with any other garbage collector. In particular, you can’t use it with the parallel GC, which is generally a better choice for applications favouring throughput over latency.
  • This feature may be executed during a minor GC of the G1 collector. So don’t expect much from it in a data cruncher that has all the data it processes available locally. A web server, on the other hand, is likely to execute it very often.
  • String deduplication looks for strings it has not processed yet, calculates their hash codes (if the application code has not already done so), and then checks whether another string exists with the same hash code and an equal underlying char[]. If one is found, the new string’s char[] is replaced with the existing string’s char[].
  • String deduplication processes only strings that have survived a few garbage collections, which ensures that the majority of very short-lived strings are never processed. The minimum string age is controlled by the -XX:StringDeduplicationAgeThreshold=3 JVM parameter (3 is the default value).
  • Finally, remember that String.intern() lets you target only the subset of strings in your application that is known to contain duplicates. This generally means a smaller pool of interned strings to compare against, so you use your CPU more efficiently. Besides, it interns full String objects rather than only their char[] arrays, thus saving an extra 24 bytes per string.
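
The lookup-and-replace step described in the list can be pictured with a toy model. This is only a sketch of the idea, not the JVM's actual implementation: the classes DedupSketch, Str and Key are hypothetical, Str stands in for java.lang.String, and a plain HashMap plays the role of the collector's deduplication table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DedupSketch {
    /** Hypothetical stand-in for java.lang.String: a small object pointing at its characters. */
    static final class Str {
        char[] value;
        int age;            // number of collections survived
        boolean processed;  // set once the string has been examined

        Str(String s) { this.value = s.toCharArray(); }
    }

    /** Wraps a char[] so it can serve as a map key compared by content. */
    static final class Key {
        final char[] chars;
        Key(char[] chars) { this.chars = chars; }
        @Override public int hashCode() { return Arrays.hashCode(chars); }
        @Override public boolean equals(Object o) {
            return o instanceof Key && Arrays.equals(chars, ((Key) o).chars);
        }
    }

    // Mirrors the default of -XX:StringDeduplicationAgeThreshold.
    static final int AGE_THRESHOLD = 3;

    private final Map<Key, char[]> table = new HashMap<>();

    /** One pass over the toy heap; returns how many strings were re-pointed at an existing array. */
    int deduplicate(List<Str> heap) {
        int merged = 0;
        for (Str s : heap) {
            // Skip strings that were already examined or are still too young.
            if (s.processed || s.age < AGE_THRESHOLD) continue;
            s.processed = true;
            // Same hash code and equal contents: reuse the array seen first.
            char[] existing = table.putIfAbsent(new Key(s.value), s.value);
            if (existing != null && existing != s.value) {
                s.value = existing;
                merged++;
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Str a = new Str("GET");
        Str b = new Str("GET");
        Str young = new Str("GET");
        a.age = 3;
        b.age = 3; // old enough; 'young' stays at age 0

        List<Str> heap = new ArrayList<>(Arrays.asList(a, b, young));
        int merged = new DedupSketch().deduplicate(heap);

        System.out.println(merged);                 // strings re-pointed in this pass
        System.out.println(a.value == b.value);     // do a and b share one array?
        System.out.println(young.value == a.value); // does the young string share it?
    }
}
```

Note that the two Str objects themselves remain separate; only their arrays are shared. That is the difference from String.intern() mentioned in the last point above. The real feature needs no code changes at all and is enabled purely through the JVM flags.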

