Oct 11, 2017

String intern method - Java 6, 7 and 8

This article will describe how String.intern method was implemented in Java 6 and what changes were made in it in Java 7 and Java 8.

String pooling

String pooling (aka string canonicalisation) is a process of replacing several String objects with equal value but different identity with a single shared String object. You can achieve this goal by keeping your own Map<String, String> (with possibly soft or weak references depending on your requirements) and using map values as canonicalised values. Or you can use String.intern()method which is provided to you by JDK.
At times of Java 6 using String.intern() was forbidden by many standards due to a high possibility to get an OutOfMemoryException if pooling went out of control. Oracle Java 7 implementation of string pooling was changed considerably. You can look for details at http://bugs.sun.com/view_bug.do?bug_id=6962931 and http://bugs.sun.com/view_bug.do?bug_id=6962930.

String.intern() in Java 6

In those good old days all interned strings were stored in the PermGen – the fixed size part of heap mainly used for storing loaded classes and string pool. Besides explicitly interned strings, PermGen string pool also contained all literal strings earlier used in your program (the important word here is used – if a class or method was never loaded/called, any constants defined in it will not be loaded).
The biggest issue with such string pool in Java 6 was its location – the PermGen. PermGen has a fixed size and can not be expanded at runtime. You can set it using -XX:MaxPermSize=N option. As far as I know, the default PermGen size varies between 32M and 96M depending on the platform. You can increase its size, but its size will still be fixed. Such limitation required very careful usage of String.intern – you’d better not intern any uncontrolled user input using this method. That’s why string pooling at times of Java 6 was mostly implemented in the manually managed maps.

String.intern() in Java 7

Oracle engineers made an extremely important change to the string pooling logic in Java 7 – the string pool was relocated to the heap. It means that you are no longer limited by a separate fixed size memory area. All strings are now located in the heap, as most of other ordinary objects, which allows you to manage only the heap size while tuning your application. Technically, this alone could be a sufficient reason to reconsider using String.intern() in your Java 7 programs. But there are other reasons.

String pool values are garbage collected
Yes, all strings in the JVM string pool are eligible for garbage collection if there are no references to them from your program roots. It applies to all discussed versions of Java. It means that if your interned string went out of scope and there are no other references to it – it will be garbage collected from the JVM string pool.
Being eligible for garbage collection and residing in the heap, a JVM string pool seems to be a right place for all your strings, isn’t it? In theory it is true – non-used strings will be garbage collected from the pool, used strings will allow you to save memory in case then you get an equal string from the input. Seems to be a perfect memory saving strategy? Nearly so. You must know how the string pool is implemented before making any decisions.

JVM string pool implementation in Java 6, 7 and 8

The string pool is implemented as a fixed capacity hash map with each bucket containing a list of strings with the same hash code. Some implementation details could be obtained from the following Java bug report: http://bugs.sun.com/view_bug.do?bug_id=6962930.
The default pool size is 1009 (it is present in the source code of the above mentioned bug report, increased in Java7u40). It was a constant in the early versions of Java 6 and became configurable between Java6u30 and Java6u41. It is configurable in Java 7 from the beginning (at least it is configurable in Java7u02). You need to specify -XX:StringTableSize=N, where N is the string pool map size. Ensure it is a prime number for the better performance.
This parameter will not help you a lot in Java 6, because you are still limited by a fixed size PermGen size. The further discussion will exclude Java 6.

Java7 (until Java7u40)

In Java 7, on the other hand, you are limited only by a much higher heap size. It means that you can set the string pool size to a rather high value in advance (this value depends on your application requirements). As a rule, one starts worrying about the memory consumption when the memory data set size grows to at least several hundred megabytes. In this situation, allocating 8-16 MB for a string pool with one million entries seems to be a reasonable trade off (do not use 1,000,000 as a -XX:StringTableSize value – it is not prime; use 1,000,003 instead).
You must set a higher -XX:StringTableSize value (compared to the default 1009) if you intend to actively use String.intern() – otherwise this method performance will soon degrade to a linked list performance.
I have not noticed a dependency from a string length to a time to intern a string for string lengths under 100 characters (I feel that duplicates of even 50 character long strings are rather unlikely in the real world data, so 100 chars seems to be a good test limit for me).
Here is an extract from the test application log with the default pool size: time to intern 10.000 strings (second number) after a given number of strings was already interned (first number);Integer.toString( i ), where i between 0 and 999,999 were interned:
0; time = 0.0 sec
50000; time = 0.03 sec
100000; time = 0.073 sec
150000; time = 0.13 sec
200000; time = 0.196 sec
250000; time = 0.279 sec
300000; time = 0.376 sec
350000; time = 0.471 sec
400000; time = 0.574 sec
450000; time = 0.666 sec
500000; time = 0.755 sec
550000; time = 0.854 sec
600000; time = 0.916 sec
650000; time = 1.006 sec
700000; time = 1.095 sec
750000; time = 1.273 sec
800000; time = 1.248 sec
850000; time = 1.446 sec
900000; time = 1.585 sec
950000; time = 1.635 sec
1000000; time = 1.913 sec
These test results were obtained on Core i5-3317U@1.7Ghz CPU. As you can see, they grow linearly and I was able to intern only approximately 5,000 strings per second when the JVM string pool size contained one million strings. It is unacceptably slow for most of applications having to handle a large amount of data in memory.

0 comments:

Post a Comment