Lesson 4: Using Strings
Introduction
Strings have always been and will always be special because there are so many different ways of representing character-based data in C/C++. Historically, there were ASCII strings where each character occupied 7 bits of a byte. Then arose the need for some special characters like umlauts or accented characters. That gave rise to code pages that mapped full 8-bit character codes to ASCII plus a set of special characters.
Finally, we got Unicode, which assigns a unique code point to every character. The full picture is quite a bit more complex once you take combining characters and the different encodings (UTF-8, UTF-16, UTF-32) into account.
The basic character types in C/C++ are 8-bit characters represented by the char type and wide characters represented by the wchar_t type. The wchar_t type can be 16 or 32 bits wide, depending on the compiler and platform. String literals can be represented as:
const char * szString = "string value";
const wchar_t * wszString = L"string value";
Standard C++ introduced the basic_string template with its type aliases string (for basic_string<char>) and wstring (for basic_string<wchar_t>) to represent regular character and wide character strings, so you can also write:
std::string mString = "string value";
std::wstring mwString = L"string value";
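Because the width of wchar_t varies by platform, the same text can occupy a different number of bytes depending on where the code is compiled. The following is a minimal sketch in standard C++ that makes this visible; the helper names wide_char_bits and payload_bytes are made up for illustration and are not part of any library.

```cpp
#include <climits>
#include <cstddef>
#include <string>

// Hypothetical helper: number of bits in one wchar_t on this platform.
constexpr std::size_t wide_char_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// Hypothetical helper: bytes occupied by the characters of a string,
// not counting the terminating null character.
template <typename CharT>
std::size_t payload_bytes(const std::basic_string<CharT>& s) {
    return s.size() * sizeof(CharT);
}
```

Calling payload_bytes on a std::string and on a std::wstring holding the same text shows how much larger the wide representation is on your platform.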
Java, on the other hand, kept it much simpler: inside the JVM, a string is a sequence of Java chars, which are UTF-16 code units. Since Java 9 there is also a more compact string representation that may be chosen when the string data allows it.
Create Java Strings from C++ Strings
We are of course dealing with the intersection of Java and C++, so we need to provide an easy-to-use gateway to create Java strings from C/C++ strings and vice versa.
We have tried to make this as easy as possible by providing conversion constructors from several C/C++ string types.
In general, you can simply assign a string literal to a proxy String instance:
// from character literal
java::lang::String str1 = "string value";
// from wide character literal
java::lang::String str2 = L"string value";
// from std::string
std::string mStr = "string value";
java::lang::String str3 = mStr;
This simplicity hides a lot of heavy lifting going on behind the scenes. To create a Java string from native characters, we need to know the native string's encoding. If we know that it is UTF-8 encoded, we can set the runtime's default encoding to "UTF-8", which will result in a much faster conversion of native characters to a Java string. When no default encoding is set or when it is set to a different value, the runtime must first convert the characters to a Java byte[] and then call one of the String constructors that take a byte[] as input.
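To see why the encoding has to be known, consider that the same native bytes produce different UTF-16 code units depending on how they are interpreted. The sketch below is standard C++ only and is not the runtime's actual conversion code; the function names decode_latin1 and decode_utf8 are illustrative, and the UTF-8 decoder is simplified (it does not reject overlong forms or encoded surrogates).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Interpret the bytes as ISO-8859-1: each byte maps to the code point
// with the same numeric value.
std::u16string decode_latin1(const std::string& bytes) {
    std::u16string out;
    for (unsigned char b : bytes) {
        out.push_back(static_cast<char16_t>(b));
    }
    return out;
}

// Interpret the bytes as UTF-8 and produce UTF-16 code units.
// Throws std::invalid_argument on malformed input.
std::u16string decode_utf8(const std::string& bytes) {
    std::u16string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i++]);
        char32_t cp;
        int extra;  // number of continuation bytes that must follow
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        for (; extra > 0; --extra) {
            if (i >= bytes.size()) {
                throw std::invalid_argument("truncated UTF-8 sequence");
            }
            unsigned char next = static_cast<unsigned char>(bytes[i++]);
            if ((next & 0xC0) != 0x80) {
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | (next & 0x3F);
        }
        if (cp > 0x10FFFF) {
            throw std::invalid_argument("code point out of range");
        }

        if (cp < 0x10000) {
            // Fits into a single UTF-16 code unit.
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Needs a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

For the two bytes 0xC3 0xA4, decode_utf8 yields the single code unit U+00E4, whereas decode_latin1 yields two code units, U+00C3 and U+00A4. A wrong assumption about the encoding therefore silently produces a different Java string.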
If performance matters to you and you can guarantee that your native strings are UTF-8 encoded (or simple ASCII), you should definitely consider setting the default encoding to "UTF-8", for example via:
xmog_jvm_loader & loader = xmog_jvm_loader::get_jvm_loader();
loader.setDefaultEncoding("UTF-8");
You can also do this just for your current thread by calling setDefaultEncoding() on your xmog_localenv pointer, or for all subsequently created Java threads by calling setDefaultEncoding() on your xmog_jvm pointer.
Create C++ Strings from Java Strings
This direction adds a new complication. When going from native to Java, the native string is only used as input and we simply convert it to a Java string. The created Java string is always owned by a proxy object that is in charge of disposing of it when the proxy is destroyed. The original native string remains unaffected, and it is self-evident that it is the programmer's job to keep track of it.
If we convert a Java string to a native string on the other hand, we are extracting the C characters (wide or regular) from a Java string, either into a pre-allocated buffer or into a newly allocated string. The big question here is: Who's in charge of the buffer or the dynamically allocated string?
The proxy String class provides a family of to_XXX conversion functions that are all implemented internally via their corresponding xmog_java_string::to_XXX functions. The following snippet shows the three most useful ones in use:
java::lang::String str1 = "str1", str2 = "str2", str3 = "str3";
char * pStr1 = str1.to_chars();
char * pStr2 = str2.to_charsUtf8();
wchar_t * pStr3 = str3.to_wchars();
Via the xmog_java_string class we provide several utility methods that retrieve the characters into preallocated buffers or into dynamically allocated memory. When you use the xmog_java_string utility class, you're always in charge of the memory that holds the result. If you did not supply a buffer as input, you will have to free the returned result by calling xmog_java_string::free() on the returned string. When you're using the generated String proxy type, you should study its documentation for more information because there are different variants of the conversion methods.
All of the character accessors declared by the proxy String type return a pointer to a native string that is owned by the proxy instance. This has the advantage of avoiding memory leaks, but it has the disadvantage of tying the returned characters to the String instance's lifecycle: when the String instance is destroyed, the returned character string loses its validity. The following example illustrates this problem:
// --- DO NOT DO THIS ---
char * native_str = NULL;
{
    java::lang::String str = "test";
    // the (char*) conversion operator returns a native string
    // that is owned by the proxy instance
    native_str = (char*)str;
}
// str goes out of scope here and the returned characters get deleted as well

// might or might not work, depending on the allocator being used and whether
// the native_str memory was reused since the cleanup of the 'str' instance
cout << native_str << endl;
If you want the characters to endure you have to make a copy yourself. A useful pattern is shown below:
// --- THIS WORKS ---
std::string native_str;
{
    java::lang::String str = "test";
    // the (char*) conversion operator returns a native string
    // that is owned by the proxy instance, but now it is assigned to a std::string
    // which makes a copy
    native_str = (char*)str;
}
// str goes out of scope here and so does the cached copy of the
// native string pointer, but we have made a copy of it

// this works
cout << native_str << endl;
Avoid hanging on to the native string unless you really have to. Your performance will be much better because you won't have to make copies of the strings all the time.
Patterns to Avoid
As just discussed, the proxy String instance owns the buffer in which the native string value is returned, and by default every invocation of a to_chars() family function will retrieve a new copy of the string. By necessity, it will have to free the previously retrieved string to hold on to the new one. This means that you should not write code like this:
// --- DO NOT DO THIS ---
java::lang::String str = "test";
std::cout << strcmp( str.to_chars(), str.to_chars() ) << std::endl;
In the above snippet, to_chars() is called twice on the same string instance and both returned pointers have to remain valid for strcmp to work, but the second invocation of to_chars() will render the first invocation's result invalid and the most likely result will be a crash.
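The invalidation rule can be reproduced without a JVM. The sketch below uses a hypothetical stand-in class, OwningString, written in standard C++. It is not the generated proxy type; it only imitates the ownership behavior described above, under the assumption that each to_chars() call releases the buffer from the previous call. The helper compare_safely is likewise made up for illustration and shows one way to stay safe: copy each result before calling the accessor again.

```cpp
#include <cstring>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a proxy string; NOT the generated java::lang::String.
// It mimics only the ownership rule: every to_chars() call replaces the
// buffer that was handed out by the previous call on the same object.
class OwningString {
public:
    explicit OwningString(std::string value) : value_(std::move(value)) {}

    const char* to_chars() {
        // Releasing the old buffer here is what invalidates earlier pointers.
        buffer_.reset(new char[value_.size() + 1]);
        std::memcpy(buffer_.get(), value_.c_str(), value_.size() + 1);
        return buffer_.get();
    }

private:
    std::string value_;
    std::unique_ptr<char[]> buffer_;
};

// Safe comparison: each result is copied into a std::string before the
// accessor is called again, so it does not matter whether a and b refer
// to the same object.
int compare_safely(OwningString& a, OwningString& b) {
    const std::string first = a.to_chars();   // copy made here
    const std::string second = b.to_chars();  // may free a's buffer if &a == &b
    return std::strcmp(first.c_str(), second.c_str());
}
```

The copies cost a little time, which is why the pattern is worth using only when two results from the same instance really have to be alive at once.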
It is also easy to forget that a proxy String is not a native string. Trying to use a proxy string with a function from the printf() family will not work. Luckily, most modern C++ compilers will give you compiler warnings or errors for this.
A final word of caution about global String instances. Take a look at the following snippet that is assumed to be at the head of a .cpp file:
#include "java_lang_pkg.h"
#include "java_util_pkg.h"

static String GLOBAL_STR = "a value";
static Hashtable GLOBAL_HT = null;
...
At first glance, there's absolutely nothing wrong with this code. Assuming the header files are available, it will compile and link just fine. It might even run fine if the runtime library can locate a default JVM.
The problem is that the GLOBAL_STR initialization triggers the on-demand loading of the JVM in order to translate the native string to a Java string. Once the JVM has been loaded, no further customizations to its settings will have any effect. In particular, the code in your main() function that configures the classpath will be ineffectual because it will be executed after the constructors of global objects have been executed. Global String instances are probably the #1 culprit for unintentional JVM loading and should therefore be avoided.
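One common C++ way to avoid work during static initialization is to wrap the object in a function with a function-local static, so that it is constructed on first use rather than before main() runs. This is a general C++ idiom and an assumption on our part, not something this lesson's runtime prescribes. The sketch below uses stand-ins only: FakeProxyString, g_jvm_loaded, configure_jvm, and global_str are all hypothetical names that model "constructing a string loads the JVM" without any real JVM.

```cpp
#include <string>

// Hypothetical stand-in for "the JVM has been loaded".
static bool g_jvm_loaded = false;

// Hypothetical stand-in for a proxy String whose constructor
// triggers on-demand loading of the JVM.
struct FakeProxyString {
    explicit FakeProxyString(const char* v) : value(v) { g_jvm_loaded = true; }
    std::string value;
};

// Stand-in for configuration performed in main(), such as setting the
// classpath. Returns true if the setting can still take effect.
bool configure_jvm() {
    return !g_jvm_loaded;
}

// Instead of a file-scope object such as
//     static FakeProxyString GLOBAL_STR("a value");
// the instance is created the first time this function is called.
FakeProxyString& global_str() {
    static FakeProxyString instance("a value");
    return instance;
}
```

With this arrangement, configuration code that runs at the start of main() executes before the first call to global_str(), so the stand-in JVM is not yet loaded at that point. Whether the same idiom suits the real proxy types depends on how and when your program first touches them.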