: code tutorial (v3)

Lesson 4: Using Strings

Introduction

Strings have always been and will always be special because there are so many different ways of representing character-based data in C/C++. Historically, there were ASCII strings where each character occupied 7 bits of a byte. Then arose the need for some special characters like umlauts or accented characters. That gave rise to code pages that mapped full 8-bit character codes to ASCII plus a set of special characters.

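Finally, we got Unicode, in which every character is assigned a unique code point. (Strictly speaking, Unicode encodes characters, not glyphs; a glyph is the visual representation of a character.) The full picture is quite a bit more complex once you take combining characters and the different encodings (UTF-8, UTF-16, UTF-32) into account.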

The basic character types in C/C++ are 8-bit characters represented by the char type and wide characters represented by the wchar_t type. The wchar_t type can be 16 or 32 bits wide, depending on the compiler and platform. String literals can be represented as:

const char *     szString = "string value";
const wchar_t *  wszString = L"string value";
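
Because the width of wchar_t differs between platforms, it can help to check it rather than assume it. The following is a minimal sketch in standard C++ only; the helper names wide_char_bits, narrow_length and wide_length are made up for this illustration and are not part of the runtime library.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Width of wchar_t in bits on the current compiler and platform.
constexpr std::size_t wide_char_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// Number of characters before the terminating zero in a narrow string.
std::size_t narrow_length(const char* s) {
    assert(s != nullptr);
    return std::strlen(s);
}

// Same for a wide string; each element is one wchar_t, whatever its width.
std::size_t wide_length(const wchar_t* s) {
    assert(s != nullptr);
    return std::wcslen(s);
}
```

Both length helpers report the same count for "string value" and L"string value"; only the storage per character differs.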

Standard C++ introduced the basic_string template, with std::string and std::wstring as its instantiations for char and wchar_t, to represent regular character and wide character strings, so you can also write:

std::string     mString = "string value";
std::wstring    mwString = L"string value";

Java, on the other hand, kept it much simpler: inside the JVM a string is a sequence of Java chars, which are UTF-16 code units. Since Java 9 there is also a more compact string representation that may be chosen when the string data allows it.
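
One consequence of the UTF-16 representation is that the number of Java chars is not always the number of characters: a character outside the Basic Multilingual Plane occupies two chars (a surrogate pair). In Java this is the difference between String.length() and String.codePointCount(). The sketch below shows the same distinction with a standard C++ std::u16string; count_code_points is a hypothetical helper and assumes well-formed UTF-16 input.

```cpp
#include <cstddef>
#include <string>

// Count Unicode code points in a well-formed UTF-16 string.
// Every code point contributes exactly one unit that is NOT a low
// surrogate (0xDC00..0xDFFF), so counting those units is enough.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t unit : s) {
        if (unit < 0xDC00 || unit > 0xDFFF) {
            ++n;
        }
    }
    return n;
}
```

For the string u"a\U0001F600", size() reports three code units while count_code_points() reports two code points.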

Create Java Strings from C++ Strings

We are of course dealing with the intersection of Java and C++, so we need to provide an easy-to-use gateway to create Java strings from C/C++ strings and vice versa.

We have tried to make this as easy as possible by providing conversion constructors from several C/C++ string types. In general, you can simply assign a string literal to a proxy String instance:

// from character literal
java::lang::String  str1 = "string value";
// from wide character literal
java::lang::String  str2 = L"string value";
// from std::string
std::string         mStr = "string value";
java::lang::String  str3 = mStr;

This simplicity hides a lot of heavy lifting going on behind the scenes. To create a Java string from native characters we need to know the native string's encoding. If we know that it is UTF-8 encoded we can set the runtime's default encoding to "UTF-8", which will result in a much faster conversion of native characters to a Java string. When no default encoding is set or when it is set to a different value, the runtime must first convert the characters to a Java byte[] and then call one of the String constructors that take a byte[] as input.
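
To see why the encoding has to be known, it helps to look at what a UTF-8 to UTF-16 conversion involves. The following is only an illustrative sketch in standard C++, not the runtime's actual implementation; the function name utf8_to_utf16 is an assumption made for this example, and the decoder is deliberately simplified (it does not reject overlong forms or encoded surrogates).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decode a UTF-8 byte string into UTF-16 code units.
// Simplified sketch: rejects bad lead bytes, bad continuation bytes and
// truncated input, but not overlong forms or encoded surrogates.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point out of range");
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));  // one code unit
        } else {
            cp -= 0x10000;                              // surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

The same bytes decoded under a different assumption (for example a single-byte code page) would yield different characters, which is why the runtime cannot guess the encoding for you.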

If performance matters to you and you can guarantee that your native strings are UTF-8 encoded (or simple ASCII) you should definitely consider setting the default encoding to "UTF-8", for example via:

xmog_jvm_loader & loader = xmog_jvm_loader::get_jvm_loader();
loader.setDefaultEncoding("UTF-8");

You can also do this just for your current thread by calling setDefaultEncoding() on your xmog_localenv pointer, or for all subsequently created Java threads by calling setDefaultEncoding() on your xmog_jvm pointer.

Create C++ Strings from Java Strings

This direction adds a new complication. When we create a Java string from a native string, the native string is only an input that we convert. The resulting Java string is always owned by a proxy object, which disposes of it when the proxy is destroyed. The original native string remains unaffected, and it is the programmer's job to keep track of it.

If we convert a Java string to a native string on the other hand, we are extracting the C characters (wide or regular) from a Java string, either into a pre-allocated buffer or into a newly allocated string. The big question here is: Who's in charge of the buffer or the dynamically allocated string?

The proxy String class provides a family of to_XXX conversion functions that are all implemented internally via their corresponding xmog_java_string::to_XXX functions. The following snippet shows the three most useful ones in use:

java::lang::String  str1 = "str1",  str2 = "str2", str3 = "str3";
char *              pStr1 = str1.to_chars();
char *              pStr2 = str2.to_charsUtf8();
wchar_t *           pStr3 = str3.to_wchars();

Via the xmog_java_string class we provide several utility methods that retrieve the characters into preallocated buffers or into dynamically allocated memory. When you use the xmog_java_string utility class, you are always in charge of the memory that holds the result. If you did not supply a buffer as input, you have to free the returned result by calling xmog_java_string::free() on the returned string.

When you use the generated String proxy type, study its documentation because there are different variants of the conversion methods. All of the character accessors declared by the proxy String type return a pointer to a native string that is owned by the proxy instance. This avoids memory leaks, but it ties the returned characters to the String instance's lifecycle: when the String instance is destroyed, the returned character string becomes invalid. The following example illustrates this problem:

// --- DO NOT DO THIS ---
char *                native_str = NULL;

{
    java::lang::String    str = "test";
    // the (char*) conversion operator returns a native string
    // that is owned by the proxy instance
    native_str = (char*)str;
} // str goes out of scope here and the returned characters get deleted as well

// undefined behavior: might or might not appear to work, depending on the
// allocator being used and whether the native_str memory was reused since
// the cleanup of the 'str' instance
std::cout << native_str << std::endl;

If you want the characters to endure you have to make a copy yourself. A useful pattern is shown below:

// --- THIS WORKS ---
std::string               native_str;

{
    java::lang::String    str = "test";
    // the (char*) conversion operator returns a native string
    // that is owned by the proxy instance, but now it is assigned to a std::string
    // which makes a copy
    native_str = (char*)str;
} // str goes out of scope here and so does the cached copy of the
  // native string pointer, but we have made a copy of it

// this works
std::cout << native_str << std::endl;

Avoid hanging on to the native string unless you really have to: your code will run faster if it does not have to make copies of the strings all the time.
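
When you own the result yourself, as with the xmog_java_string utility functions that allocate memory, a small RAII handle can make sure the release call is never forgotten. The sketch below uses standard C++ only. make_native_copy and release_native are hypothetical stand-ins for an "allocate result" and "free result" pair; the exact signature of xmog_java_string::free() is not shown in this lesson, so adapt the deleter to what the library actually declares.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <memory>

// Hypothetical stand-in for a call that returns a heap-allocated C string
// which the caller must release.
char* make_native_copy(const char* src) {
    assert(src != nullptr);
    const std::size_t n = std::strlen(src) + 1;
    char* p = static_cast<char*>(std::malloc(n));
    if (p != nullptr) {
        std::memcpy(p, src, n);
    }
    return p;
}

// Hypothetical stand-in for the matching release function.
void release_native(char* p) { std::free(p); }

// Owning handle: release_native runs automatically when the handle dies.
using NativeChars = std::unique_ptr<char, void (*)(char*)>;

NativeChars owned_copy(const char* src) {
    return NativeChars(make_native_copy(src), &release_native);
}
```

A caller holds the NativeChars handle for as long as the characters are needed and reads them through get(); no explicit free call appears in the calling code.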

Patterns to Avoid

As just discussed, the proxy String instance owns the buffer in which the native string value is returned and by default every invocation of a to_chars() family function will retrieve a new copy of the string. By necessity it will have to free the previously retrieved string to hold on to the new one. This means that you should not write code like this:

// --- DO NOT DO THIS ---
java::lang::String   str = "test";
std::cout << strcmp( str.to_chars(), str.to_chars() ) << std::endl;

In the above snippet, to_chars() is called twice on the same string instance, and both returned pointers have to remain valid for strcmp to work. Whichever invocation of to_chars() runs second invalidates the result of the other, so strcmp reads freed memory. The behavior is undefined, and the most likely result is a crash.
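
A safe variant copies the first result before requesting the second. The sketch below models the ownership rule with a hypothetical OwningString class written in standard C++; it is a stand-in for the proxy type, not the generated java::lang::String, and only mimics the "each call replaces the previous buffer" behavior described above.

```cpp
#include <cstring>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a proxy string: it owns the buffer it hands
// out and replaces that buffer on every to_chars() call.
class OwningString {
public:
    explicit OwningString(std::string value) : value_(std::move(value)) {}

    const char* to_chars() {
        // reset() frees the previously returned buffer
        cache_.reset(new char[value_.size() + 1]);
        std::memcpy(cache_.get(), value_.c_str(), value_.size() + 1);
        return cache_.get();
    }

private:
    std::string value_;
    std::unique_ptr<char[]> cache_;
};

// Safe comparison: copy the first result before asking for the second.
int compare_safely(OwningString& a, OwningString& b) {
    const std::string first = a.to_chars();  // the copy survives later calls
    return std::strcmp(first.c_str(), b.to_chars());
}
```

The copy costs an allocation, so it is worth doing only where two results really have to be alive at the same time.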

It is also easy to forget that a proxy String is not a native string. Trying to pass a proxy string to a function from the printf() family will not work. Luckily, most modern C++ compilers will give you compiler warnings or errors for this.
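
The fix is to pass the extracted character pointer, never the object itself, to a %s conversion. The following sketch uses a hypothetical ProxyLike struct as a stand-in for a proxy string so that it compiles with standard C++ alone; with the real proxy type you would pass the result of one of its to_XXX functions in the same position.

```cpp
#include <cstdio>
#include <string>

// Hypothetical stand-in for a proxy string type: a class, not a char*.
struct ProxyLike {
    std::string value;
    const char* to_chars() const { return value.c_str(); }
};

std::string format_greeting(const ProxyLike& name) {
    char buf[64];
    // %s expects a const char*, so hand it the extracted characters.
    std::snprintf(buf, sizeof buf, "Hello, %s!", name.to_chars());
    return buf;
}
```

Because the pointer is consumed inside the same full expression, it does not outlive the object that owns it.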

A final word of caution about global String instances. Take a look at the following snippet that is assumed to be at the head of a .cpp file:

#include "java_lang_pkg.h"
#include "java_util_pkg.h"

static String     GLOBAL_STR = "a value";
static Hashtable  GLOBAL_HT = null;

...

At first glance, there's absolutely nothing wrong with this code. Assuming the header files are available, it will compile and link just fine. It might even run fine if the runtime library can locate a default JVM.

The problem is that the GLOBAL_STR initialization triggers the on-demand loading of the JVM in order to translate the native string to a Java string. Once the JVM has been loaded, no further customizations to its settings will have any effect. In particular, the code in your main() function that configures the classpath will be ineffectual because it will be executed after the constructors of global objects have been executed. Global String instances are probably the #1 culprit for unintentional JVM loading and should therefore be avoided.
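
One common way to avoid initialization before main() is the "construct on first use" idiom: wrap the object in a function that holds it as a local static. The sketch below demonstrates the idiom with hypothetical FakeRuntime and FakeString types in standard C++; they only simulate "constructing a string loads the runtime" and are not part of the product's API.

```cpp
#include <string>

// Hypothetical stand-in for a runtime that is loaded on demand.
struct FakeRuntime {
    static bool& loaded() {
        static bool flag = false;
        return flag;
    }
};

// Hypothetical stand-in for a proxy string whose construction
// triggers the on-demand load.
struct FakeString {
    explicit FakeString(const char* s) : value(s) {
        FakeRuntime::loaded() = true;
    }
    std::string value;
};

// Function-local static: constructed on the first call to global_str(),
// not before main() starts.
FakeString& global_str() {
    static FakeString instance("a value");
    return instance;
}
```

With this arrangement, main() can finish its configuration first, and the load is triggered only when global_str() is called for the first time.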