Codemesh Runtime v3 C++ API Reference
3.9.205
Classes
class xmog_java_string: A mixin or utility class for Java string features.
One of the most complicated aspects of Java/C++ integration is how java.lang.String instances are handled. In old-style C++, there was no built-in string type available; every string was either a char* or a const char*. Every class library introduced its own string type, and only when the STL came around did C++ finally get a standard template type to represent strings.
This situation makes dealing with strings very complicated because some people are not using any string abstractions while others are using niche string abstractions. Furthermore, there are huge issues with platform character sets (wide vs. single byte vs. multi byte) and the fact that a Java string is an object whose character string representation is not easily accessible from C++.
Consequently, the runtime supports many different options for strings. The string proxy type is easily the most complicated proxy type in the entire type system. The runtime library supplies a utility type called xmog_java_string which provides static methods for
The first thing you should know about has to do with the default character set (or encoding) that is used by your application. This encoding governs what happens when you write code like this:
All you're supplying is a C-string literal in an unknown character set and you wish to create a Java string instance from it. You can choose to go with the default used by your JRE or you can specify a particular encoding by setting the default encoding that is used by your native application. If you go with the default encoding, you might end up with different results depending on the kinds of characters you used in your string (plain old ASCII characters are always safe) and depending on the JRE you're using (Japanese vs. American vs. localized for example).
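Because plain ASCII characters are safe under any default encoding, it can be useful to check whether a given C string falls into that category. The helper below is a minimal sketch; the name is_plain_ascii is illustrative and is not part of the runtime library:

```cpp
#include <cassert>

// Illustrative helper (not part of the runtime library): returns true if
// every byte of the NUL-terminated string is a 7-bit ASCII character.
bool is_plain_ascii(const char* text) {
    assert(text != nullptr);
    const unsigned char* p = reinterpret_cast<const unsigned char*>(text);
    for (; *p != 0; ++p) {
        if (*p > 0x7F) {
            return false;  // byte outside the ASCII range: encoding-dependent
        }
    }
    return true;
}
```

A string that passes this check converts identically regardless of the default encoding; any other string depends on the encoding in effect.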
If you know that your application only uses UTF-8 strings, you can dramatically improve string-related performance by setting the default encoding to "UTF-8" before you start the JVM. In this case, the runtime library will use a relatively speedy JNI method to create a Java string rather than using a complex, multi-step conversion process.
The default encoding can be set prior to starting the JVM by using the xmog_jvm_loader class, in which case the entire framework will use your specified encoding from the moment the JVM is loaded. Please note that this does not configure what the JVM does internally; the default encoding only governs how native strings are converted to Java strings and vice versa. You can also change the default encoding for a loaded xmog_jvm instance and even for a specific thread via the xmog_localenv type. This can be useful if you want to temporarily override the encoding for a specific activity without affecting the encoding for the entire application. To summarize:
Please see the String proxy type for more information on these constructors and conversion methods.
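The idea of temporarily overriding the encoding for one activity on one thread can be sketched with a scope guard. Everything in this sketch is hypothetical: thread_encoding() and ScopedEncoding are stand-ins that model a per-thread setting, and they are not the xmog_localenv API:

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-in for a per-thread default encoding setting.
inline std::string& thread_encoding() {
    thread_local std::string encoding = "UTF-8";
    return encoding;
}

// Hypothetical scope guard: installs a temporary encoding for the current
// thread and restores the previous one when the guard goes out of scope.
class ScopedEncoding {
public:
    explicit ScopedEncoding(std::string temporary)
        : previous_(thread_encoding()) {
        assert(!temporary.empty());
        thread_encoding() = std::move(temporary);
    }
    ~ScopedEncoding() { thread_encoding() = previous_; }

    ScopedEncoding(const ScopedEncoding&) = delete;
    ScopedEncoding& operator=(const ScopedEncoding&) = delete;

private:
    std::string previous_;
};
```

Inside a block that declares `ScopedEncoding guard("ISO-8859-1");`, the thread sees the temporary encoding; once the block ends, the previous value is back in effect and other threads are never affected.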
Most C++ compilers now also support wide character strings, often referred to as UNICODE strings. The wchar_t type is the character type used for these strings. The runtime library assumes that wchar_t* strings are synonymous with Java UNICODE strings, so there is no special encoding support beyond big-endian and little-endian translations. You should be able to use both single-byte and double-byte (wchar_t) string literals in any place where a Java String instance is expected.
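A big-endian/little-endian translation of 16-bit character data amounts to swapping the two bytes of every code unit. The following is a minimal sketch of that operation, assuming the characters are held as std::uint16_t values; the function names are illustrative and do not come from the runtime library:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Swap the high and low byte of a single 16-bit code unit.
inline std::uint16_t swap_bytes(std::uint16_t unit) {
    return static_cast<std::uint16_t>((unit << 8) | (unit >> 8));
}

// Return a copy of the input with every code unit byte-swapped.
std::vector<std::uint16_t> swap_endianness(const std::vector<std::uint16_t>& units) {
    std::vector<std::uint16_t> result;
    result.reserve(units.size());
    for (std::size_t i = 0; i < units.size(); ++i) {
        result.push_back(swap_bytes(units[i]));
    }
    assert(result.size() == units.size());
    return result;
}
```

Applying the swap twice yields the original data, which is why the same routine serves for both directions of the translation.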
One complication that arises in Microsoft's Visual Studio line of C++ compilers stems from the fact that there are different compilation options regarding the treatment of the wchar_t type. In old versions of the compiler (up to Visual C++ 6.0, compiler version 12), wchar_t was always an emulated type and was really an unsigned short. For a user, the difference between an emulated wchar_t and a built-in wchar_t is pretty much nil. Later versions of the compiler introduced a compiler switch (the /Zc:wchar_t option) which governs whether wchar_t is treated as a built-in type or as an emulated type. Again, for the user, the difference between these settings is pretty much imperceptible. For a vendor who ships a library, though, or for someone who uses third-party libraries, the difference is a very big issue: depending on the version of the compiler and the value of this option, the C++ compiler mangles function names differently. If we had not done something special, you would have to use different runtime library versions for different build settings. In order to avoid this problem and its sibling problem of hard-to-understand link-time or load-time errors, we created overloaded entry points for many string-related functions and we also created a character type macro that you should use for maximum portability. The macro name is XMOG_WCHAR and you use it like this:
As you can see, we cast the string literal to an XMOG_WCHAR*. For many build settings, this cast will be unnecessary, but for some settings, it will avoid build errors.
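The casting pattern can be sketched in a self-contained way with stand-ins. DEMO_WCHAR and demo_length below are hypothetical; they only mirror the shape of the pattern and say nothing about how XMOG_WCHAR is actually defined or which runtime functions accept it:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a portability macro such as XMOG_WCHAR.
// Here it is simply wchar_t; a real macro may resolve to a different
// type depending on the compiler version and build settings.
#define DEMO_WCHAR wchar_t

// Hypothetical function taking a wide string through the macro type.
std::size_t demo_length(const DEMO_WCHAR* text) {
    assert(text != nullptr);
    std::size_t n = 0;
    while (text[n] != 0) {
        ++n;
    }
    return n;
}

// The cast is a no-op with this definition of DEMO_WCHAR, but writing it
// keeps the call site unchanged if the macro resolves to another type.
std::size_t demo_usage() {
    return demo_length((const DEMO_WCHAR*)L"hello");
}
```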
Converting Java strings into native strings is even more complicated than converting native strings into Java strings. When a native string is converted into a Java string, the native string is just used as input and we simply convert it to a Java string. The original native string remains unaffected and it is self-evident that it is the programmer's job to keep track of it.
If we convert a Java string to a native string on the other hand, we are extracting the C characters (wide or regular) from a Java string, either into a pre-allocated buffer or into a newly allocated string. The big question here is: Who's in charge of the buffer or the dynamically allocated string? Via the xmog_java_string class we provide several utility methods that retrieve the characters into preallocated buffers or into dynamically allocated memory. When you use the xmog_java_string utility class, you're always in charge of the memory that holds the result. If you did not supply a buffer as input, you will have to free the returned result by calling xmog_java_string::free() on the returned string. When you're using the generated String proxy type you should study its documentation for more information because there are different variants of the conversion methods. Some variants force you to deal with the returned strings, others allow you to forget about the memory because it is managed by the String proxy instance itself. The latter mechanism is higher performing in the case of multiple accesses to the same instance but it has the disadvantage of tying the returned characters to the String instance's lifecycle: when the String instance is destroyed, the returned character string loses its validity. The following example illustrates this problem:
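The lifecycle issue can be reproduced with a small self-contained analogy. ProxyString below is a hypothetical class that merely models a proxy owning its native characters; it is not the generated String proxy type, and its method names are assumptions made for illustration:

```cpp
#include <cassert>
#include <string>

// Hypothetical model of a proxy that owns its native characters.
class ProxyString {
public:
    explicit ProxyString(const std::string& value) : chars_(value) {}

    // Borrowed pointer: owned by this instance and valid only while
    // the instance is alive.
    const char* borrowed_chars() const { return chars_.c_str(); }

    // Independent copy: the caller owns the result.
    std::string owned_copy() const { return chars_; }

private:
    std::string chars_;
};

// Safe: the characters are copied before the proxy is destroyed.
std::string safe_read() {
    ProxyString proxy("hello");
    std::string copy = proxy.owned_copy();
    assert(copy == proxy.borrowed_chars());  // proxy is still alive here
    return copy;
}

// Unsafe pattern, shown as a comment because executing it is undefined
// behavior:
//
//   const char* p;
//   {
//       ProxyString proxy("hello");
//       p = proxy.borrowed_chars();
//   }              // proxy destroyed here
//   use(p);        // p now dangles
```

The borrowed pointer avoids a copy on every access, which is the performance benefit described above, at the price of never outliving the instance it came from.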
The highest performing character set is definitely the UTF-8 character set. If you know that the strings that you use on the native side will not contain any characters that have to be expressed as multi-byte UTF-8 characters, or, even better, if you know that all natively used strings are UTF-8 strings, you can safely use this encoding and have the best possible conversion performance between the two sides. The easiest way to configure the runtime to use the UTF-8 character set is to configure it before the JVM is started:
Once the JVM gets loaded, all newly attached threads will inherit the UTF-8 default encoding for transformations between Java and platform strings.
But there is more to string performance than simply converting quickly between the two representations. Java strings are immutable objects, so once they are created, their contents never change. This means that we can in many circumstances cache the value of a Java string on the native side. Caching a string value can make a huge difference if we extract the native string from the same proxy instance more than once, and it makes no performance difference at all if we don't (actually, it has the potential to hurt performance a little bit if we have many String instances sitting around containing large, cached string values). You can theoretically request the native string for a Java string in more than one encoding or character set, but the caching facility does not support this use case. Once a string value has been cached, it will continue to be used, even if you're asking for it in a different encoding.
Another important characteristic of cached strings is that their validity is tied to the lifecycle of the proxy string instance. If the proxy string instance is destroyed, the cached value is destroyed with it. This means that there is no caching taking place with temporary objects, as the following snippet illustrates:
Here, the characters are retrieved twice from the Java side because the conversion occurs on the temporary String instance created by the toUpperCase() call and not on the temp instance. The following snippet on the other hand takes full advantage of native string caching:
and will perform much better.
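The difference between the two patterns can be modeled with a self-contained sketch that counts conversions. ProxyString, to_upper() and chars() are hypothetical stand-ins for the generated String proxy and its methods, and the counter simulates a round trip to the Java side:

```cpp
#include <cassert>
#include <cctype>
#include <cstring>
#include <string>
#include <utility>

static int g_conversions = 0;  // simulated Java-to-native conversions

// Hypothetical model of a proxy that caches its native characters.
class ProxyString {
public:
    explicit ProxyString(std::string value) : value_(std::move(value)) {}

    // Returns a new proxy; its cache starts out empty.
    ProxyString to_upper() const {
        std::string upper = value_;
        for (char& c : upper) {
            c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
        }
        return ProxyString(upper);
    }

    // Converts on first access, then serves the cached value.
    const char* chars() const {
        if (!cached_) {
            cache_ = value_;
            cached_ = true;
            ++g_conversions;
        }
        return cache_.c_str();
    }

private:
    std::string value_;
    mutable std::string cache_;
    mutable bool cached_ = false;
};

// Each call to to_upper() yields a fresh temporary, so nothing is reused.
int conversions_with_temporaries() {
    g_conversions = 0;
    ProxyString temp("hello");
    std::size_t a = std::strlen(temp.to_upper().chars());
    std::size_t b = std::strlen(temp.to_upper().chars());
    assert(a == b);
    return g_conversions;  // 2
}

// Keeping the result in a named instance lets the second access hit the cache.
int conversions_with_named_instance() {
    g_conversions = 0;
    ProxyString upper = ProxyString("hello").to_upper();
    std::size_t a = std::strlen(upper.chars());
    std::size_t b = std::strlen(upper.chars());
    assert(a == b);
    return g_conversions;  // 1
}
```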
Take care when using global String variables in your code. You will sometimes be tempted to write code as in the following snippet:
In all likelihood, this program will not work totally as expected. You will either find that the program exits with an exception or that it uses totally different configuration information than you expected based on the contents of your configuration file.
The reason for this behavior is that the JVM is not loaded when you think it is loaded. By the time the program reaches main(), the C++ runtime will already have initialized global variables, including globalVar. Because globalVar is initialized with a string literal, the JVM will be demand-loaded at that time to perform a conversion of "my global string" to the corresponding Java string. So by the time you get a chance to set your configuration preferences, all the action has already happened, and once loaded, a JVM cannot simply be thrown away again and replaced with another one that is initialized differently.
This is of course not a problem that is specific to String instances, but it is most easily overlooked with String instances because all we're seeing is a string literal assignment.
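The ordering problem can be demonstrated without a JVM. In the sketch below, FakeRuntime and FakeString are hypothetical stand-ins: constructing a FakeString "loads" the runtime on demand, and configuration is rejected once loading has happened. None of these names belong to the runtime library:

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a runtime that can only be configured
// before it has been loaded.
struct FakeRuntime {
    bool loaded = false;
    std::string encoding = "platform-default";

    bool configure(const std::string& requested) {
        if (loaded) {
            return false;  // too late: already loaded with old settings
        }
        encoding = requested;
        return true;
    }
    void load_on_demand() { loaded = true; }
};

// Function-local static, so the runtime object exists whenever it is
// first needed, even during static initialization.
inline FakeRuntime& runtime() {
    static FakeRuntime instance;
    return instance;
}

// Hypothetical stand-in for a proxy string: construction from a literal
// triggers demand-loading of the runtime.
struct FakeString {
    explicit FakeString(const char* literal) : text(literal) {
        assert(literal != nullptr);
        runtime().load_on_demand();
    }
    std::string text;
};

// Constructed before main() runs, so the runtime is loaded before any
// code in main() gets a chance to configure it.
FakeString globalVar("my global string");

// One possible way to defer construction: a function-local static is
// only built on the first call, which can happen after configuration.
FakeString& lazy_global() {
    static FakeString instance("my global string");
    return instance;
}
```

With the global in place, a later call to `runtime().configure("UTF-8")` returns false. Replacing the global with an accessor such as lazy_global() is one way to postpone the first conversion until configuration has been done.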