
String Handling


Classes

class  xmog_java_string
 A mixin or utility class for Java string features. More...

Detailed Description

Overview

One of the hardest parts of the Java/C++ integration problem is how java.lang.String instances are handled. Old-style C++ had no built-in string type; every string was either a char* or a const char*. Every class library started introducing its own string types, and only when the STL came around did C++ finally get a template type to represent strings.

This situation makes dealing with strings very complicated because some code uses no string abstraction at all while other code uses niche string abstractions. Further complications come from platform character sets (wide vs. single-byte vs. multi-byte) and from the fact that a Java string is an object whose character data is not easily accessible from C++.

Consequently, the runtime supports many different options for strings. The string proxy type is easily the most complicated proxy type in the entire type system. The runtime library supplies a utility type called xmog_java_string which provides static methods for converting between Java strings and native character strings.

Single-byte and multi-byte native strings

The first thing you should know about has to do with the default character set (or encoding) that is used by your application. This encoding governs what happens when you write code like this:

      java::lang::String        temp = "This is a test!";

All you're supplying is a C-string literal in an unknown character set and you wish to create a Java string instance from it. You can choose to go with the default used by your JRE or you can specify a particular encoding by setting the default encoding that is used by your native application. If you go with the default encoding, you might end up with different results depending on the kinds of characters you used in your string (plain old ASCII characters are always safe) and depending on the JRE you're using (Japanese vs. American vs. localized for example).

If you know that your application only uses UTF-8 strings, you can dramatically improve string-related performance by setting the default encoding to "UTF-8" before you start the JVM. In this case, the runtime library will use a relatively speedy JNI method to create a Java string rather than using a complex, multi-step conversion process.

The default encoding can be set prior to starting the JVM by using the xmog_jvm_loader class, in which case the entire framework will use your specified encoding from the moment the JVM is loaded. Please note that we're not configuring what the JVM does internally; the default encoding only governs how native strings are converted to Java strings and vice versa. You can also change the default encoding for a loaded xmog_jvm instance and even for a specific thread via the xmog_localenv type. This can be useful if you want to temporarily override the encoding for a specific activity without affecting the encoding for the entire application.

Please see the String proxy type for more information on these constructors and conversion methods.

Wide strings

Most C++ compilers now also support wide character strings, often referred to as Unicode strings. The wchar_t type is the character type used for these strings. The runtime library treats wchar_t* strings as synonymous with Java Unicode strings, so there is no special encoding support beyond big-endian and little-endian translations. You should be able to use both single-byte and double-byte (wchar_t) string literals in any place where a Java String instance is expected.

One complication arises in Microsoft's Visual Studio line of C++ compilers from the different compilation options regarding the treatment of the wchar_t type. In old versions of the compiler (up to Visual C++ 6.0, compiler version 12), wchar_t was always an emulated type that was really an unsigned short. For a user, the difference between an emulated wchar_t and a built-in wchar_t is pretty much nil. Later versions of the compiler introduced a switch that governs whether wchar_t is treated as a built-in type or as an emulated type (the /Zc:wchar_t option). Again, for the user, the difference between these settings is pretty much imperceptible. For a vendor who ships a library, though, or for someone who uses third-party libraries, the difference is a very big issue: depending on the version of the compiler and the value of this option, the C++ compiler mangles function names differently. Had we not done something special, you would have to use different runtime library versions for different build settings. To avoid this problem, and its sibling problem of hard-to-understand link- or load-time errors, we created overloaded entry points for many string-related functions and we also created a character type macro that you should use for maximum portability. The macro name is XMOG_WCHAR and you use it like this:

      XMOG_WCHAR *                native_wstr = NULL;
   
      {
        java::lang::String    str = (XMOG_WCHAR*)L"test";
        native_wstr = str;
        
        wcout << native_wstr << endl;
      } // str goes out of scope here and the returned characters get deleted as well

As you can see, we cast the string literal to a XMOG_WCHAR*. For many build settings, this cast will be unnecessary, but for some settings, it will avoid build errors.

Converting Java Strings to native strings

Converting Java strings to native strings is even more complicated than converting native strings into Java strings. In the native-to-Java direction, the native string is just used as input and is simply converted to a Java string. The original native string remains unaffected, and it is self-evidently the programmer's job to keep track of it.

If we convert a Java string to a native string, on the other hand, we are extracting the C characters (wide or regular) from a Java string, either into a pre-allocated buffer or into a newly allocated string. The big question here is: who is in charge of the buffer or the dynamically allocated string?

Via the xmog_java_string class we provide several utility methods that retrieve the characters into preallocated buffers or into dynamically allocated memory. When you use the xmog_java_string utility class, you are always in charge of the memory that holds the result. If you did not supply a buffer as input, you will have to free the returned result by calling xmog_java_string::free() on the returned string.

When you're using the generated String proxy type, you should study its documentation for more information because there are different variants of the conversion methods. Some variants force you to deal with the returned strings; others allow you to forget about the memory because it is managed by the String proxy instance itself. The latter mechanism performs better when the same instance is accessed multiple times, but it has the disadvantage of tying the returned characters to the String instance's lifecycle: when the String instance is destroyed, the returned character string loses its validity. The following example illustrates this problem:

      // --- DO NOT DO THIS ---
      char *                native_str = NULL;
   
      {
        java::lang::String    str = "test";
        native_str = (char*)str;
      } // str goes out of scope here and the returned characters get deleted as well
   
      // might or might not work, depending on whether the native_str memory was reused 
      // since the cleanup of the 'str' instance
      cout << native_str << endl;

Performance

The highest performing character set is UTF-8. If you know that the strings you use on the native side will not contain any characters that have to be expressed as multi-byte UTF-8 sequences, or, even better, if you know that all natively used strings are UTF-8 strings, you can safely use this encoding and get the best possible conversion performance between the two sides. The easiest way to configure the runtime to use the UTF-8 character set is to configure it before the JVM is started:

      xmog_jvm_loader & loader = xmog_jvm_loader::get_jvm_loader( ... );
   
      loader.setDefaultEncoding( "UTF-8" );

Once the JVM gets loaded, all newly attached threads will inherit the UTF-8 default encoding for transformations between Java and platform strings.

But there is more to string performance than simply converting quickly between the two representations. Java strings are immutable objects, so once they are created, their contents never change. This means that in many circumstances we can cache the value of a Java string on the native side. Caching a string value can make a huge difference if we extract the native string from the same proxy instance more than once; it makes no performance difference at all if we don't (in fact, it can hurt performance a little if many String instances sit around holding large, cached string values). You can theoretically request the native string for a Java string in more than one encoding or character set, but the caching facility does not support this use case. Once a string value has been cached, it will continue to be used, even if you ask for it in a different encoding.

Another important characteristic of cached strings is that their validity is tied to the lifecycle of the proxy string instance. If the proxy string instance is destroyed, the cached value is destroyed with it. This means that no caching takes place with temporary objects, as the following snippet illustrates:

      java::lang::String    temp = "temp";
   
      cout << temp.toUpperCase() << temp.toUpperCase() << endl;

Here, the characters are retrieved twice from the Java side because the conversion occurs on the temporary String instance created by the toUpperCase() call and not on the temp instance. The following snippet on the other hand takes full advantage of native string caching:

      java::lang::String    temp = "temp";
      java::lang::String    ucTemp = temp.toUpperCase();
   
      cout << ucTemp << ucTemp << endl;

and will perform much better.

Notes

Take care when using global String variables in your code. You will sometimes be tempted to write code as in the following snippet:

      #include "java_lang.h"
   
      const java::lang::String    globalVar = "my global string";
   
      int main()
      {
          xmog_jvm_loader &   loader = xmog_jvm_loader::get_jvm_loader( "c:\\temp\\myconfig.dat" );
   
          loader.load();
   
          printf( "%s\n", (char*)globalVar );
   
          return 0;
      }

In all likelihood, this program will not work as expected. You will either find that the program exits with an exception or that it uses totally different configuration information than you expected based on the contents of your configuration file.

The reason for this behavior is that the JVM is not loaded when you think it is loaded. By the time the program reaches main(), the C++ runtime will already have initialized global variables, including globalVar. Because globalVar is initialized with a string literal, the JVM will be demand-loaded at that time to perform a conversion of "my global string" to the corresponding Java string. So by the time you get a chance to set your configuration preferences, all the action has already happened and once loaded, a JVM cannot simply be thrown away again and replaced with another one that is initialized differently.

This is of course not a problem that is specific to String instances, but it is most easily overlooked with String instances because all we're seeing is a string literal assignment.


Generated on Wed May 31 14:01:35 2006 for Shared Codemesh Runtime Library API Reference by  doxygen 1.4.1