Update:
I've decided that there is no guaranteed way to do this. The solution I present below works for the English version of VC2003, but fails when compiling with the Japanese version of VC2003 (or perhaps it is the Japanese OS). In any case, it cannot be depended on to work. Note that even declaring everything as L"" strings didn't work (and is painful in gcc, as described below).
Instead, I believe that you just need to bite the bullet and move all text into a data file and load it from there. I am now storing and accessing the text in INI files via SimpleIni (a cross-platform INI-file library). At least it is guaranteed to work, since all of the text is out of the program.
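For illustration, here is a minimal sketch of that approach using SimpleIni; the file name, section, and key used below are hypothetical, made up for the example.

#include <string>
#include "SimpleIni.h"

// Load a UTF-8 string from an INI file instead of embedding it in the source.
// "strings.ini", the [text] section, and the "chinese_traditional" key are
// hypothetical names for this example.
std::string LoadText() {
    CSimpleIniA ini;
    ini.SetUnicode();                    // treat the file content as UTF-8
    if (ini.LoadFile("strings.ini") < 0)
        return std::string();            // file missing or unreadable
    return ini.GetValue("text", "chinese_traditional", "");
}

Since the INI file itself is saved as UTF-8, the loaded text is exactly the bytes you wrote, with no compiler in the way.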
Original:
I'm answering this myself since only Evan appeared to understand the problem. The answers about what Unicode is and how to use wchar_t are not relevant here, as this is not about internationalization or a misunderstanding of Unicode and character encodings. I appreciate your attempts to help though; apologies if I wasn't clear enough.
The problem is that I have source files that need to be cross-compiled under a variety of platforms and compilers. The program does UTF-8 processing; it doesn't care about any other encodings. I want to have string literals in UTF-8, as currently works with gcc and VC2003. How do I do it with VC2008 (i.e. a backward-compatible solution)?
This is what I have found (a quick check to verify this on your own compiler follows the list):
gcc (v4.3.2 20081105):
- string literals are used as is (raw strings)
- supports UTF-8 encoded source files
- source files must not have a UTF-8 BOM
vc2003:
- string literals are used as is (raw strings)
- supports UTF-8 encoded source files
- source files may or may not have a UTF-8 BOM (it doesn't matter)
vc2005+:
- string literals are massaged by the compiler (no raw strings)
- char string literals are re-encoded to a specified locale
- UTF-8 is not supported as a target locale
- source files must have a UTF-8 BOM
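One quick way to verify this behaviour on your own compiler (a minimal sketch, assuming the file is saved as UTF-8) is to dump the bytes of a non-ASCII literal:

#include <cstdio>

int main() {
    const char * s = "é";  // U+00E9; its UTF-8 encoding is the two bytes C3 A9
    for (const char * p = s; *p; ++p)
        printf("%02X ", (unsigned char)*p);
    printf("\n");
    return 0;
}

If the output is C3 A9, the literal survived as raw UTF-8 bytes; a compiler that re-encodes literals to the local code page will print something else (e.g. a single E9 on a Windows-1252 system when the source has a BOM).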
So, the simple answer is that for this particular purpose, VC2005+ is broken and does not supply a backward-compatible compile path. The only way to get Unicode strings into the compiled program is via UTF-8 source + BOM + wchar_t literals, which means that I need to convert all strings back to UTF-8 at the time of use.
There isn't any simple cross-platform method of converting wchar_t to UTF-8; for instance, what size and encoding does a wchar_t use? On Windows it is UTF-16; on other platforms it varies. See the ICU project for some details.
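To illustrate why, here is a rough sketch of what a hand-rolled conversion has to deal with. It assumes a 2-byte wchar_t holds UTF-16 and a 4-byte wchar_t holds UTF-32, which is common but not guaranteed on every platform, and it skips validation of malformed input:

#include <string>

std::string WideToUtf8(const wchar_t * pStr) {
    std::string sOut;
    while (*pStr) {
        unsigned long cp = (unsigned long)*pStr++;
        // with a 2-byte wchar_t, combine a UTF-16 surrogate pair first
        if (sizeof(wchar_t) == 2 && cp >= 0xD800 && cp <= 0xDBFF && *pStr) {
            cp = 0x10000 + ((cp - 0xD800) << 10)
                         + (((unsigned long)*pStr++) - 0xDC00);
        }
        // encode the code point as 1 to 4 UTF-8 bytes
        if (cp < 0x80) {
            sOut += (char)cp;
        } else if (cp < 0x800) {
            sOut += (char)(0xC0 | (cp >> 6));
            sOut += (char)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            sOut += (char)(0xE0 | (cp >> 12));
            sOut += (char)(0x80 | ((cp >> 6) & 0x3F));
            sOut += (char)(0x80 | (cp & 0x3F));
        } else {
            sOut += (char)(0xF0 | (cp >> 18));
            sOut += (char)(0x80 | ((cp >> 12) & 0x3F));
            sOut += (char)(0x80 | ((cp >> 6) & 0x3F));
            sOut += (char)(0x80 | (cp & 0x3F));
        }
    }
    return sOut;
}

A real implementation also has to handle lone surrogates and platforms where wchar_t is not a Unicode encoding at all, which is why libraries like ICU exist.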
In the end I decided that I will avoid the conversion cost on all compilers other than vc2005+ with source like the following.
#if defined(_MSC_VER) && _MSC_VER > 1310
// Visual C++ 2005 and later require the source files to be in UTF-8, and all
// strings to be encoded as wchar_t, otherwise the strings will be converted
// into the local multibyte encoding and cause errors. To use a wchar_t string
// as UTF-8, it then needs to be converted back to UTF-8. This function is
// just a rough example of how to do this.
# include <windows.h>
# define utf8(str) ConvertToUTF8(L##str)
const char * ConvertToUTF8(const wchar_t * pStr) {
    static char szBuf[1024];
    WideCharToMultiByte(CP_UTF8, 0, pStr, -1, szBuf, sizeof(szBuf), NULL, NULL);
    return szBuf;
}
#else
// Visual C++ 2003 and gcc will use the string literals as is, so the files
// should be saved as UTF-8. gcc requires the files to not have a UTF-8 BOM.
# define utf8(str) str
#endif
Note that this code is just a simplified example. Production use would need to clean it up in a variety of ways (thread safety, error checking, buffer size checks, etc.); a sketch of one such cleanup follows.
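For example, a more defensive variant (only a sketch) might size the buffer with a first call and return an owned std::string, so there is no shared static state:

#include <string>
#include <vector>
#include <windows.h>

// Ask WideCharToMultiByte for the required length first, convert into an
// owned buffer, and return an empty string on failure. No static buffer,
// so it is safe to call from multiple threads.
std::string ConvertToUTF8(const wchar_t * pStr) {
    int nLen = WideCharToMultiByte(CP_UTF8, 0, pStr, -1, NULL, 0, NULL, NULL);
    if (nLen <= 0)
        return std::string();
    std::vector<char> buf(nLen);   // nLen includes the terminating NUL
    WideCharToMultiByte(CP_UTF8, 0, pStr, -1, &buf[0], nLen, NULL, NULL);
    return std::string(&buf[0]);
}

Returning std::string costs an allocation per call, but the result can be stored directly and there is no 1024-byte truncation limit.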
This is used like the following code. It compiles cleanly and works correctly in my tests on gcc, vc2003, and vc2008:
std::string mText;
mText = utf8("Chinese (Traditional)");
mText = utf8("中国語 (繁体)");
mText = utf8("??? (??)");
mText = utf8("Chinês (Tradicional)");