utf 8 - How to work with UTF-8 in C++, Conversion from other Encodings to UTF-8

posted Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I don't know how to solve that:

Imagine, we have 4 websites:

A: UTF-8
B: ISO-8859-1
C: ASCII
D: UTF-16

My program, written in C++, downloads a website and parses it, but it also has to understand the content. The parsing itself is not the problem, since the markup is delimited by ASCII characters like ">" or "<".

The problem is that the program should extract all words from the website's text. A word is any combination of alphanumeric characters. I then send these words to a server. The database and the web frontend both use UTF-8. So my questions are:

How can I convert "any" (or the most used) character encoding to UTF-8?
How can I work with UTF-8-strings in C++? I think wchar_t does not work because it is 2 bytes long. Code-Points in UTF-8 are up to 4 bytes long...
Are there functions like isspace(), isalnum(), strlen(), tolower() for such UTF-8-strings?

Please note: I do not do any output(like std::cout) in C++. Just filtering out the words and send them to the server.

I know about UTF8-CPP but it has no is*() functions. And as I read, it does not convert from other character encodings to UTF-8. Only from UTF-* to UTF-8.

Edit: I forgot to say, that the program has to be portable: Windows, Linux, ...

1 Reply

深蓝 · Answer 1 · 2021-10-23T17:52:32+0000

How can I convert "any" (or the most used) character encoding to UTF-8?

ICU (International Components for Unicode) is the solution here. It is generally considered the last word in Unicode support; even Boost.Locale and Boost.Regex use it for their Unicode handling. See my comment on Dory Zidon's answer as to why I recommend using ICU directly instead of through wrappers (like Boost).

You create a converter for a given encoding...

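#include <unicode/ucnv.h>

UErrorCode err = U_ZERO_ERROR;
// Open a converter for the source encoding (here ISO-8859-1).
UConverter * converter = ucnv_open( "ISO-8859-1", &err );
if ( U_SUCCESS( err ) )
{
    // ... convert the downloaded bytes here ...
    ucnv_close( converter );
}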

...and then use the UnicodeString class as appropriate.
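To make concrete what such a converter does, here is a minimal hand-written sketch for the one case where the mapping is trivial: every ISO-8859-1 byte is numerically equal to its Unicode code point, so the conversion only needs a UTF-8 encoder. The names append_utf8 and latin1_to_utf8 are made up for this illustration and are not part of ICU; for any other source encoding you would let an ICU converter do the work.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Append one Unicode code point to `out` as UTF-8 (1 to 4 bytes).
// Hypothetical helper for illustration, not an ICU function.
void append_utf8(std::string& out, std::uint32_t cp) {
    // Only valid scalar values: up to U+10FFFF, excluding surrogates.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Convert ISO-8859-1 bytes to UTF-8. Each input byte is its own code point.
std::string latin1_to_utf8(const std::string& latin1) {
    std::string out;
    out.reserve(latin1.size() * 2);  // worst case: every byte becomes two
    for (unsigned char byte : latin1) {
        append_utf8(out, byte);
    }
    return out;
}
```

For example, latin1_to_utf8("caf\xE9") turns the single byte E9 (é) into the two bytes C3 A9, while plain ASCII input passes through unchanged.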

I think wchar_t does not work because it is 2 bytes long.

The size of wchar_t is implementation-defined. As far as I recall, it is 2 bytes on Windows (UCS-2 / UTF-16, depending on Windows version) and 4 bytes on Linux (UTF-32). In any case, since the standard doesn't define Unicode semantics for wchar_t, using it is non-portable guesswork. Don't guess, use ICU.
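If you want to see what your own compiler does, a small compile-time check like the following can help. This is only a sketch: bits_in and holds_any_code_point are names invented here, and the 21-bit threshold comes from the largest code point, U+10FFFF. The standard char32_t type (C++11) is guaranteed to be wide enough, whereas wchar_t may or may not be.

```cpp
#include <climits>
#include <cstddef>

// Number of bits in one unit of the given character type on this platform.
template <typename CharT>
constexpr std::size_t bits_in() {
    return sizeof(CharT) * CHAR_BIT;
}

// True if a single CharT can hold any code point up to U+10FFFF (21 bits).
template <typename CharT>
constexpr bool holds_any_code_point() {
    return bits_in<CharT>() >= 21;
}

// Guaranteed by the standard: char32_t has at least 32 bits.
static_assert(holds_any_code_point<char32_t>(),
              "char32_t can hold any Unicode code point");

// Not guaranteed: holds_any_code_point<wchar_t>() depends on the platform,
// which is why portable code should not rely on wchar_t for Unicode.
```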

Are there functions like isspace(), isalnum(), strlen(), tolower() for such UTF-8-strings?

Not for strings in their UTF-8 encoding, but you don't use that internally anyway. UTF-8 is good for external representation, but internally UTF-16 or UTF-32 are the better choice. The above-mentioned functions do exist for Unicode code points (i.e., UChar32), for example u_isspace(), u_isalnum() and u_tolower(); ref. uchar.h.
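To illustrate why byte-oriented functions like strlen() give the wrong answer on UTF-8, here is a rough sketch of decoding UTF-8 into UTF-32 code points with nothing but the standard library. decode_utf8 is a hypothetical helper written for this post, not an ICU function, and replacing malformed input with U+FFFD is my own choice here. With ICU you would not write this yourself; you would convert to a UnicodeString and call the uchar.h functions on each code point.

```cpp
#include <cstddef>
#include <string>

// Decode UTF-8 bytes into UTF-32 code points.
// Each malformed byte is replaced by U+FFFD (an assumption of this sketch).
std::u32string decode_utf8(const std::string& in) {
    const char32_t replacement = 0xFFFD;
    // Smallest code point that legitimately needs 1, 2, 3 or 4 bytes.
    static const char32_t min_cp[5] = {0, 0, 0x80, 0x800, 0x10000};

    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;  // total bytes in this sequence
        char32_t cp = 0;      // payload bits collected so far
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else { out += replacement; ++i; continue; }  // invalid lead byte

        bool ok = (i + len <= in.size());  // sequence must not be truncated
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;  // not a continuation byte
            else cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, surrogates and values above U+10FFFF.
        if (ok && (cp < min_cp[len] || cp > 0x10FFFF ||
                   (cp >= 0xD800 && cp <= 0xDFFF))) {
            ok = false;
        }
        if (ok) { out += cp; i += len; }
        else    { out += replacement; ++i; }
    }
    return out;
}
```

The string "caf\xC3\xA9" (café in UTF-8) has 5 bytes, so strlen() reports 5, but decode_utf8 returns 4 code points, the last being U+00E9. Character classification and case mapping only make sense on those decoded code points.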

Please note: I do not do any output(like std::cout) in C++. Just filtering out the words and send them to the server.

Check ICU's BreakIterator, which finds word boundaries in text.

Edit: I forgot to say, that the program has to be portable: Windows, Linux, ...

In case I haven't said it already, do use ICU, and save yourself tons of trouble. Even if it might seem a bit heavyweight at first glance, it is the best implementation out there, it is extremely portable (using it on Windows, Linux, and AIX myself), and you will use it again and again and again in projects to come, so time invested in learning its API is not wasted.
