amazon s3 - S3 -> Redshift cannot handle UTF8

Question

1 Reply

深蓝 · Answer 1 · 2021-10-23T18:20:45+0000

tl;dr

The byte length of your VARCHAR column just needs to be larger.

Detail

Multi-byte (UTF-8) characters are supported in the VARCHAR data type; however, the length you declare is in bytes, NOT characters.

AWS documentation for Multibyte Character Load Errors states the following:

VARCHAR columns accept multibyte UTF-8 characters, to a maximum of four bytes.

Therefore, if you want the character Ä to be allowed, you need to allow 2 bytes for it instead of 1 byte.
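To see the difference between character count and byte count for yourself, here is a minimal Python sketch. The helper name `utf8_byte_length` and the sample characters are my own choices for illustration, not anything from Redshift or the AWS docs.

```python
def utf8_byte_length(text: str) -> int:
    """Return the number of bytes `text` occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


# Sample characters chosen for illustration: 1-, 2-, 3- and 4-byte encodings.
samples = ["A", "Ä", "€", "😀"]

for s in samples:
    # len(s) counts characters (code points); utf8_byte_length(s) counts bytes.
    print(repr(s), "characters:", len(s), "bytes:", utf8_byte_length(s))
```

Each sample is a single character, but the byte counts are 1, 2, 3 and 4 respectively, and the byte count is what a VARCHAR length has to cover.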

AWS documentation for VARCHAR or CHARACTER VARYING states the following:

... so a VARCHAR(120) column consists of a maximum of 120 single-byte characters, 60 two-byte characters, 40 three-byte characters, or 30 four-byte characters.
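If you want to pick a column size before loading, one option is to measure the data first. The following is a rough sketch, assuming you can read the values of the column into Python; `required_varchar_bytes` and `fits_in_varchar` are hypothetical helper names, not part of any AWS tooling.

```python
from typing import Iterable


def required_varchar_bytes(values: Iterable[str]) -> int:
    """Return the largest UTF-8 byte length among `values` (0 if empty)."""
    return max((len(v.encode("utf-8")) for v in values), default=0)


def fits_in_varchar(value: str, declared_bytes: int) -> bool:
    """Check whether `value` fits a VARCHAR declared with `declared_bytes`."""
    return len(value.encode("utf-8")) <= declared_bytes


# Illustrative data only: 60 two-byte characters fill VARCHAR(120) exactly.
print(fits_in_varchar("Ä" * 60, 120))   # 120 bytes
print(fits_in_varchar("Ä" * 61, 120))   # 122 bytes
print(required_varchar_bytes(["abc", "Äbc", "€uro"]))
```

The first check passes and the second fails, which matches the "60 two-byte characters" figure quoted from the documentation. In practice you would probably add some headroom on top of the measured maximum.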

For a list of UTF-8 characters and their byte lengths, this is a good reference: Complete Character List for UTF-8

Detailed information for the Unicode Character 'LATIN CAPITAL LETTER A WITH DIAERESIS' (U+00C4) can be found here.
