Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
568 views
in Technique[技术] by (71.8m points)

unicode - Encoding binary data within XML: Are there better alternatives than base64?

I want to encode and decode binary data within an XML file (with Python, but whatever). I have to face the fact that an XML tag content has illegal characters. The only allowed ones are described in XML specs:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

Which means that the unallowed are:

  • 29 Unicode control characters are illegal (0x00 - 0x20) ie (000xxxxx) except 0x09, 0x0A, 0x0D
  • Any Unicode character representation above 2 bytes (UTF-16+) is illegal (U+D800 - U+DFFF) ie (11011xxx)
  • The special Unicode noncharacters are illegal (0xFFFE - 0xFFFF) ie (11111111 1111111x)
  • <, >, & according to this post for entities content

1 byte can encode 256 possibles. With these restrictions the first byte is limited to 256-29-8-1-3 = 215 possiblities.

Of that first bytes's 215 possibilites, base64 only uses 64 possibilites. Base64 generates 33% overhead (6 bits becomes 1 byte once encoded with base64).

So my question is simple: Is there an algorithm more efficient than base64 to encode binary data within XML? If not, where should we start to create it? (libraries, etc.)

NB: You wouldn't answer this post by "You shouldn't use XML to encode binary data because...". Just don't. You could at best argue why not to use the 215 possibilities for bad XML parser's support.

NB2: I'm not speaking about the second byte but there are certainly some considerations that wa can develop regarding the number of posibilities and the fact it should start by 10xxxxxx to respect UTF8 standard when we use the supplementary Unicode planes (what if not?).

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Thank you Aya for the Asci85 link, there are very good ideas.

I developed them below for our case.


UTF-8 characters possibilites:


For 1 byte characters (0xxxxxxx): 96 possibilites per byte

  • + UTF-8 ASCII chars 0xxxxxxx = +2^7
  • - UTF-8 Control chars 000xxxxx = -2^5
  • + XML allowed UTF-8 control chars (00000009, 0000000A, 0000000D) = +3
  • - XML entity unallowed chars (<, >, &) = -3

EDIT: This is for XML1.0 specs. XML 1.1 specs allows the use of control chars except 0x00...

For 2-bytes characters (110xxxxx 10xxxxxx): 1920 possibilites per 2 bytes

  • + UTF-8 2-byte chars 110xxxxx 10xxxxxx = +2^11
  • - UTF-8 illegal non-canonical chars (1100000x 10xxxxxx) = -2^7

For 3-bytes characters (1110xxxx 10xxxxxx 10xxxxxx): 61440 possibilites per 3 bytes

  • + UTF-8 3-byte chars 1110xxxx 10xxxxxx 10xxxxxx = +2^16
  • - UTF-8 illegal non-canonical chars (11100000 100xxxxx 10xxxxxx) = -2^11
  • - Unicode reserved UTF-16 codepoints (11101101 101xxxxx 10xxxxxx) = -2^11

And I won't make the calculation for 4-bytes characters, that's pointless: the number of possibles would be negligible and there are too many illegal UTF-8 charactrs in this range.


The coding possibilites in let's say a 3-byte space


So let's see what combinations we can do on a 3-byte (24bit) space:

  • 0xxxxxxx 0xxxxxxx 0xxxxxxx : That's 96*96*96 = 884736 possibilities
  • 0xxxxxxx 110xxxxx 10xxxxxx : That's 96*1920 = 184320 possibilities
  • 110xxxxx 10xxxxxx 0xxxxxxx : That's 1920*96 = 184320 possibilities
  • 1110xxxx 10xxxxxx 10xxxxxx : That's 61440 = 61440 possibilities

There would be other possibilites (like a 3-bytes char ending or starting in the space but like 4-bytes chars, that would be difficult to evaluate (for me) and probably negligible).

Total number of possibilities:

  • A 3-byte space has 2^24 = 16777216 possibilities.
  • UTF-8 COMPATIBLE possibilites in that space is 884736+2*184320+61440 = 1314816 possibilities.

How much overhead does that mean?

  • 24-bit space usable bits : Log2(16777216)=24 (of course! that's for the math understanding)
  • This space's UTF-8 useful bits: Log2(1314816)=20,32 useful bits.
  • That means we need 24 bits of space to encode 20,32 bits of useful information, ie. the minimum theorical overhead is 18% overhead. Way better than Base64's 33% overhead and Ascii85's 25% overhead!

EDIT: This is for XML1.0 specs. With XML1.1 (not widely supported...), the theorical overhead is 12.55%. I managed to make a binary safe algorithm with 14.7% overhead for XML1.1.


How to get near this 18% overhead?


The bad news is that we can't get easily that 18% ovearhead without using a big "dictionnary" (ie long enconding sets). But it's easy to get 20%, and quite easy but less practical to get 19%.

Good coding lengths candidates:

  • 6 bits can encode 5 bits with 20% overhead (2^(6*0,84) > 2^5)
  • 12 bits can encode 10 bits with 20% overhead (2^(12*0,84) > 2^10)
  • 24 bits can encode 20 bits with 20% overhead (2^(24*0,84) > 2^20)
  • 25 bits can encode 21 bits with 19% overhead (2^(25*0,84) > 2^21)

NB: 0,84 is the average "usefulness" of a space bit (20,32/24).


How to build our encoding algorithm?


We need to build a "dictionnary" that will map the "space possibles" (randoms sequence of bits whose length is 5, 10, 20 or 21 bits depending on the chosen coding length for the algorithm - just choose one) into the utf8-compatible sequences (utf8 bit sequence whose length is 6, 12, 24 or 25 bits accordingly).

The easiest starting point would be the encoding of 20bits sequence into 24bits compatible UTF-8 sequences: that's exactly the example that was taken above to calculate the possibilites and that's 3 UTF-8 bytes long (so we won't have to worry about unterminated UTF8 chars).

Note that we MUST USE the 2-byte (or above) UTF-8 characters encoding space to reach the 20% overhead. With only 1-byte long UTF8 characters we can only reach 25% overhead with RADIX-24. However, the 3-byte long UTF-8 characters are needless to reach the 20% overhead.

That's the next challenge for this question. Who wants to to play? :)


A proposal of algorithm, I'll name BaseUTF-8 for XML


20 binary bits to encode: ABCDEFGHIJKLMNOPQRST

Resulting UTF-8 string named "encoded": 24 bits long

Mathematical encoding algorithm (not based on any known programming language):

If GH != 00 && NO != 00:
    encoded = 01ABCDEF 0GHIJKLM 0NOPQRST # 20 bits to encode, 21 space bits with restrictions (1-byte UTF-8 char not starting by 000xxxxx ie ASCII control chars)

If ABCD != 0000:
    If GH == 00 && NO == 00: # 16 bits to encode
        encoded = 0010ABCD 01EFIJKL 01MPQRST    
    Else If GH == 00:  # 18 bits to encode, 18 space bits with restrictions (1-byte  UTF-8 ASCII control char, 2-bytes UTF-8 char noncanonical)
        encoded = 0NOFIJKL 110ABCDE 10MPQRST
    Else If NO == 00:  # 18 bits to encode
        encoded = 110ABCDE 10MPQRST 0GHFIJKL

If ABCD == 0000: # 16 bits to encode
    encoded = 0011EFGH 01IJKLMN 01OPQRST

On "encoded" variable apply:
    convert < (0x3C) to Line Feed (0x0A)
    convert > (0x3E) to Cariage Return (0x0D)
    convert & (0x26) to TAB (0x09)

And that's how you get a 20% overhead only.

This algorithm doesn't provide yet a way to manage string termination, when the string to encode is not a multiple of 20. The decoding algorithm has also to be provided, but that's quite easy (just don't forget to throw exceptions to force the unicity of decoding).


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...