Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


string - Translate UTF-8 character encoding function from PHP to Java

I am trying to translate a PHP encoding function into an Android Java method. Java's String.length() counts UTF-16 code units, while PHP's strlen() counts bytes, so the translated Java code does not match the PHP code when encoding the second string, str2, which contains multi-byte UTF-8 characters. The first, ASCII-only string does work.

The original PHP code is:

 function myhash_php($string,$key) {
    $strLen = strlen($string);
    $keyLen = strlen($key);
    $j=0 ; $hash = "" ; 
    for ($i = 0; $i < $strLen; $i++) {
        $ordStr = ord(substr($string,$i,1));
        if ($j == $keyLen) { $j = 0; }
        $ordKey = ord(substr($key,$j,1));
        $j++;
        $hash .= strrev(base_convert(dechex($ordStr + $ordKey),16,36));

    }
    return $hash;  
}
$str1 = "good friend" ;
$str2 = "好友" ;    //  strlen($str2) == 6
$key  = "iuyhjf476" ;
echo "php encode str1 '". $str1 ."'=".myhash_php($str1, $key)."<br>";
echo "php encode str2 '". $str2 ."'=".myhash_php($str2, $key)."<br>";

The PHP output is:

    php encode str1 'good friend'=s5c6g6o5u3o5m4g4b4z516
    php encode str2 '好友'=a9u7m899x6p6

The current Java translation, which produces the wrong result, is:

    public static String   hash_java(String  string, String  key) {
        //Integer strLen  = byteLenUTF8(string) ; // consistent with php strlen("好友")==6
        //Integer keyLen  = byteLenUTF8(key) ;    //   byteLenUTF8("好友") == 6
        Integer strLen  = string.length() ;      //     "好友".length()  ==  2
        Integer keyLen  = key.length() ;
        int j=0 ;
        String  hash = "" ;
        int ordStr, ordKey ;
        for (int i = 0; i < strLen; i++) {
            ordStr = ord_java(string.substring(i,i+1));  //string is String,  php  substr($string,$i,$n)  ==  java string.substring(i, i+n)
            // ordStr = ord_java(string[i]);  //string is byte[], php  substr($string,$i,$n)  ==  java string.substring(i, i+n)
            if (j == keyLen) { j = 0; }
            ordKey = ord_java(key.substring(j,j+1));
            j++;
            hash += strrev(base_convert(dechex(ordStr + ordKey),16,36));
        }
        return hash;
    }
    // return the ASCII code of the first character of str
    public static int      ord_java( String str){
        return( (int)  str.charAt(0)  ) ;
    }
    public static String   dechex(int input  ) {
        String hex  = Integer.toHexString(input ) ;
        return hex ;
    }
    public static String   strrev(String str){
        return  new StringBuilder(str).reverse().toString() ;
    }
    public static String   base_convert(String str, int fromBase, int toBase) {
        return Integer.toString(Integer.parseInt(str, fromBase), toBase);
    }

    String  str1 = "good friend" ;
    String  str2 = "好友" ;
    String  key  = "iuyhjf476" ;
    Log.d(LogTag,"java encode str1 '"+ str1  +"'="+hash_java(str1, key)) ;
    Log.d(LogTag,"java encode str2 '"+ str2  +"'="+hash_java(str2, key)) ;

The Java output is:

java encode str1 'good friend'=s5c6g6o5u3o5m4g4b4z516
java encode str2 '好友'=arh4ng

The Java method's encoded output for the UTF-8 string str2 is not correct. How can I fix the problem?


1 Reply


Do not use literals for testing - this is prone to yield unexpected results if you are not fully aware of what you are doing and how the file is encoded. For UTF-8 you should treat everything as raw bytes and never use a String for en/decoding. Example in PHP:

$test1 = pack( 'H*', '414243' );  // "ABC" in hexadecimal: 2 digits per byte
$test2 = pack( 'H*', 'e5a5bde58f8b' );  // "好友" in hexadecimal, UTF-8 encoded, 3 bytes per character

Example in Java:

byte[] test1 = new byte[] { 0x41, 0x42, 0x43 };  // "ABC"
byte[] test2 = new byte[] { (byte)0xe5, (byte)0xa5, (byte)0xbd, (byte)0xe5, (byte)0x8f, (byte)0x8b };  // "好友"

Only this way can you be sure that your test is set up correctly and does not depend on how the source file is encoded. If your Java file were encoded in UTF-8 and your PHP file in UTF-16LE, you would fail even worse, simply because you have not separated definition (raw bytes) from assumption (strings that depend on the text encoding).
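
To make the difference concrete, here is a minimal Java sketch that compares the UTF-16 length of a String with the number of bytes in its UTF-8 encoding. The class name LengthDemo and its field names are made up for illustration. The text is written with Unicode escapes so that the result does not depend on how the source file is saved.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of the original question.
final class LengthDemo {
    // The two characters from the question, written as Unicode escapes
    static final String TEXT = "\u597d\u53cb";

    // The same text as the raw UTF-8 bytes used in the Java example above
    static final byte[] RAW = {
        (byte) 0xe5, (byte) 0xa5, (byte) 0xbd,
        (byte) 0xe5, (byte) 0x8f, (byte) 0x8b
    };

    public static void main(String[] args) {
        byte[] utf8 = TEXT.getBytes(StandardCharsets.UTF_8);
        System.out.println(TEXT.length());            // 2: UTF-16 code units
        System.out.println(utf8.length);              // 6: bytes, what PHP strlen() counts
        System.out.println(Arrays.equals(utf8, RAW)); // true
    }
}
```

The String only turns into the six bytes PHP works with once an encoding is named explicitly in getBytes().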

(This is also a common misunderstanding when people want to encrypt or decrypt text: they operate on a String in whatever programming language they use rather than on the actual bytes, and then wonder why a different programming language produces different results.)
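
Applied to the function in the question, a byte-based translation might look like the sketch below. It assumes that the PHP side sees both the text and the key as UTF-8 bytes, which is what strlen(), substr() and ord() operate on. The names ByteHash, hash and hashUtf8 are hypothetical and not from the original post, and an empty key is not handled.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class: a byte-wise port of myhash_php from the question.
final class ByteHash {
    // Adds each data byte to the matching key byte (the key wraps around),
    // writes the sum in base 36, reverses those digits and appends them.
    static String hash(byte[] data, byte[] key) {
        StringBuilder out = new StringBuilder();
        int j = 0;
        for (byte b : data) {
            if (j == key.length) { j = 0; }
            int ordStr = b & 0xFF;         // unsigned byte value, like PHP ord()
            int ordKey = key[j++] & 0xFF;
            // dechex() followed by base_convert(.., 16, 36) is the sum in base 36
            String digits = Integer.toString(ordStr + ordKey, 36);
            out.append(new StringBuilder(digits).reverse());
        }
        return out.toString();
    }

    // Convenience wrapper; assumes UTF-8 is the encoding used on the PHP side.
    static String hashUtf8(String text, String key) {
        return hash(text.getBytes(StandardCharsets.UTF_8),
                    key.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] key = "iuyhjf476".getBytes(StandardCharsets.US_ASCII);
        byte[] test2 = {
            (byte) 0xe5, (byte) 0xa5, (byte) 0xbd,
            (byte) 0xe5, (byte) 0x8f, (byte) 0x8b
        };
        System.out.println(hash(test2, key));                     // a9u7m899x6p6
        System.out.println(hashUtf8("good friend", "iuyhjf476")); // s5c6g6o5u3o5m4g4b4z516
    }
}
```

Both printed values match the PHP output quoted in the question. The mask with 0xFF matters because Java bytes are signed, so 0xe5 would otherwise be read as a negative number.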

