Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

php - Proper way to decode incoming email subject (UTF-8)

I'm trying to pipe my incoming mail to a PHP script so I can store it in a database, among other things. I'm using the class MIME E-mail message parser (registration required), although I don't think that's important.

I have a problem with email subjects. It works fine when the title is in English, but if the subject uses non-Latin characters I get something like

=?UTF-8?B?2KLYstmF2KfbjNi0?=

for a title like ?? ?? ??

I decode the subject like this:

  $subject  = str_replace('=?UTF-8?B?' , '' , $subject);
  $subject  = str_replace('?=' , '' , $subject);      
  $subject = base64_decode($subject); 

It works fine with short subjects of about 10-15 characters, but with a longer title I get half of the original title followed by something like ??? at the end.

If the title is even longer, around 30 characters, I get nothing. Am I doing this right?


1 Reply

This question is almost a year old, but I found it while facing a similar problem.

I'm not sure why you're getting odd characters, but perhaps you are displaying them somewhere that does not support the charset.

Here's some code I wrote that should handle everything except the charset conversion, which is a large problem that many libraries (PHP's mbstring extension, for instance) handle much better.

class mail {
    /**
     * If you change one of these, please check the other for fixes as well
     *
     * @const Pattern to match RFC 2047 charset encodings in mail headers
     */
    const rfc2047header = '/=\?([^ ?]+)\?([BQbq])\?([^ ?]+)\?=/';

    // The same pattern twice, separated by whitespace (two adjacent encoded-words)
    const rfc2047header_spaces = '/(=\?[^ ?]+\?[BQbq]\?[^ ?]+\?=)\s+(=\?[^ ?]+\?[BQbq]\?[^ ?]+\?=)/';

    /**
     * http://www.rfc-archive.org/getrfc.php?rfc=2047
     *
     * =?<charset>?<encoding>?<data>?=
     *
     * @param string $header
     */
    public static function is_encoded_header($header) {
        // e.g. =?utf-8?q?Re=3a=20Support=3a=204D09EE9A=20=2d=20Re=3a=20Support=3a=204D078032=20=2d=20Wordpress=20Plugin?=
        // e.g. =?utf-8?q?Wordpress=20Plugin?=
        return preg_match(self::rfc2047header, $header) !== 0;
    }

    public static function header_charsets($header) {
        $matches = null;
        if (!preg_match_all(self::rfc2047header, $header, $matches, PREG_PATTERN_ORDER)) {
            return array();
        }
        return array_map('strtoupper', $matches[1]);
    }

    public static function decode_header($header) {
        $matches = null;

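        /* Repair instances where two encodings are together and separated by a space (strip the spaces).
           A single pass only joins non-overlapping pairs, so repeat until nothing changes;
           otherwise three or more adjacent encoded-words keep a stray space. */
        do {
            $header = preg_replace(self::rfc2047header_spaces, '$1$2', $header, -1, $count);
        } while ($count > 0);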

        /* Now see if any encodings exist and match them */
        if (!preg_match_all(self::rfc2047header, $header, $matches, PREG_SET_ORDER)) {
            return $header;
        }
        foreach ($matches as $header_match) {
            list($match, $charset, $encoding, $data) = $header_match;
            $encoding = strtoupper($encoding);
            switch ($encoding) {
                case 'B':
                    $data = base64_decode($data);
                    break;
                case 'Q':
                    $data = quoted_printable_decode(str_replace("_", " ", $data));
                    break;
                default:
                    throw new Exception("preg_match_all is busted: didn't find B or Q in encoding $header");
            }
            // This part needs to handle every charset
            switch (strtoupper($charset)) {
                case "UTF-8":
                    break;
                default:
                    /* Here's where you should handle other character sets! */
                    throw new Exception("Unknown charset in header - time to write some code.");
            }
            $header = str_replace($match, $data, $header);
        }
        return $header;
    }
}
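
The default branch of the charset switch is where other character sets would be handled. As a minimal sketch, assuming the mbstring extension is loaded, a helper along these lines could convert the decoded bytes to UTF-8. The function name header_data_to_utf8 is hypothetical and not part of the class above.

```php
<?php
/**
 * Hypothetical helper, not part of the mail class above: convert decoded
 * header bytes from the charset named in the encoded-word to UTF-8.
 * Assumes the mbstring extension is loaded.
 */
function header_data_to_utf8(string $data, string $charset): string
{
    if (strcasecmp($charset, 'UTF-8') === 0) {
        return $data; // already UTF-8, nothing to convert
    }
    try {
        // PHP 7 returns false (with a warning) for an unknown charset
        $converted = @mb_convert_encoding($data, 'UTF-8', $charset);
    } catch (ValueError $e) {
        // PHP 8 throws ValueError for an unknown charset
        $converted = false;
    }
    if ($converted === false) {
        throw new Exception("Unsupported charset in header: $charset");
    }
    return $converted;
}

// "\xE9" is "é" in ISO-8859-1; UTF-8 encodes it as the two bytes C3 A9.
var_dump(header_data_to_utf8("\xE9", 'ISO-8859-1') === "\xC3\xA9");
```

With something like this in place, the second switch in decode_header() could be reduced to a single call such as $data = header_data_to_utf8($data, $charset); whether mbstring covers every charset you receive is something to verify against your own mail.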

When decode_header() is run on the subject from the question and the output is displayed in a browser using UTF-8, the result is:

آزمایش

You would run it like so:

$decoded = mail::decode_header("=?UTF-8?B?2KLYstmF2KfbjNi0?=");
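
For longer subjects, the sending client usually folds the text into several encoded-words separated by whitespace. Each one is base64-encoded on its own, so they have to be decoded one at a time rather than concatenated first. If you would rather not maintain the parsing yourself, PHP's built-in iconv_mime_decode() may be enough. A short sketch, assuming the iconv extension is enabled and using a hand-split, two-part version of the subject from the question for illustration:

```php
<?php
// The subject from the question, split by hand into two encoded-words the
// way a mail client might fold a long header (illustrative input only).
$subject = "=?UTF-8?B?2KLYstmF?= =?UTF-8?B?2KfbjNi0?=";

// iconv_mime_decode() is a PHP built-in from the iconv extension. The second
// argument is the mode (0 = default) and the third is the charset the
// decoded result should be converted to.
$decoded = iconv_mime_decode($subject, 0, 'UTF-8');

echo $decoded, "\n";
```

The built-in also converts from the charset named in each encoded-word, which is the part the class above leaves open.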
