Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


php - DOMDocument encoding problems / characters transformed

I am using DOMDocument to modify HTML before it is output to the page. The input is only an HTML fragment, not a complete page. My initial problem was that all French characters got garbled, which I was able to correct after some trial and error. Now only one problem seems to remain: the ’ character (typographic apostrophe) gets transformed into ?.

The code :

<?php
    $dom = new DOMDocument('1.0', 'utf-8');
    $dom->loadHTML(utf8_decode($row->text));

    // Some pretty basic modification here, not even related to text

    // Reinsert the HTML, and make sure to remove the DOCTYPE, html and body tags that get added automatically.
    $row->text = utf8_encode(preg_replace('/^<!DOCTYPE.+?>/', '', str_replace(array('<html>', '</html>', '<body>', '</body>'), array('', '', '', ''), $dom->saveHTML())));
?>

I know the utf8_decode/utf8_encode calls are getting messy, but this is the only way I could make it work so far. Here is a sample string:

Input: Sans doute parce qu’il vient d’atteindre une date déterminante dans son spectaculaire cheminement

Output: Sans doute parce qu?il vient d?atteindre une date déterminante dans son spectaculaire cheminement

If I find any more details, I'll add them. Thank you for your time and support!


1 Reply


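Don't use utf8_decode. If your text is in UTF-8, pass it as such. utf8_decode converts the string to ISO-8859-1, which has no code point for the typographic apostrophe ’ (U+2019), so that character is replaced with ?, and utf8_encode cannot bring it back afterwards.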
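
The character loss can be reproduced outside DOMDocument. This is a minimal sketch, assuming the mbstring extension is available; it uses mb_convert_encoding in place of utf8_decode/utf8_encode (which are deprecated in recent PHP versions), and the variable names are only illustrative.

```php
<?php
// "\u{2019}" is the typographic apostrophe, "\u{E9}" is é (PHP 7+ escape syntax).
$text = "qu\u{2019}il d\u{E9}terminante";

// Replace characters that cannot be represented with "?" (this is also the default).
mb_substitute_character(0x3F);

// UTF-8 -> ISO-8859-1: é exists in ISO-8859-1, U+2019 does not.
$latin1 = mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');

// ISO-8859-1 -> UTF-8: the "?" stays a "?"; the original character is gone.
$roundTrip = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

echo $roundTrip, PHP_EOL;
```

The accented letter survives the round trip because ISO-8859-1 contains it, which matches the sample output in the question where only the apostrophe is damaged.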

Unfortunately, DOMDocument defaults to ISO-8859-1 (Latin-1) when loading HTML. The behavior seems to be this:

  • If fetching a remote document, it should deduce the encoding from the HTTP headers.
  • If the header wasn't sent or the file is local, it looks for the corresponding meta http-equiv tag.
  • Otherwise, it defaults to ISO-8859-1.

Example of it working:

<?php
$s = <<<HTML
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
Sans doute parce qu’il vient d’atteindre une date déterminante
dans son spectaculaire cheminement
</body>
</html>
HTML;

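libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$d = new DOMDocument();
$d->loadHTML($s); // the meta tag in $s tells the parser the input is UTF-8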

echo $d->textContent;
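
For the fragment case in the question, the same idea could be applied by wrapping the fragment in a minimal document that carries the meta tag, then serializing only the children of body. This is a sketch under assumptions, not the answerer's code: processFragment is a hypothetical helper name, and whether non-ASCII characters come back as raw UTF-8 or as HTML entities may depend on the libxml version.

```php
<?php
// Hypothetical helper: run DOM modifications on a UTF-8 HTML fragment and
// return the fragment without the wrapper tags added for parsing.
function processFragment(string $fragment): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);

    // The meta tag declares the encoding, so no utf8_decode() is needed.
    $dom->loadHTML(
        '<html><head>'
        . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        . '</head><body>' . $fragment . '</body></html>'
    );

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // ... DOM modifications would go here ...

    // Serialize only what is inside <body>, so no DOCTYPE/html/body
    // stripping with preg_replace or str_replace is required.
    $body = $dom->getElementsByTagName('body')->item(0);
    $html = '';
    foreach ($body->childNodes as $child) {
        $html .= $dom->saveHTML($child);
    }
    return $html;
}

$out = processFragment("<p>qu\u{2019}il vient d\u{2019}atteindre une date d\u{E9}terminante</p>");
echo $out, PHP_EOL;
```

Serializing the child nodes one by one avoids the string surgery on the saveHTML() output, at the cost of an explicit loop.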

And with XML (default is UTF-8):

<?php
$s = '<x>Sans doute parce qu’il vient d’atteindre une date déterminante ' .
    'dans son spectaculaire cheminement</x>';
libxml_use_internal_errors(true);
$d = new DOMDocument();
$d->loadXML($s); // XML without an encoding declaration is parsed as UTF-8

echo $d->textContent;
