php - loadHTML LIBXML_HTML_NOIMPLIED on an html fragment generates incorrect tags

Question

posted Oct 17, 2021 in Technique by 深蓝 (71.8m points)

Using the LIBXML_HTML_NOIMPLIED flag with an html fragment generates incorrect tags:

$str = '<p>Lorem ipsum dolor sit amet.</p><p>Nunc vel vehicula ante.</p>';
$doc = new DOMDocument();
$doc->loadHTML($str, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
echo $doc->saveHTML();

Outputs:

<p>Lorem ipsum dolor sit amet.<p>Nunc vel vehicula ante.</p></p>

I have found hacks that work around this with regexes, but that defeats the purpose of using DOM. I have tested this with several versions of libxml and PHP, most recently libxml 2.9.2 with PHP 5.6.7 (Debian Jessie). Any suggestions appreciated.

1 Reply

深蓝 · Answer 1 · 2021-10-17T01:06:51+0000

The re-arrangement is done by the LIBXML_HTML_NOIMPLIED option you're using. Looks like it's not stable enough for your case.

You might also want to avoid it for portability reasons. For example, I have a PHP 5.4.36 build with Libxml 2.7.8 at hand that does not support LIBXML_HTML_NOIMPLIED (Libxml >= 2.7.7), although it does support the later LIBXML_HTML_NODEFDTD option (Libxml >= 2.7.8).
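To make the portability concern concrete, here is a minimal sketch of a guard that only uses the flags a given PHP build actually defines. The `$options` variable and the loop are illustrative assumptions, not part of the original answer; `defined()`, `constant()` and `LIBXML_DOTTED_VERSION` are standard PHP.

```php
<?php
// Build the option bitmask only from flags this PHP/libxml build defines.
// $options is an illustrative name, not something from the original answer.
$options = 0;
foreach (['LIBXML_HTML_NOIMPLIED', 'LIBXML_HTML_NODEFDTD'] as $name) {
    if (defined($name)) {
        $options |= constant($name);
    } else {
        echo "$name is not available in this build\n";
    }
}

// Report which libxml version PHP was compiled against.
printf("libxml %s, options bitmask: %d\n", LIBXML_DOTTED_VERSION, $options);
```

A flag being defined only tells you it can be passed; it says nothing about whether the parser handles a multi-element fragment the way you expect.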

One way I know of dealing with it is to wrap the fragment in a <div> element when you load it:

$doc->loadHTML("<div>$str</div>");

This helps to guide DOMDocument on the structure you want.
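To see what the wrapper does to the parsed structure, the following sketch prints the node tree after loading. `dumpTree` is a hypothetical helper written for this illustration; the exact nodes printed above the <div> (doctype, html, body) depend on the parser's defaults.

```php
<?php
$str = '<p>Lorem ipsum dolor sit amet.</p><p>Nunc vel vehicula ante.</p>';

$doc = new DOMDocument();
$doc->loadHTML("<div>$str</div>");

// Hypothetical helper: print each node name, indented by its depth in the tree.
function dumpTree(DOMNode $node, int $depth = 0): void
{
    echo str_repeat('  ', $depth), $node->nodeName, "\n";
    if ($node->hasChildNodes()) {
        foreach ($node->childNodes as $child) {
            dumpTree($child, $depth + 1);
        }
    }
}

dumpTree($doc);

// The wrapper <div> should hold both paragraphs as siblings.
$div = $doc->getElementsByTagName('div')->item(0);
echo "children of <div>: ", $div->childNodes->length, "\n";
```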

You can then extract this container from the document itself:

$container = $doc->getElementsByTagName('div')->item(0);
$container = $container->parentNode->removeChild($container);

And then remove all children from the document:

while ($doc->firstChild) {
    $doc->removeChild($doc->firstChild);
}

The document is now completely empty, so you can append children to it again. The <div> container we removed earlier still holds the fragment's nodes, so we move them over from it:

while ($container->firstChild) {
    $doc->appendChild($container->firstChild);
}

The fragment can then be retrieved with the usual saveHTML method:

echo $doc->saveHTML();

Which gives in your scenario:

<p>Lorem ipsum dolor sit amet.</p><p>Nunc vel vehicula ante.</p>

This approach differs a little from the existing material on this site (see the references below), so here is the complete example in one piece:

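$str = '<p>Lorem ipsum dolor sit amet.</p><p>Nunc vel vehicula ante.</p>';

$doc = new DOMDocument();
// Wrap the fragment in a <div> so the parser keeps the paragraphs as siblings.
$doc->loadHTML("<div>$str</div>");

// Detach the wrapper <div> from the parsed document.
$container = $doc->getElementsByTagName('div')->item(0);
$container = $container->parentNode->removeChild($container);

// Empty the document (this drops the doctype and the <html> element).
while ($doc->firstChild) {
    $doc->removeChild($doc->firstChild);
}

// Move the wrapper's children to the top level of the document.
while ($container->firstChild) {
    $doc->appendChild($container->firstChild);
}

echo $doc->saveHTML();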

I also recommend the reference question "How to saveHTML of DOMDocument without HTML wrapper?" for further reading, as well as the one about inner-html.
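For comparison, here is a sketch of the inner-html style variation. It is not the method shown above: instead of re-parenting nodes, it serializes each child of the wrapper <div> by passing the node to saveHTML. The function name `fragmentInnerHtml` is hypothetical, and the sketch assumes ASCII input with no special handling of character encoding.

```php
<?php
// Hypothetical helper: parse a fragment inside a wrapper <div> and return
// the serialized children of that wrapper as one string.
function fragmentInnerHtml(string $html): string
{
    $doc = new DOMDocument();
    $doc->loadHTML("<div>$html</div>");

    // The wrapper is the first <div> in document order.
    $container = $doc->getElementsByTagName('div')->item(0);

    $out = '';
    foreach ($container->childNodes as $child) {
        // saveHTML() accepts a node and serializes only that subtree.
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$result = fragmentInnerHtml(
    '<p>Lorem ipsum dolor sit amet.</p><p>Nunc vel vehicula ante.</p>'
);
echo $result, "\n";
```

This variation leaves the parsed document untouched, which may be convenient if you only need the string. The re-parenting approach is the better fit when you want to keep working with a DOMDocument that contains just the fragment.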
