php - Regex / DOMDocument - match and replace text not in a link

Question

Welcome To Ask or Share your Answers For Others

php - Regex / DOMDocument - match and replace text not in a link

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

php - Regex / DOMDocument - match and replace text not in a link

I need to find and replace all text matches in a case insensitive way, unless the text is within an anchor tag - for example:

<p>Match this text and replace it</p>
<p>Don't <a href="/">match this text</a></p>
<p>We still need to match this text and replace it</p>

Searching for 'match this text' would only replace the first instance and last instance.

[Edit] As per Gordon's comment, it may be preferred to use DOMDocument in this instance. I'm not at all familiar with the DOMDocument extension, and would really appreciate some basic examples for this functionality.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-17T00:34:53+0000

Here is an UTF-8 safe solution, which not only works with properly formatted documents, but also with document fragments.

The mb_convert_encoding is needed, because loadHtml() seems to has a bug with UTF-8 encoding (see here and here).

The mb_substr is trimming the body tag from the output, this way you get back your original content without any additional markup.

<?php
$html = '<p>Match this text and replace it</p>
<p>Don't <a href="/">match this text</a></p>
<p>We still need to match this text and replace it??</p>
<p>This is <a href="#">a link <span>with <strong>don't match this text</strong> content</span></a></p>';

$dom = new DOMDocument();
// loadXml needs properly formatted documents, so it's better to use loadHtml, but it needs a hack to properly handle UTF-8 encoding
$dom->loadHtml(mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8"));

$xpath = new DOMXPath($dom);

foreach($xpath->query('//text()[not(ancestor::a)]') as $node)
{
    $replaced = str_ireplace('match this text', 'MATCH', $node->wholeText);
    $newNode  = $dom->createDocumentFragment();
    $newNode->appendXML($replaced);
    $node->parentNode->replaceChild($newNode, $node);
}

// get only the body tag with its contents, then trim the body tag itself to get only the original content
echo mb_substr($dom->saveXML($xpath->query('//body')->item(0)), 6, -7, "UTF-8");

References:
1. find and replace keywords by hyperlinks in an html fragment, via php dom
2. Regex / DOMDocument - match and replace text not in a link
3. php problem with russian language
4. Why Does DOM Change Encoding?

I read dozens of answers in the subject, so I am sorry if I forgot somebody (please comment it and I will add yours as well in this case).

Thanks for Gordon and stillstanding for commenting on my other answer.

Categories

php - Regex / DOMDocument - match and replace text not in a link

php - Regex / DOMDocument - match and replace text not in a link

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags