Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

regex - Regular expression replace a word by a link

I want to write a regular expression that replaces the word Paris with a link, but only where the word is not already part of a link.

Example:

    i'm living <a href="Paris" atl="Paris link">in Paris</a>, near Paris <a href="gare">Gare du Nord</a>,  i love Paris.

would become

    i'm living.........near <a href="">Paris</a>..........i love <a href="">Paris</a>.

1 Reply

This is hard to do in one step. Writing a single regex that does that is virtually impossible.

Try a two-step approach.

  1. Put a link around every "Paris" there is, regardless of whether it already sits inside another link.
  2. Find all incorrectly nested links (<a href="..."><a href="...">Paris</a></a>), and eliminate the inner link.

Regex for step one is dead-simple:

Paris
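As a minimal sketch of step one in Python (the sample string is made up, and the empty href is just the placeholder from the question's expected output). One caveat: a plain replace also rewrites "Paris" inside tag attributes, such as href="Paris" in the question's input, so the sample assumes no attribute contains the word.

```python
import re

# Hypothetical sample input; no tag attribute contains the word "Paris".
text = 'living <a href="city">in Paris</a>, near Paris.'

# Step 1: wrap every "Paris" in a link, even where a link already exists.
# \g<0> in the replacement stands for the whole match.
step1 = re.sub(r"Paris", r'<a href="">\g<0></a>', text)

print(step1)
# living <a href="city">in <a href="">Paris</a></a>, near <a href="">Paris</a>.
```

The first "Paris" is now double-linked, which is exactly what step two cleans up.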

Regex for step two is slightly more complex:

(<a[^>]+>(?:(?!</a>).)*?)<a[^>]+>(Paris)</a>

Use that one on the whole string and replace each match with the content of match groups 1 and 2, effectively removing the surplus inner link.

Explanation of regex #2 in plain words:

  • Find an opening link tag (<a[^>]+>), followed by as few characters as possible, none of which may begin a closing link tag ((?:(?!</a>).)*?). Save all of it into match group 1. Checking the look-ahead before every character keeps group 1 from running past the end of one link and into the next.
  • Now look for the next opening link tag (<a[^>]+>). Make sure it is there, but do not save it.
  • Now look for the word Paris. Save it into match group 2.
  • Look for a closing link tag (</a>). Make sure it is there, but do not save it.
  • Replace everything with the content of groups 1 and 2, thereby losing everything you did not save.
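A small Python sketch of step two, applied to a made-up step-one result. The pattern is written here with the look-ahead applied before every character of group 1, so that the group cannot run past a closing </a>. The variable names are only illustrative.

```python
import re

# Hypothetical output of step one: the first "Paris" is nested in two links.
step1 = ('living <a href="city">in <a href="">Paris</a></a>, '
         'near <a href="">Paris</a>.')

# Group 1: outer opening tag plus text that contains no "</a>".
# Group 2: the word itself.
# The inner <a ...> and its </a> are matched but not captured.
nested = re.compile(r'(<a[^>]+>(?:(?!</a>).)*?)<a[^>]+>(Paris)</a>')

# Keep groups 1 and 2, drop the inner link around the word.
step2 = nested.sub(r"\1\2", step1)

print(step2)
# living <a href="city">in Paris</a>, near <a href="">Paris</a>.
```

The second "Paris" keeps its link because no opening tag precedes it without a closing </a> in between.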

The approach assumes these side conditions:

  • Your input HTML is not horribly broken.
  • Your regex flavor supports non-greedy quantifiers (such as *?) and zero-width negative look-ahead assertions ((?!...)).
  • You wrap the word "Paris" only in a link in step 1, with no additional characters. Every "Paris" must become "<a href="...">Paris</a>", or step two will fail (until you adapt the second regex).
  • BTW: regex #2 explicitly allows for constructs like this:

    <a href="">in the <b>capital of France</b>, <a href="">Paris</a></a>

    The surplus inner link comes from step one; the result of the replacement in step two will be:

    <a href="">in the <b>capital of France</b>, Paris</a>
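Putting both steps together, here is a hedged Python sketch. The function name link_word and its parameters are hypothetical, the look-ahead is applied per character so group 1 stops at the first </a>, and the repeat-until-stable loop is an addition of this sketch to cover links that contain the word more than once. It still assumes reasonably well-formed HTML and that no tag attribute contains the word.

```python
import re


def link_word(html: str, word: str = "Paris", href: str = "") -> str:
    """Link every `word` that is not already inside a link (two-step approach)."""
    w = re.escape(word)

    # Step 1: link every occurrence, exactly as <a href="...">word</a>.
    html = re.sub(w, lambda m: f'<a href="{href}">{m.group(0)}</a>', html)

    # Step 2: remove links that ended up nested inside an existing link.
    # DOTALL lets group 1 span line breaks inside a link.
    nested = re.compile(rf'(<a[^>]+>(?:(?!</a>).)*?)<a[^>]+>({w})</a>', re.DOTALL)

    # One pass removes only the first inner link per outer link,
    # so repeat until nothing is left to remove.
    while True:
        html, count = nested.subn(r"\1\2", html)
        if count == 0:
            return html


# The question's sentence, with the attribute values changed so that
# they no longer contain the word itself.
sample = ('i\'m living <a href="city">in Paris</a>, near Paris '
          '<a href="gare">Gare du Nord</a>, i love Paris.')
print(link_word(sample))
```

With this input, only the second and third "Paris" end up wrapped in a new link; the one inside the existing link is left as it was.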

