Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - How to change specific URLs to text using the re module?

I have text. For example:

<a href="https://google.com">Google</a> Lorem ipsum dolor 
sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. 
<br />
<br />
#<a href="#something">somethin</a> #<a href="#somethingelse">somethinelse</a>"

I want to change the URLs preceded by "#" to normal text (e.g. wrapped in <b></b> tags). The other URLs should remain unchanged.

I tried to use the re module, but the result was not quite successful.

import re
cond = re.compile('#<.*?>')
output = re.sub(cond, "#", '#<a href="stuff1">stuff1</a>')
print(output)

Output:

#stuff1</a>

The closing </a> still remains at the end of the text.



1 Reply


You're close! Your pattern, '#<.*?>', only matches the opening tag. Try this:

r'#<a href=".*?">(.*?)</a>'

This is also a little more specific, in that it will only match <a> tags. Also note that it's good practice to specify regular expressions as raw string literals (the r at the beginning). The parentheses, (.*?), form a capturing group. From the docs:

(...)

Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below.

You can refer back to this group in your replacement argument as \g<#>, where # is the number of the group you want. We've only defined one group, so it's naturally the first one: \g<1>.
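To make the difference between the two patterns concrete, here is a small sketch run against the string from the question. The variable names (text, opening_only, match, replaced) are only illustrative and are not part of the original code.

```python
import re

text = '#<a href="stuff1">stuff1</a>'

# The original pattern stops at the first '>', so it covers only the opening tag.
opening_only = re.findall(r'#<.*?>', text)
print(opening_only)      # ['#<a href="stuff1">']

# The revised pattern spans the whole element and captures the link text.
match = re.search(r'#<a href=".*?">(.*?)</a>', text)
print(match.group(0))    # #<a href="stuff1">stuff1</a>
print(match.group(1))    # stuff1

# In the replacement string, \g<1> inserts whatever group 1 captured.
replaced = re.sub(r'#<a href=".*?">(.*?)</a>', r'#\g<1>', text)
print(replaced)          # #stuff1
```

The shorter form \1 should behave the same way in this replacement; \g<1> is simply unambiguous when the group reference is followed by a digit.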

Additionally, once you've compiled a regular expression, you can call its own sub method:

pattern = re.compile(r'my pattern')
pattern.sub(r'replacement', 'text')

Usually the re.sub method is for when you haven't compiled:

re.sub(r'my pattern', r'replacement', 'text')

The performance difference is usually negligible, so use whichever makes your code clearer. (Personally, I usually prefer compiling. Like any other variables, compiled patterns let me use clear, reusable names.)

So your code would be:

import re

pound_links = re.compile(r'#<a href=".*?">(.*?)</a>')
output = pound_links.sub(r'#\g<1>', '#<a href="stuff1">stuff1</a>')

print(output)

Or:

import re

output = re.sub(r'#<a href=".*?">(.*?)</a>',
                r'#\g<1>',
                '#<a href="stuff1">stuff1</a>')

print(output)

Either one outputs:

#stuff1
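For the full case described in the question, the same pattern could be applied to the whole text, with the replacement wrapping the captured link text in <b></b>. This is only a sketch: the sample text is abbreviated from the question, and keeping the leading "#" in front of the bold text is an assumption about the desired output.

```python
import re

# Abbreviated sample text from the question (link texts copied as posted).
text = (
    '<a href="https://google.com">Google</a> Lorem ipsum dolor sit amet.\n'
    '<br />\n'
    '#<a href="#something">somethin</a> #<a href="#somethingelse">somethinelse</a>'
)

# Match only anchors that directly follow a literal '#'.
pound_links = re.compile(r'#<a href=".*?">(.*?)</a>')

# Keep the '#' (an assumption) and wrap the captured link text in <b></b>.
result = pound_links.sub(r'#<b>\g<1></b>', text)
print(result)
```

The Google link is left untouched because it is not preceded by "#", while both "#" links become #<b>somethin</b> and #<b>somethinelse</b>.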
