Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Removing \u2018 and \u2019 characters

I am using Beautiful Soup to parse webpages and print the name of each page visited to the terminal. However, the name often contains the left (\u2018) and right (\u2019) single quotation marks, which Python can't print: it raises a charmap encoding error. Is there any way to remove these characters?



1 Reply


These codes are the Unicode code points for the left and right single quotation marks. You can replace them with their ASCII equivalent, which Python shouldn't have any problem printing on your system:

>>> print u"\u2018Hi\u2019"
‘Hi’
>>> print u"\u2018Hi\u2019".replace(u"\u2018", "'").replace(u"\u2019", "'")
'Hi'

Alternatively, with a regex:

>>> import re
>>> s = u"\u2018Hi\u2019"
>>> print re.sub(u"(\u2018|\u2019)", "'", s)
'Hi'
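The same replacement can also be done in a single pass with a translation table. The following is a minimal sketch in Python 3 syntax (the snippets above are Python 2); the names `QUOTE_MAP` and `ascii_quotes` are made up for illustration and are not part of Beautiful Soup or the standard library.

```python
# Translation table mapping the two curly single quotes to a plain apostrophe.
# QUOTE_MAP and ascii_quotes are hypothetical names used only in this sketch.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK
})


def ascii_quotes(text):
    """Return text with curly single quotes replaced by ASCII apostrophes."""
    return text.translate(QUOTE_MAP)


title = ascii_quotes("\u2018Hi\u2019")
print(title)
```

More characters (for example the double quotes \u201c and \u201d) can be added to the table without chaining further `replace` calls.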

However, Python shouldn't have any problem printing the Unicode version of these characters either. It's possible that you are calling str() somewhere, which tries to convert your unicode object to ASCII and raises the exception.
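If the error comes from the terminal's encoding rather than from a str() call, a fallback is to substitute unencodable characters instead of letting the encode step fail. The sketch below is Python 3 and assumes, purely for illustration, that the console uses the cp437 code page, which has no curly quotes; the question does not say which encoding is actually in use, and `safe_for_console` is a hypothetical helper name.

```python
def safe_for_console(text, encoding="cp437"):
    """Round-trip text through encoding, replacing unencodable characters with '?'.

    The default of cp437 is an assumption for this sketch, not a detected value.
    """
    return text.encode(encoding, errors="replace").decode(encoding)


title = "\u2018Hi\u2019"

# Strict encoding fails because cp437 has no mapping for U+2018 or U+2019.
try:
    title.encode("cp437")
    encodable = True
except UnicodeEncodeError:
    encodable = False

print(encodable)
print(safe_for_console(title))
```

This keeps the rest of the text intact at the cost of turning each unsupported character into a question mark, so replacing the quotes explicitly, as shown earlier, gives more readable output.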

