Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
820 views
in Technique[技术] by (71.8m points)

encoding - Python to show special characters

I know there are tons of threads regarding this issue but I have not managed to find one which solves my problem.

I am trying to print a string but when printed it doesn't show special characters (e.g. ?, ?, ?, ? and ü). When I print the string using repr() this is what I get:

u'Von Dxc3xbc' and u'xc3x96berg'

Does anyone know how I can convert this to Von Dü and ?berg? It's important to me that these characters are not ignored, e.g. myStr.encode("ascii", "ignore").

EDIT

This is the code I use. I use BeautifulSoup to scrape a website. The contents of a cell (<td>) in a table (<table>), is put into the variable name. This is the variable which contains special characters that I cannot print.

web = urllib2.urlopen(url);
soup = BeautifulSoup(web)
tables = soup.find_all("table")
scene_tables = [2, 3, 6, 7, 10]
scene_index = 0
# Iterate over the <table>s we want to work with
for scene_table in scene_tables:
    i = 0
    # Iterate over < td> to find time and name
    for td in tables[scene_table].find_all("td"):
        if i % 2 == 0:  # td contains the time
            time = remove_whitespace(td.get_text())
        else:           # td contains the name
            name = remove_whitespace(td.get_text()) # This is the variable containing "nonsense"
            print "%s: %s" % (time, name,)
        i += 1
    scene_index += 1
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Prevention is better than cure. What you need is to find out how that rubbish is being created. Please edit your question to show the code that creates it, and then we can help you fix it. It looks like somebody has done:

your_unicode_string =  original_utf8_encoded_bytestring.decode('latin1')

The cure is to reverse the process, simply, and then decode.

correct_unicode_string = your_unicode_string.encode('latin1').decode('utf8')

Update Based on the code that you supplied, the probable cause is that the website declares that it is encoded in ISO-8859-1 (aka latin1) but in reality it is encoded in UTF-8. Please update your question to show us the url.

If you can't show it, read the BS docs; it looks like you'll need to use:

BeautifulSoup(web, from_encoding='utf8')

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...