Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

unicode - Python 3 UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d

I want to build a search engine, and I am following a tutorial I found online. I want to test parsing HTML:

from bs4 import BeautifulSoup

def parse_html(filename):
    """Extract the Author, Title and Text from a HTML file
    which was produced by pdftotext with the option -htmlmeta."""
    with open(filename) as infile:
        html = BeautifulSoup(infile, "html.parser", from_encoding='utf-8')
        d = {'text': html.pre.text}
        if html.title is not None:
            d['title'] = html.title.text
        for meta in html.findAll('meta'):
            try:
                if meta['name'] in ('Author', 'Title'):
                    d[meta['name'].lower()] = meta['content']
            except KeyError:
                continue
        return d

parse_html(r"C:\pdf\pydf\data\muellner2011.html")

and I get this error:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 867: character maps to <undefined>

I have seen some solutions online that use encode(), but I don't know how to add encode() to my code. Can anyone help?


1 Reply

In Python 3, files are opened as text (decoded to Unicode) for you; you don't need to tell BeautifulSoup what codec to decode from.

If decoding of the data fails, that's because you didn't tell the open() call what codec to use when reading the file; add the correct codec with an encoding argument:

with open(filename, encoding='utf8') as infile:
    html = BeautifulSoup(infile, "html.parser")

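otherwise the file will be opened with your system default codec, which is OS dependent. On Windows that default is commonly cp1252, a 'charmap' codec with no character assigned to byte 0x9d, which is why the traceback names that byte.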
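As a rough illustration of the point above, here is a small standard-library sketch that reproduces the failure and the fix without BeautifulSoup. The sample text, the helper names read_with() and demo(), and the choice of cp1252 as the "wrong" codec are assumptions made for the example; the file in the question may use a different encoding.

```python
import os
import tempfile

# Hypothetical sample: U+201D (right double quotation mark) is stored as the
# bytes E2 80 9D in UTF-8, so the file contains a 0x9d byte.
SAMPLE = "<pre>He said \u201chello\u201d</pre>"


def read_with(path, encoding):
    """Read the whole file as text, decoding it with the given codec."""
    with open(path, encoding=encoding) as infile:
        return infile.read()


def demo():
    """Write SAMPLE as UTF-8, then try to read it back with two codecs."""
    fd, path = tempfile.mkstemp(suffix=".html")
    try:
        with os.fdopen(fd, "wb") as out:
            out.write(SAMPLE.encode("utf-8"))

        # cp1252 has no character for byte 0x9d, so decoding fails.
        try:
            read_with(path, "cp1252")
            cp1252_failed = False
        except UnicodeDecodeError:
            cp1252_failed = True

        # Naming the codec the file was actually written in succeeds.
        text = read_with(path, "utf-8")
    finally:
        os.remove(path)
    return cp1252_failed, text


if __name__ == "__main__":
    failed, text = demo()
    print("cp1252 failed:", failed)
    print("utf-8 round trip ok:", text == SAMPLE)
```

To see which codec open() uses on your machine when no encoding is given, locale.getpreferredencoding(False) reports the default. If utf8 also fails for your file, the file was probably saved in some other encoding, and that encoding's name is what should be passed to open().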

