Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
273 views
in Technique[技术] by (71.8m points)

Switching to Python 3 causing UnicodeDecodeError

I've just added Python3 interpreter to Sublime, and the following code stopped working:

for directory in directoryList:
    fileList = os.listdir(directory)
    for filename in fileList:
        filename = os.path.join(directory, filename)
        currentFile = open(filename, 'rt')
        for line in currentFile:               ##Here comes the exception.
            currentLine = line.split(' ')
            for word in currentLine:
                if word.lower() not in bigBagOfWords:
                    bigBagOfWords.append(word.lower())
        currentFile.close()

I get a following exception:

  File "/Users/Kuba/Desktop/DictionaryCreator.py", line 11, in <module>
    for line in currentFile:
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 305: ordinal not in range(128)

I found this rather strange, because as far as I know Python3 is supposed to support utf-8 everywhere. What's more, the same exact code works with no problems on Python2.7. I've read about adding environmental variable PYTHONIOENCODING, but I tried it - to no avail (however, it appears it is not that easy to add an environmental variable in OS X Mavericks, so maybe I did something wrong with adding the variable? I modidified /etc/launchd.conf)

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Python 3 decodes text files when reading, encodes when writing. The default encoding is taken from locale.getpreferredencoding(False), which evidently for your setup returns 'ASCII'. See the open() function documenation:

In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.

Instead of relying on a system setting, you should open your text files using an explicit codec:

currentFile = open(filename, 'rt', encoding='latin1')

where you set the encoding parameter to match the file you are reading.

Python 3 supports UTF-8 as the default for source code.

The same applies to writing to a writeable text file; data written will be encoded, and if you rely on the system encoding you are liable to get UnicodeEncodingError exceptions unless you explicitly set a suitable codec. What codec to use when writing depends on what text you are writing and what you plan to do with the file afterward.

You may want to read up on Python 3 and Unicode in the Unicode HOWTO, which explains both about source code encoding and reading and writing Unicode data.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...