Is there a unicode-ready substitute I can use for urllib.quote and urllib.unquote in Python 2.6.5?

posted Oct 7, 2021 in Technique[技术] by 深蓝 (71.8m points)

Python's urllib.quote and urllib.unquote do not handle Unicode correctly in Python 2.6.5. This is what happens:

In [5]: print urllib.unquote(urllib.quote(u'Cataño'))
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)

/home/kkinder/<ipython console> in <module>()

/usr/lib/python2.6/urllib.pyc in quote(s, safe)
   1222             safe_map[c] = (c in safe) and c or ('%%%02X' % i)
   1223         _safemaps[cachekey] = safe_map
-> 1224     res = map(safe_map.__getitem__, s)
   1225     return ''.join(res)
   1226 

KeyError: u'\xc3'

Encoding the value to UTF8 also does not work:

In [6]: print urllib.unquote(urllib.quote(u'Cataño'.encode('utf8')))
CataÃ±o

It's recognized as a bug and there is a fix, but not for my version of Python.

What I'd like is something similar to urllib.quote/urllib.unquote that handles unicode values correctly, such that this code would work:

decode_url(encode_url(u'Cataño')) == u'Cataño'

Any recommendations?

question from:https://stackoverflow.com/questions/5557849/is-there-a-unicode-ready-substitute-i-can-use-for-urllib-quote-and-urllib-unquot

1 Reply

深蓝 · Answer 1 · 2021-10-06T17:19:38+0000

Python's urllib.quote and urllib.unquote do not handle Unicode correctly

urllib does not handle Unicode at all. URLs don't contain non-ASCII characters, by definition. When you're dealing with urllib you should use only byte strings. If you want those to represent Unicode characters you will have to encode and decode them manually.

IRIs can contain non-ASCII characters, encoding them as UTF-8 sequences, but Python doesn't, at this point, have an irilib.

Encoding the value to UTF8 also does not work:

In [6]: print urllib.unquote(urllib.quote(u'Cataño'.encode('utf8')))
CataÃ±o

Ah, well now you're typing Unicode into a console, and doing print-Unicode to the console. This is generally unreliable, especially in Windows and in your case with the IPython console.
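
One plausible reading of the garbled output quoted above is that the round trip returned the right UTF-8 bytes and the console then displayed them as Latin-1. The sketch below reproduces that effect without touching a console encoding at all. The Latin-1 guess is an assumption about the asker's terminal, not something the question confirms, and the variable names are illustrative.

```python
# Build the string from an escape so the source file's encoding does not matter.
text = u'Cata\u00f1o'

# The bytes urllib.quote/unquote would carry: UTF-8 encodes U+00F1 as 0xC3 0xB1.
utf8_bytes = text.encode('utf-8')

# Assumption: the display layer decodes those bytes as Latin-1 instead of UTF-8,
# which turns each byte into its own character.
misread = utf8_bytes.decode('latin-1')

# repr() shows the code points without relying on the console's encoding.
print(repr(misread))
```

The two extra characters are U+00C3 and U+00B1, which is what the two UTF-8 bytes of the original character look like when each is treated as a separate Latin-1 character.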

Type it out the long way with backslash sequences and you can more easily see that the urllib bit does actually work:

>>> import urllib
>>> u'Cata\u00F1o'.encode('utf-8')
'Cata\xC3\xB1o'
>>> urllib.quote(_)
'Cata%C3%B1o'

>>> urllib.unquote(_)
'Cata\xC3\xB1o'
>>> _.decode('utf-8')
u'Cata\xF1o'
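
Wrapping those steps into the encode_url / decode_url pair the question asks for might look like the sketch below. The function names come from the question. UTF-8 as the encoding is an assumption, and the import fallback is there only so the same code also runs on Python 3, where unquote_to_bytes plays the role of the byte-oriented urllib.unquote.

```python
try:
    from urllib import quote, unquote  # Python 2
except ImportError:
    # Python 3: unquote_to_bytes returns bytes, like Python 2's unquote on a str.
    from urllib.parse import quote, unquote_to_bytes as unquote


def encode_url(text):
    """Percent-encode a unicode string via its UTF-8 bytes (assumed encoding)."""
    return quote(text.encode('utf-8'))


def decode_url(quoted):
    """Reverse encode_url: unquote to bytes, then decode those bytes as UTF-8."""
    if not isinstance(quoted, bytes):
        # A percent-encoded value is pure ASCII, so this conversion is lossless.
        quoted = quoted.encode('ascii')
    return unquote(quoted).decode('utf-8')


original = u'Cata\u00f1o'
encoded = encode_url(original)
decoded = decode_url(encoded)
print(repr(encoded))
print(decoded == original)
```

Keeping the unicode handling at the edges and passing only byte strings to urllib follows the advice above. A value that was not produced by encode_url, or that was encoded with a different charset, would raise UnicodeDecodeError in decode_url rather than come back silently garbled.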
