python - how to deal with ® in url for urllib2.urlopen?

posted Oct 17, 2021 in Technique by 深蓝 (71.8m points)

I got this URL from BeautifulSoup: https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp®-75-desktop-virtualization-solutions

url=u'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp\xae-75-desktop-virtualization-solutions'

I want to pass it back to urllib2.urlopen.

import urllib2
source = urllib2.urlopen(url).read()

The error I get:

UnicodeEncodeError: 'gbk' codec can't encode character u'\xae' in position 43: illegal multibyte sequence

Thus, I tried:

source = urllib2.urlopen(url.encode("utf-8")).read()

This returns page source, but it differs from what the original URL returns.

originalUrl = 'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp®-75-desktop-virtualization-solutions'
originalSource = urllib2.urlopen(originalUrl).read()
originalSource == source

The result is False. Any idea how to fix this URL? How do I convert u'\xae' back into the original ®?

1 Reply

深蓝 · Answer 1 · 2021-10-17T01:33:50+0000

URLs must be valid bytestrings, with non-ASCII code points encoded correctly. You'll need to encode the URL to UTF-8, then URL-quote its path:

import urllib
import urllib2
import urlparse

originalUrl = u'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp\xae-75-desktop-virtualization-solutions'
parsed_link = urlparse.urlsplit(originalUrl.encode('utf8'))
parsed_link = parsed_link._replace(path=urllib.quote(parsed_link.path))
encoded_link = parsed_link.geturl()
source = urllib2.urlopen(encoded_link).read()
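
To see why the encode step matters, here is a minimal sketch (not part of the original answer) that quotes the same character from two different byte encodings. The try/except import is an assumption added so the snippet runs on both Python 2 and Python 3.

```python
try:
    from urllib import quote        # Python 2, as in the answer above
except ImportError:
    from urllib.parse import quote  # Python 3 (assumed fallback)

registered = u'\xae'  # the registered-trademark sign from the URL

# Percent-quoting works on bytes, so the byte encoding decides the result.
utf8_quoted = quote(registered.encode('utf-8'))      # '%C2%AE'
latin1_quoted = quote(registered.encode('latin-1'))  # '%AE'

print(utf8_quoted)
print(latin1_quoted)
```

The UTF-8 form, %C2%AE, is the one the answer's approach produces. Which encoding a particular server expects is not something this snippet can verify.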

Demo:

>>> import urllib
>>> import urllib2
>>> import urlparse
>>> originalUrl = u'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp\xae-75-desktop-virtualization-solutions'
>>> parsed_link = urlparse.urlsplit(originalUrl.encode('utf8'))
>>> parsed_link = parsed_link._replace(path=urllib.quote(parsed_link.path))
>>> encoded_link = parsed_link.geturl()
>>> encoded_link
'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp%C2%AE-75-desktop-virtualization-solutions'
>>> source = urllib2.urlopen(encoded_link).read()
>>> len(source)
68758
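
The same approach can be sketched for Python 3, where the relevant functions live in urllib.parse and urllib.request. This is an illustrative port rather than part of the original answer, and the helper name quote_url_path is hypothetical.

```python
from urllib.parse import quote, urlsplit


def quote_url_path(url):
    """Percent-encode the path component of a text URL (hypothetical helper)."""
    parts = urlsplit(url)
    # quote() encodes text as UTF-8 by default and leaves "/" unescaped.
    return parts._replace(path=quote(parts.path)).geturl()


url = ('https://www.packtpub.com/virtualization-and-cloud/'
       'citrix-xenapp\xae-75-desktop-virtualization-solutions')
encoded_link = quote_url_path(url)
print(encoded_link)
```

The resulting string is plain ASCII and could then be passed to urllib.request.urlopen. Note that quote() also escapes a literal % sign, so this sketch assumes the incoming path has not already been percent-encoded.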
