Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
735 views
in Technique[技术] by (71.8m points)

html - Python follow redirects and then download the page?

I have the following python script and it works beautifully.

import urllib2

url = 'http://abc.com' # write the url here

usock = urllib2.urlopen(url)
data = usock.read()
usock.close()

print data

however, some of the URL's I give it may redirect it 2 or more times. How can I have python wait for redirects to complete before loading the data. For instance when using the above code with

http://www.google.com/search?hl=en&q=KEYWORD&btnI=1

which is the equvilant of hitting the im lucky button on a google search, I get:

>>> url = 'http://www.google.com/search?hl=en&q=KEYWORD&btnI=1'
>>> usick = urllib2.urlopen(url)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 400, in open
    response = meth(req, response)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 513, in http_response
    'http', request, response, code, msg, hdrs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 438, in error
    return self._call_chain(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 372, in _call_chain
    result = func(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 521, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
>>> 

Ive tried the (url, data, timeout) however, I am unsure what to put there.

EDIT: I actually found out if I dont redirect and just used the header of the first link, I can grab the location of the next redirect and use that as my final link

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

You might be better off with Requests library which has better APIs for controlling redirect handling:

https://requests.readthedocs.io/en/master/user/quickstart/#redirection-and-history

Requests:

https://pypi.org/project/requests/ (urllib replacement for humans)


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...