
0 votes
738 views
in Technique by (71.8m points)

403 Forbidden error when crawling a website using Python requests on Heroku

I have written a crawler script that sends a POST request to "sci-hub.do", and I have set it up to run on Heroku. But when it tries to send a POST or GET request, it mostly gets a 403 Forbidden response.

The strange thing is that this only happens when the script runs on Heroku; when I run it on my PC, everything works and I get a 200 status code.

I have tried using a session, but it did not work. I also checked the website's robots.txt and set the User-Agent header to "Twitterbot/1.0", but it still failed.
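Roughly, the request looks like this (a simplified sketch; the form field name and payload value here are placeholders, not the exact data my script sends):

import requests

# Sketch of the failing request: POST with the Twitterbot User-Agent.
headers = {"User-Agent": "Twitterbot/1.0"}
response = requests.post(
    "https://sci-hub.do/",
    data={"request": "10.1000/example-doi"},  # placeholder payload
    headers=headers,
)
print(response.status_code)  # 200 locally, but usually 403 on Heroku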

What am I doing wrong? Why does it only happen when the script runs on Heroku?

I'm pretty sure the web server is detecting my script as a crawler bot and trying to block it. But why, even after adding a proper "User-Agent"?

question from: https://stackoverflow.com/questions/65925003/403-forbidden-error-when-crawling-a-website-using-python-requests-on-heroku


1 Reply

0 votes
by (71.8m points)

Try adding a typical browser User-Agent, such as:

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36
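For example, with the requests library you could pass that string in a headers dict (a minimal sketch; the target URL is just a placeholder for the site you are crawling):

import requests

# Minimal sketch: send the request with a browser-like User-Agent header.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36"
    )
}
response = requests.get("https://sci-hub.do/", headers=headers)
print(response.status_code)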

You may also need to use a random User-Agent for every request. In that case, you can install and use https://pypi.org/project/fake-useragent/
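A sketch of how that could look together with requests (assuming the package is installed with pip install fake-useragent; the URL is a placeholder):

import requests
from fake_useragent import UserAgent

ua = UserAgent()

# Pick a different randomly chosen User-Agent for each request.
headers = {"User-Agent": ua.random}
response = requests.get("https://sci-hub.do/", headers=headers)
print(response.status_code, headers["User-Agent"])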



...