Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


How can I create a Python script with BeautifulSoup on Windows to download the highest resolution of each picture in a Wikimedia Commons folder?

I'm a big fan of Gustave Doré, and I would like to download all his engravings from Wikimedia Commons, where they are neatly organized into folders.

So, given a Wikimedia Commons folder, I need to download every picture in it at the highest available resolution.

I started writing something, but I'm not that good, so it's just a template:

import os

import bs4
import requests

# Placeholders: fill these in before running
url = 'URL OF THE WIKIMEDIA COMMONS FOLDER'
folder = 'NAME OF THE FOLDER'
number_of_pictures = 0  # number of pictures on the page

os.makedirs(folder, exist_ok=True)
for n in range(number_of_pictures):
    print('I am downloading picture number %s...' % (n + 1))
    res = requests.get(url)
    res.raise_for_status()

    soup = bs4.BeautifulSoup(res.text, 'html.parser')

    # STUFF I STILL NEED TO ADD

print('Done')

For example, I would feed this as the URL of the folder:

Then I would like to click every link and go to the picture page, like this one:

And then download the 'original file' by clicking the link below the picture that says 'original file'. Except sometimes the pic has no higher resolution available, like in this case:

In that case, the script would just need to follow the link below the picture to download it.

I am completely stuck. Thanks in advance for your help!

Bonus points if each picture is saved under the name stated on its page (e.g. the picture in the second link should be saved as 'Astonishment of the Crusaders at the Wealth of the East.jpg').

question from:https://stackoverflow.com/questions/65923348/how-can-i-create-a-python-script-with-beautifulsoup-on-windows-to-download-the-h


1 Reply


Hey, big fan of Gustave Doré, here is a way you can do it:

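import requests
from bs4 import BeautifulSoup

r = requests.get('https://commons.wikimedia.org/wiki/Category:Crusades_by_Gustave_Dor%C3%A9')
r.raise_for_status()
soup = BeautifulSoup(r.text, 'html.parser')
# Thumbnail URLs of every picture linked from the category page
links = [a.find('img').get('src') for a in soup.find_all('a', class_='image')]
# Drop the trailing thumbnail segment and the '/thumb' part to get the original file URL
links = ['/'.join(link.split('/')[:-1]).replace('/thumb', '') for link in links]
for link in links:
    im = requests.get(link)
    # Save under the last path segment of the URL, i.e. the file name
    with open(link.split('/')[-1], 'wb') as f:
        f.write(im.content)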
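
The snippet above assumes every `src` is a thumbnail URL, and it saves files under the raw, URL-encoded name. For pictures with no higher resolution available, and for the readable file name asked for in the question, two small helpers might do. This is only a sketch: `original_url` and `local_filename` are hypothetical names, and the code assumes thumbnail URLs end in a `<width>px-<name>` segment under a `/thumb/` directory, that a URL without `/thumb/` already points at the original file, and that the last URL segment is an acceptable stand-in for the name shown on the picture page.

```python
from urllib.parse import unquote, urlsplit


def original_url(src):
    """Turn a thumbnail URL into the URL of the full-size file.

    Assumption: thumbnails look like .../thumb/a/ab/Name.jpg/120px-Name.jpg.
    A URL without a '/thumb/' segment is taken to be the original already.
    """
    if src.startswith('//'):
        # Protocol-relative URL: assume https
        src = 'https:' + src
    if '/thumb/' not in src:
        return src
    # Drop the trailing '<width>px-<name>' segment, then the 'thumb' directory
    return src.rsplit('/', 1)[0].replace('/thumb/', '/', 1)


def local_filename(url):
    """Build a readable, Windows-safe file name from a file URL."""
    # Last path segment, percent-decoded, with underscores shown as spaces
    name = unquote(urlsplit(url).path.rsplit('/', 1)[-1]).replace('_', ' ')
    # Windows does not allow these characters in file names
    return ''.join('-' if c in '<>:"/\\|?*' else c for c in name)
```

In the download loop, each thumbnail `src` could then be passed through `original_url`, and the result of `local_filename` used as the path given to `open`.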
