python - asyncio web scraping 101: fetching multiple urls with aiohttp

Question

Welcome To Ask or Share your Answers For Others

python - asyncio web scraping 101: fetching multiple urls with aiohttp

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

python - asyncio web scraping 101: fetching multiple urls with aiohttp

In earlier question, one of authors of aiohttp kindly suggested way to fetch multiple urls with aiohttp using the new async with syntax from Python 3.5:

import aiohttp
import asyncio

async def fetch(session, url):
    with aiohttp.Timeout(10):
        async with session.get(url) as response:
            return await response.text()

async def fetch_all(session, urls, loop):
    results = await asyncio.wait([loop.create_task(fetch(session, url))
                                  for url in urls])
    return results

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    # breaks because of the first url
    urls = ['http://SDFKHSKHGKLHSKLJHGSDFKSJH.com',
            'http://google.com',
            'http://twitter.com']
    with aiohttp.ClientSession(loop=loop) as session:
        the_results = loop.run_until_complete(
            fetch_all(session, urls, loop))
        # do something with the the_results

However when one of the session.get(url) requests breaks (as above because of http://SDFKHSKHGKLHSKLJHGSDFKSJH.com) the error is not handled and the whole thing breaks.

I looked for ways to insert tests about the result of session.get(url), for instance looking for places for a try ... except ..., or for a if response.status != 200: but I am just not understanding how to work with async with, await and the various objects.

Since async with is still very new there are not many examples. It would be very helpful to many people if an asyncio wizard could show how to do this. After all one of the first things most people will want to test with asyncio is getting multiple resources concurrently.

Goal

The goal is that we can inspect the_results and quickly see either:

this url failed (and why: status code, maybe exception name), or
this url worked, and here is a useful response object

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-23T17:45:29+0000

I would use gather instead of wait, which can return exceptions as objects, without raising them. Then you can check each result, if it is instance of some exception.

import aiohttp
import asyncio

async def fetch(session, url):
    with aiohttp.Timeout(10):
        async with session.get(url) as response:
            return await response.text()

async def fetch_all(session, urls, loop):
    results = await asyncio.gather(
        *[fetch(session, url) for url in urls],
        return_exceptions=True  # default is false, that would raise
    )

    # for testing purposes only
    # gather returns results in the order of coros
    for idx, url in enumerate(urls):
        print('{}: {}'.format(url, 'ERR' if isinstance(results[idx], Exception) else 'OK'))
    return results

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    # breaks because of the first url
    urls = [
        'http://SDFKHSKHGKLHSKLJHGSDFKSJH.com',
        'http://google.com',
        'http://twitter.com']
    with aiohttp.ClientSession(loop=loop) as session:
        the_results = loop.run_until_complete(
            fetch_all(session, urls, loop))

Tests:

$python test.py 
http://SDFKHSKHGKLHSKLJHGSDFKSJH.com: ERR
http://google.com: OK
http://twitter.com: OK

Categories

python - asyncio web scraping 101: fetching multiple urls with aiohttp

python - asyncio web scraping 101: fetching multiple urls with aiohttp

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags