Python Asyncio tutorial

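Note: This tutorial requires Python 3.5.3 or later. Recent Python versions warn about, and the newest ones reject, calling asyncio.get_event_loop() when no event loop has been set up yet; if you hit that, create one with asyncio.new_event_loop() and register it with asyncio.set_event_loop() instead.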

Python has always been known as a slow language with poor concurrency primitives. Yet no one can deny that Python is one of the most well-thought-out languages. Prototyping in Python is way faster than in most languages, and the rich community and libraries that have grown around it make it a joy to use.

Recently I had to write a web scraper for a side project. The scraper uses the Hacker News Firebase API to collect all comments. Unfortunately the Hacker News API requires us to make a query for each individual comment. Put simply, this sucks: items on the front page can accumulate up to 500 comments. That means up to 500 network requests per story, and there are hundreds of stories in the queue that need to be analyzed. Doing this on a single thread is slow. A friend of mine implemented a multithreaded version with a 2x speedup, but I felt things could be faster. The next morning I ported my old code to use Python’s asyncio primitives and, lo and behold, I managed to cut processing time from 10 minutes down to 35 seconds. Fantastic, isn’t it?

Sadly, when I went hunting on the internet I found very little beginner-friendly documentation about asyncio (though IMHO the official docs are quite good). So here it is.

What is asynchronous IO and asyncio?

The concept behind asynchronous IO is that certain operations, such as making network calls, reading or writing files on disk, or making database queries, are slow. When your program makes such a request in the usual (synchronous) way, the whole program stops until the operation is completed.

import requests
def do_something(response):
    ...
r = requests.get('https://www.example.com/')
do_something(r)

In this code do_something will only be called after example.com has loaded. So if it takes 200ms to load example.com, your code will be stalled for 200ms. This is not a problem if your code is doing nothing else, or if it only depends on the result of that one request. But what if you, like me, have to handle thousands of such requests? At 200ms each, it would take your code 200s to process a thousand such pages.

One way to go about it would be to use multiple threads, each thread handling a specific web page. This typically leads to single-digit-factor speedups. Python, however, is not a very threading-friendly language. Starting a new thread has a fair amount of overhead, and with enough threads it may even lead to an overall slowdown instead of an improvement. To make matters worse, Python has something called the GIL (global interpreter lock), which lets only one thread execute Python code at a time and so limits how much your code gains from multithreading.

What asynchronous IO does, however, is even smarter. It lets you use a single thread. When a time-consuming operation (AKA a blocking operation) is made, the interpreter starts the operation and then skips over to a part of your code that doesn’t rely on the response from that operation. When the operation is complete, it comes back and executes the rest of your code.
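To make the arithmetic concrete, here is a small sketch that fakes the slow request with time.sleep (a stand-in, not a real network call) and compares one-at-a-time processing with a small thread pool. The fake_request helper, the page names, the 50ms delay and the pool size of four are all made up for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_request(url):
    # Hypothetical stand-in for a blocking network call.
    time.sleep(0.05)
    return url

urls = ["page-%d" % i for i in range(8)]

# One request after another: total time is roughly 8 x 50ms.
start = time.perf_counter()
sequential = [fake_request(u) for u in urls]
sequential_time = time.perf_counter() - start

# Four worker threads: the sleeps overlap, so the batch finishes sooner.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(fake_request, urls))
threaded_time = time.perf_counter() - start

print("sequential: %.2fs, threaded: %.2fs" % (sequential_time, threaded_time))
```

The exact timings will vary from machine to machine, but the sequential run should take noticeably longer than the threaded one. The asyncio approach described in the rest of this tutorial gets a similar overlap without extra threads.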

OK you’ve ranted enough, just show me how to code it!

In this tutorial I will be using the aiohttp library so you might want to install it if you have not already.

pip install aiohttp

Now we are ready to proceed. Let’s make an asynchronous request:

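import aiohttp
import asyncio

async def get_page(url):
    # One session per call keeps the example short; real code should share one.
    async with aiohttp.ClientSession() as session:
        response = await session.get(url)  # get_page pauses here until the server answers
        await response.read()  # download the body before the session closes
    print("completed request to %s" % url)
    return response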

Notice the two new keywords, async and await? These make a function asynchronous, meaning that once the interpreter hits a line with await, it suspends the function and carries on with something else until session.get() is completed. Internally, when you call get_page("https://www.example.com"), none of the function body runs yet. The call returns what is known as a coroutine object. A coroutine is similar to a Python generator: rather than executing the whole function, it keeps track of the state at which it is currently waiting and how to proceed when the awaited operation completes. A coroutine is awaitable, and asyncio wraps it in a future (a Task) in order to run it, which is why the rest of this tutorial loosely calls these objects futures. You can only use await inside an async function definition.

But wait you haven’t shown me how to call it…

Enter Event Loops. 

An event loop is essentially a loop that keeps waiting for events, as its name implies. For instance, when the computer receives a response from the server, the event loop is notified and calls the appropriate event handler. In things like GUI toolkits, the main function is a giant event loop that handles all sorts of events coming in from the keyboard and mouse. To get our asynchronous function to do our bidding we need an event loop of our own.

loop = asyncio.get_event_loop()
future = get_page("https://www.example.com")  # aiohttp needs the full URL, scheme included
loop.run_until_complete(future)

Now we can execute the code and see we have successfully made a network request.  loop.run_until_complete() will keep running until the request is complete. Your code up to now should look something like this:

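import aiohttp
import asyncio

async def get_page(url):
    # One session per call keeps the example short; real code should share one.
    async with aiohttp.ClientSession() as session:
        response = await session.get(url)  # get_page pauses here until the server answers
        await response.read()  # download the body before the session closes
    print("completed request to %s" % url)
    return response

loop = asyncio.get_event_loop()
future = get_page("https://www.example.com")  # nothing has run yet
loop.run_until_complete(future)  # runs get_page until the request is done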
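If you want to convince yourself that calling an async function does not run it, here is a network-free sketch. The greet function and the calls list are made up for illustration, and the loop is created with asyncio.new_event_loop() so the sketch also runs on recent Python versions.

```python
import asyncio
import inspect

calls = []  # records when the function body actually runs

async def greet(name):
    calls.append(name)
    await asyncio.sleep(0)  # a trivial await: hands control back to the loop once
    return "hello " + name

coro = greet("world")  # creates a coroutine object, runs nothing
is_coroutine = inspect.iscoroutine(coro)
calls_before_run = list(calls)  # still empty at this point

loop = asyncio.new_event_loop()
result = loop.run_until_complete(coro)  # now the body runs to completion
loop.close()

print(is_coroutine, calls_before_run, result)
```

The body of greet only executes once the event loop is asked to run the coroutine, which is exactly what loop.run_until_complete() does for get_page.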

It doesn’t seem any faster than my requests example!

Making multiple requests in parallel

So remember how I said this can lead to a ridiculous increase in speed? Well, it only does so when you are doing multiple things at the same time. Now let me introduce a simple function: asyncio.gather(*futures). What asyncio.gather(*futures) does is bundle multiple futures (or coroutines) into a single future, allowing you to wait on them all at once. Let’s modify our code like so:

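import aiohttp
import asyncio

async def get_page(url):
    # One session per call keeps the example short; real code should share one.
    async with aiohttp.ClientSession() as session:
        response = await session.get(url)  # get_page pauses here until the server answers
        await response.read()  # download the body before the session closes
    print("completed request to %s" % url)
    return response

loop = asyncio.get_event_loop()
future1 = get_page("https://www.example.com")
future2 = get_page("https://www.example.org")
# gather() bundles both into one future that finishes when both are done
loop.run_until_complete(asyncio.gather(future1, future2))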

Now run this code. Notice that both futures start at around the same time, but which one finishes first is not guaranteed. This is because future1 runs concurrently with future2: they share a single thread and take turns whenever one of them is waiting on the network.
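The overlap is easy to measure without touching the network. The sketch below swaps the real request for asyncio.sleep; the fake_request and fetch_all names and the delays are made up for illustration, and the loop is created with asyncio.new_event_loop() so the sketch also runs on recent Python versions.

```python
import asyncio
import time

async def fake_request(name, delay):
    # Hypothetical stand-in for a network call that takes `delay` seconds.
    await asyncio.sleep(delay)
    return name

async def fetch_all():
    # Both fake requests are in flight at the same time.
    return await asyncio.gather(
        fake_request("slow", 0.3),
        fake_request("fast", 0.2),
    )

loop = asyncio.new_event_loop()
start = time.perf_counter()
results = loop.run_until_complete(fetch_all())
elapsed = time.perf_counter() - start
loop.close()

print(results, "in %.2fs" % elapsed)
```

The total time should be close to the longest single delay rather than the sum of both. Also note that asyncio.gather() returns the results in the order the awaitables were passed in, regardless of which one finished first.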

What if I want to modify the list of futures running in parallel?

You can. Let’s say you are building a web crawler like Google’s and you want to be able to add newly discovered pages to a breadth-first search. This can be done using the asyncio.wait() function.

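import aiohttp
import asyncio

async def get_page(url):
    # One session per call keeps the example short; real code should share one.
    async with aiohttp.ClientSession() as session:
        response = await session.get(url)  # get_page pauses here until the server answers
        await response.read()  # download the body before the session closes
    print("completed request to %s" % url)
    return response

loop = asyncio.get_event_loop()
# asyncio.wait() takes a collection of tasks, so wrap each coroutine in one
future1 = loop.create_task(get_page("https://www.example.com"))
future2 = loop.create_task(get_page("https://www.example.org"))
old_timers = [future1, future2]
completed, incomplete = loop.run_until_complete(
    asyncio.wait(old_timers, return_when=asyncio.FIRST_COMPLETED))
for task in completed:
    print(task.result())
new_comer = loop.create_task(get_page("https://www.google.com"))
# incomplete is a set of tasks still in flight; unpack it next to the new task
loop.run_until_complete(asyncio.gather(new_comer, *incomplete))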

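Here you can see a much more complex example. asyncio.wait() hands back two sets, the tasks that have completed and the ones still pending, and between runs of the event loop we inject a new future alongside the pending ones. Notice that to retrieve the result of a completed task you must call its .result() method.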
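The same idea extends to the breadth-first crawler mentioned earlier: keep a set of pending tasks, wait for whichever finishes first, and schedule new tasks for any links it turns up. The sketch below is one way this could look under simplifying assumptions. The LINKS graph, fake_fetch and crawl are made up for illustration, with asyncio.sleep standing in for the real request.

```python
import asyncio

# Hypothetical link graph standing in for real web pages.
LINKS = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}

async def fake_fetch(page):
    # Stand-in for get_page(): returns the page and the links found on it.
    await asyncio.sleep(0.01)
    return page, LINKS[page]

async def crawl(start):
    seen = {start}
    pending = {asyncio.ensure_future(fake_fetch(start))}
    visited = []
    while pending:
        # Resume as soon as at least one fetch has finished.
        done, pending = await asyncio.wait(
            pending, return_when=asyncio.FIRST_COMPLETED)
        for task in done:
            page, links = task.result()
            visited.append(page)
            for link in links:
                if link not in seen:
                    seen.add(link)
                    # Inject a new task next to the ones still in flight.
                    pending.add(asyncio.ensure_future(fake_fetch(link)))
    return visited

loop = asyncio.new_event_loop()
visited = loop.run_until_complete(crawl("a"))
loop.close()
print(visited)
```

Because the waiting happens inside a single coroutine, the event loop never has to be stopped and restarted to add work. The order in which "b" and "c" show up may differ between runs.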

Where to go from here:
