I generally love the synchronous and predictable nature of Python. No promises, callbacks or nested code. Life’s good. Elegant, and no time spent fixing async bugs. I rarely miss JavaScript’s async nature. Moreover, I’ve hardly needed to use threads in Python.
However, when using Python for network- or disk-intensive tasks, you start feeling the pain. JavaScript’s non-blocking I/O lets it overlap many requests instead of waiting on each one in turn, and you can imagine the time you’d save if you had to, say, scrape large amounts of data and write them to disk on a regular basis. Also, at times you might want threads to perform complex calculations faster.
Do keep in mind though: Python (CPython, the standard interpreter) is limited by a Global Interpreter Lock (GIL). Only one thread executes Python bytecode at a time, so threads you create don’t really execute code in a parallel fashion. The lock is released while a thread waits on the network or disk, which is why threads still help with I/O-heavy work.
Creating Threads
Having additional threads can significantly increase the complexity of your program. You will need to ensure that they don’t compete for the same resources at the same time (race conditions), keep Python’s Global Interpreter Lock in mind, ensure that certain statements execute only after all the threads are done executing their assigned tasks, and so on.
import threading
import requests

def fetch(url):
    response = requests.get(url)
    print("Got the url you asked for\n Status Code:-")
    print(response.status_code)

# args must be a tuple, hence the trailing comma.
thread = threading.Thread(target=fetch, args=('http://www.example.com',))
thread.start()
print("Exit")
In the example above, "Exit" is printed as soon as the thread is started, so it will usually appear on the console before "Got the url you asked for". We shall see how to overcome this in the Joins section.
Now let’s try creating an arbitrary number of threads and also store them so that we can do other stuff with them later on.
import threading
import requests

def fetch(url, thread_number):
    response = requests.get(url)
    print("\n".join(["Hi, I'm thread number " + str(thread_number), "Here's the status code " + str(response.status_code)]))

thread_list = []
for i in range(0, 5):
    # Keep a reference to each thread so we can use it later.
    thread_list.append(threading.Thread(target=fetch, args=('http://www.example.com', i)))
    thread_list[i].start()
Notice how we store the threads in a list. This time I’m also supplying an integer so that we can identify them. Hopefully it’ll make things easier later on.
Joins
You also need a mechanism to control the flow of code. Say you want to read a file only after a thread has made a network request and written something to that file. Jump back to the first example on the page: "Exit" was printed before the thread had finished.
Once join is called on a thread, execution of the calling code pauses until that thread is done with its task. Continuing example one.
import threading
import requests

def fetch(url):
    response = requests.get(url)
    print("Got the url you asked for\n Status Code:-")
    print(response.status_code)

thread = threading.Thread(target=fetch, args=('http://www.example.com',))
thread.start()
thread.join()  # Code after this gets executed only after the thread is done executing.
print("Exit")
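The "read the file only after a thread has written to it" case mentioned above can be sketched roughly like this. It is a minimal illustration, assuming Python 3; the write_report function, the file name and the text written are all made up for the example, and a plain file write stands in for the network request.

```python
import os
import tempfile
import threading


def write_report(path, text):
    # Stand-in for "make a network request and write the result to a file".
    with open(path, "w") as handle:
        handle.write(text)


# Hypothetical scratch file; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "report.txt")

writer = threading.Thread(target=write_report, args=(path, "status: 200"))
writer.start()
writer.join()  # Block here until write_report has returned.

# Safe to read now: the writer thread is finished.
with open(path) as handle:
    contents = handle.read()
print(contents)
```

Without the join, the read could run before the file has been written, or even before it exists.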
Comparing total time taken
Without using threads.
Time Taken - 25 seconds
import requests
import time

tic = time.time()

def fetch(url, thread_number):
    return requests.get(url)

# Each request starts only after the previous one has finished.
for i in range(0, 25):
    print(str(i) + " " + str(fetch('http://www.example.com', i).status_code))

toc = time.time()
print("Time Taken:- " + str(round(toc - tic, 2)) + " seconds")
Using threads. I’ve blindly used 25 threads. In practice the number of threads is only set after careful analysis and trial and error.
Time Taken - 6 seconds
import threading
import requests
import time

tic = time.time()

def fetch(url, thread_number):
    status_code = requests.get(url).status_code
    print(str(thread_number) + " " + str(status_code))

thread_list = []
for i in range(0, 25):
    thread_list.append(threading.Thread(target=fetch, args=('http://www.example.com', i)))
    thread_list[i].start()

# Wait for every thread before stopping the clock.
for i in range(0, 25):
    thread_list[i].join()

toc = time.time()
print("Time Taken:- " + str(round(toc - tic, 2)) + " seconds")
In the example above I use multiple join statements in a loop. Let me brief you about the flow.
- .join() is called on the first thread. So the main thread is blocked until thread one finishes execution.
- Now the interpreter tries to .join() the second thread. Let’s assume it has already finished execution. It does not block the main thread and immediately calls .join() on the third thread.
- This continues till the 25th thread.
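To make that flow visible, here is a small sketch that measures how long each join actually blocks. It assumes Python 3, and uses time.sleep as a stand-in for a slow network call; the work function and the three durations are invented for the illustration.

```python
import threading
import time


def work(seconds):
    # time.sleep stands in for a slow network call.
    time.sleep(seconds)


# Hypothetical durations: the first thread is the slowest.
durations = [0.3, 0.1, 0.2]
threads = [threading.Thread(target=work, args=(d,)) for d in durations]

start = time.time()
for thread in threads:
    thread.start()

join_waits = []
for thread in threads:
    before = time.time()
    thread.join()  # Returns at once if the thread has already finished.
    join_waits.append(time.time() - before)

total = time.time() - start
for index, waited in enumerate(join_waits):
    print("join %d blocked for %.2f seconds" % (index, waited))
print("total: %.2f seconds" % total)
```

The first join should account for nearly all of the waiting, since the other two threads finish while the main thread is still blocked on it. The total should land close to the slowest thread’s duration rather than the sum of all three.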
Try Making Them Functional
You can reduce complexity when using threads if you reduce their impact on shared state. E.g. if you create threads that access the same file, you might run into race conditions. You would have to lock and unlock the resource and so on. You could instead structure them in such a way that they read and write to their own files.
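For the cases where sharing really can’t be avoided, this is roughly what the locking and unlocking looks like. A minimal sketch, assuming Python 3: a shared counter stands in for the shared file, and the add_many function and the numbers are made up for the example.

```python
import threading

counter = 0  # Shared state, standing in for a shared file.
counter_lock = threading.Lock()


def add_many(times):
    global counter
    for _ in range(times):
        # Only one thread at a time may run the read-modify-write below.
        with counter_lock:
            counter += 1


threads = [threading.Thread(target=add_many, args=(10000,)) for _ in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(counter)
```

The with statement acquires the lock on entry and releases it on exit, even if the body raises, which is easier to get right than calling acquire() and release() by hand.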
If performing complex calculations using threads, organise them in such a way that they take in inputs and give out outputs. That’s it. By inputs and outputs I mean they shouldn’t change the value of a shared variable or write to disk or do anything that might affect something else. Think about it. By having functions that don’t change anything externally, you greatly reduce the complexity of your code and you also make testing easier.
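One way to sketch the "inputs in, outputs out" idea is shown below. Note that it swaps the raw threading.Thread objects used so far for ThreadPoolExecutor from the standard library’s concurrent.futures module (Python 3), because it hands back return values without any shared variable. The word_count function and the sample strings are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def word_count(text):
    # Pure function: depends only on its argument, changes nothing outside.
    return len(text.split())


# Hypothetical inputs standing in for downloaded pages.
pages = ["the quick brown fox", "hello world", ""]

with ThreadPoolExecutor(max_workers=3) as pool:
    # map() returns results in the same order as the inputs.
    counts = list(pool.map(word_count, pages))

print(counts)
```

Since word_count touches nothing outside itself, it can be tested with a plain function call and no threads at all.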
How Many Threads?
There’s no magic number. Also, the advice to create threads equivalent to 2 * number of CPU cores is hardly applicable and you should refrain from following it. Also be aware of the Global Interpreter Lock. I’d suggest reading the following StackOverflow answers: “How many threads is too many?” and “The right way to limit maximum number of threads”.
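Whatever number you settle on, one possible way to enforce it is a semaphore. The sketch below assumes Python 3; MAX_ACTIVE, the task function and the sleep duration are placeholders I picked for illustration, not recommendations. It still creates ten threads, but lets at most three of them do their work at the same time, and records the peak to show the cap holds.

```python
import threading
import time

MAX_ACTIVE = 3  # Hypothetical cap; tune it by measuring your own workload.
slots = threading.BoundedSemaphore(MAX_ACTIVE)
state_lock = threading.Lock()
active = 0
peak = 0


def task():
    global active, peak
    with slots:  # Waits here while MAX_ACTIVE tasks are already running.
        with state_lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.05)  # Stand-in for a network request.
        with state_lock:
            active -= 1


threads = [threading.Thread(target=task) for _ in range(10)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print("peak concurrent tasks: %d" % peak)
```

If you would rather not create the extra threads at all, a fixed-size pool such as concurrent.futures.ThreadPoolExecutor with max_workers set is another option.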