How To Scrape Data Using Multiple Threads?

For this post we will use Python, but you can replicate it in any language that has a library similar to requests. If you are used to scraping with Selenium, note that a WebDriver instance isn't thread-safe. You technically can share one across threads, but it wasn't designed to be multithreaded and things will quickly become a mess.

Quick refresher

Now, with that out of the way, let's recap the basics of Python requests. If you run into problems or want more detailed information, always head over to the official docs maintained by the library's creators.

import requests
from bs4 import BeautifulSoup

scrapeWebsite = "https://abcnews.go.com/"

# Download the page once, then parse the HTML with BeautifulSoup
r = requests.get(scrapeWebsite)
h1Headings = []

soup = BeautifulSoup(r.text, features='html.parser')
headings = soup.find_all('h1')
for heading in headings:
    # Skip empty tags and strip stray newlines from the text
    if heading.text != "":
        h1Headings.append(heading.text.replace("\n", ""))

We will use BeautifulSoup, one of the best libraries out there for web scraping and easy to get up and running. In this script we scrape abcnews just as an example. If you run it, it collects all the h1 headings present on the page as it loaded. However, if you add some iteration, navigating into the links and fetching headings from every article, it can take quite some time depending on the depth of the iteration.

Now if you have hundreds of sites you want to fetch data from, doing one request at a time is excruciatingly slow and inefficient, because the script spends most of its time waiting on the network. Even with a fast connection, scraping the web iteratively at any real depth will be slow.

In this post we use the threading library to create multiple threads and increase the scraping speed significantly.

Multi-Threading

import requests
import threading
from bs4 import BeautifulSoup

# Shared list; in CPython, list.append is atomic, so the threads
# can safely append to it without an explicit lock
h1Headings = []

def scrapeHeadings(website):
    r = requests.get(website)

    soup = BeautifulSoup(r.text, features='html.parser')
    headings = soup.find_all('h1')
    for heading in headings:
        if heading.text != "":
            h1Headings.append(heading.text.replace("\n", ""))

# One thread per site; the trailing comma makes args a one-element tuple
thread1 = threading.Thread(target=scrapeHeadings, args=("https://abcnews.go.com/",))
thread2 = threading.Thread(target=scrapeHeadings, args=("https://www.aljazeera.com/",))

thread1.start()
thread2.start()

# Wait for both threads to finish before the program exits
thread1.join()
thread2.join()

With two threads our little script now runs much faster. From here you can do many things. You could write a function that takes a list of links and starts a thread for each one. You could also limit the number of threads being created. If you have a fixed amount of data to scrape, this will work perfectly fine.
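As a sketch of that idea, Python's built-in concurrent.futures module gives you a thread pool that caps the number of worker threads for you; the site list and the max_workers value here are just example choices:

```python
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

def scrapeHeadings(website):
    r = requests.get(website)
    soup = BeautifulSoup(r.text, features='html.parser')
    # Return the headings instead of mutating shared state;
    # the executor hands each return value back to the caller
    return [heading.text.replace("\n", "")
            for heading in soup.find_all('h1')
            if heading.text != ""]

websites = [
    "https://abcnews.go.com/",
    "https://www.aljazeera.com/",
]

# max_workers caps how many threads run at once, no matter
# how long the list of websites gets
with ThreadPoolExecutor(max_workers=5) as executor:
    results = executor.map(scrapeHeadings, websites)

# Flatten the per-site lists into one list of headings
h1Headings = [h for headings in results for h in headings]
```

Because the worker function returns its results rather than appending to a global, this version also avoids shared state entirely.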

Multi-Processing

import requests
import multiprocessing
from bs4 import BeautifulSoup

h1Headings = []

def scrapeHeadings(website):
    r = requests.get(website)

    soup = BeautifulSoup(r.text, features='html.parser')
    headings = soup.find_all('h1')
    for heading in headings:
        if heading.text != "":
            # Each process has its own copy of h1Headings, so appending
            # to it would not be visible in the parent; print instead
            print(heading.text.replace("\n", ""))

# The __main__ guard is required on platforms that spawn a fresh
# interpreter per process (e.g. Windows), otherwise each child
# would re-execute this module and spawn processes of its own
if __name__ == '__main__':
    process1 = multiprocessing.Process(target=scrapeHeadings, args=("https://abcnews.go.com/",))
    process2 = multiprocessing.Process(target=scrapeHeadings, args=("https://www.aljazeera.com/",))

    process1.start()
    process2.start()

    # Join the spawned processes back to the main process
    process1.join()
    process2.join()

You might have noticed that, apart from the imports, the __main__ guard, and the final print, there isn't much of a difference semantically. The main distinction is that threads share one memory space and are lighter on memory, while each process gets its own memory space and its own interpreter, so multiprocessing can spread work across all CPU cores, at the cost of a larger amount of memory being used.

At the end of the program, joining the processes we spawned back to the main process is crucial.

When should you use multiprocessing over threading?

For CPU-bound work, multiprocessing will almost always be faster than threading, because each process has its own interpreter and is not limited by the GIL. Unless you are spawning thousands of workers, multiprocessing can handle much more load. In reality, though, since processes have separate memory, sharing objects between them becomes very difficult. That is why, instead of appending our data to h1Headings, we had to print the output. There are several ways to extract data from processes, such as exporting it to files, passing it through a multiprocessing.Queue, or using a pool that returns results.

That said, scraping a small-to-medium-sized website is mostly waiting on the network rather than the CPU, so in that case you will almost always be fine with multithreading over multiprocessing.
