3 Ways To Scrape JavaScript Generated Data

Often you will come across sites that use JavaScript to fetch or generate the data that populates their tables, and sometimes entire pages. Dealing with JavaScript can be a pain when you need to scrape, so not having to deal with it at all is the best way of dealing with it. Here are some ways you can go about doing just that.

Method 1 – Requests-html library

To achieve our aim, we will start with the requests-html library in Python. First, let's go over what this library offers.

  • Full JavaScript support
  • CSS selectors
  • XPath selectors
  • Fake user-agents
  • Automatic following of redirects
  • Connection pooling and cookie persistence
  • Async support

Despite this, it isn't perfect. There will be sites where it fails and you will have to use selenium or figure out a workaround, but for the most part it does what it is supposed to do. Install the library with pip install requests-html in your terminal or command prompt and try this code.

from requests_html import HTMLSession

session = HTMLSession()

r = session.get('https://python.org/')

# Execute the page's JavaScript (downloads Chromium on first run)
r.html.render()

# Extract the JavaScript-generated value and print it
print(r.html.search('Python 2 will retire in only {months} months!')['months'])

If you've ever used the requests library, you will notice this feels and works very similarly, which makes the transition to requests_html much easier. The last line extracts the value generated by JavaScript, captured by the {months} placeholder, and prints it out.

render() makes use of Chromium, which will be downloaded to your home directory the first time you run it.
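Beyond the search() template above, the rendered page can also be queried with CSS and XPath selectors. Here is a minimal sketch; the selectors are illustrative, not specific to python.org's current markup:

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://python.org/')
r.html.render()

# CSS selector: grab the first <h1> on the rendered page
heading = r.html.find('h1', first=True)
print(heading.text)

# XPath selector: collect every link's href attribute
hrefs = r.html.xpath('//a/@href')
print(hrefs[:5])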

Method 2 – Good ol’ selenium

If you need to scrape data generated by JavaScript, the first thing that comes to mind is probably selenium. While selenium will always remain an option, it should be your last one, for a simple reason: performance. Compared to plain web requests, selenium will always be much slower because it has to load every component of a real browser; after all, it is a real browser.

Controlling a complete browser means all the scripts and elements have to load; all you need on top of that is an implicit wait and you're ready. Try not to perform too many actions at once while using selenium, especially if you have a weaker system. Also, a heads up: selenium isn't thread-safe, so running multiple instances from one script is difficult.

Well, unless you improvise and use multiple virtual machines, or give each process its own browser, as sketched below. These solutions work but require a lot of processing power and memory.
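Here is a minimal sketch of the browser-per-process idea, assuming Firefox and geckodriver are installed; since each worker process owns its own driver, the thread-safety problem never comes up:

from multiprocessing import Pool
from selenium import webdriver

def fetch_title(url):
    # Each process gets its own browser instance
    driver = webdriver.Firefox()
    try:
        driver.get(url)
        return driver.title
    finally:
        driver.quit()

if __name__ == '__main__':
    urls = ['https://python.org/', 'https://example.com/']
    with Pool(processes=2) as pool:
        print(pool.map(fetch_title, urls))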

The time it takes to implement this solution is the lowest, since rendering JavaScript is a built-in feature of every browser. The outcome depends on what you are looking for; if you are scraping small to medium sites, this will probably be fine for you.

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.implicitly_wait(10)  # wait up to 10 seconds for elements to appear
driver.get("http://somedomain/url_that_delays_loading")
myDynamicElement = driver.find_element(By.ID, "myDynamicElement")
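If the implicit wait is not reliable enough, an explicit wait for the specific element is an alternative. A minimal sketch, using the same placeholder page:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("http://somedomain/url_that_delays_loading")

# Block until the JS-generated element actually exists, up to 10 seconds
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "myDynamicElement"))
)
print(element.text)
driver.quit()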

Method 3 – Simulate the JavaScript requests

If you simply know where to look, you can grab the data directly without ever having to deal with JavaScript. The JavaScript on a page sends requests to fetch the data it needs and then pushes it into the browser.

So, if we intercept those requests using something like Fiddler or any other web debugging proxy, we can see exactly which requests the JS sends and what logic it performs.

Now it's just a matter of finding the right requests, or reversing that logic, and we're done. This is the best option to go with whenever it is available.
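As a sketch of the first case, suppose the proxy shows the page loading its table from a JSON endpoint. The URL, headers, and parameters below are hypothetical; yours come from the captured traffic:

import requests

# Hypothetical endpoint discovered in the captured traffic
API_URL = "https://example.com/api/table-data"

headers = {"User-Agent": "Mozilla/5.0"}  # mimic the browser's request
params = {"page": 1}

resp = requests.get(API_URL, headers=headers, params=params)
resp.raise_for_status()

# The same payload the page's JavaScript would have received
for row in resp.json().get("items", []):
    print(row)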

Let's say the website example.com populates the page with data generated by a function within its JavaScript. We could rebuild this function within our Python script and call it whenever we want the data it generates, as in the sketch below.
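For instance, suppose (entirely hypothetically; the real logic comes from reading the site's JS) the script signs each request URL with an MD5 hash of the page number. A Python port might look like this:

import hashlib

def build_data_url(page):
    """Hypothetical port of example.com's JS URL-building logic."""
    sig = hashlib.md5(str(page).encode()).hexdigest()
    return f"https://example.com/data?page={page}&sig={sig}"

print(build_data_url(1))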

The time it takes to implement this solution can be high, but the outcome is the best if you are trying to scrape at scale.
