Web Scraping for Data Science: Part I

In my previous article, Life as a Data Scientist, I highlighted the fact that data science is a multidisciplinary field where you may end up playing different roles, especially in the absence of a data engineering team, and being a web scraper is one such role. I would say it’s always nice to have this skill in your armoury, as you might need it even to gather data for your own exploration.

One of the main challenges in web scraping is that different websites provide data in different formats, and there is no one-size-fits-all solution for scraping them. In this post I’ll talk about a particular method that helps download data from dynamic pages. By dynamic I mean pages where, for every record, you have to provide some input or enter a password. The method relies on a package called Selenium, which has bindings for different languages, such as RSelenium for R and Selenium for Ruby. Selenium is mostly used for automated testing in traditional web development, which makes it useful for our purpose as well.

In Python, it can be installed using pip with the command pip install selenium

Below is sample code that extracts the bond rating for a particular city. It also shows the libraries imported for the purpose.
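As a rough sketch of what a single-city lookup with Selenium can look like (this is not the original code): the URL, the search box name and the CSS selector below are hypothetical placeholders, and the function name fetch_bond_rating is made up for illustration. It assumes Chrome and a matching driver are available on the machine.

```python
# Hypothetical placeholders: swap in the real site and its element locators.
RATINGS_URL = "https://example.com/bond-ratings"
SEARCH_BOX_NAME = "city"       # assumed name attribute of the search input
RATING_SELECTOR = ".rating"    # assumed CSS selector of the result element


def fetch_bond_rating(city, timeout=10):
    """Open the page in Chrome, search for `city`, and return the rating text."""
    # Imported inside the function so the sketch can be loaded and read
    # even on a machine where Selenium is not installed yet.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # launches a visible browser window
    try:
        driver.get(RATINGS_URL)
        search_box = driver.find_element(By.NAME, SEARCH_BOX_NAME)
        search_box.clear()
        search_box.send_keys(city, Keys.RETURN)  # type the city and submit
        # Wait until the result element appears instead of sleeping blindly.
        result = WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, RATING_SELECTOR))
        )
        return result.text.strip()
    finally:
        driver.quit()  # always close the browser, even if something fails
```

Calling fetch_bond_rating with a city name opens a browser window, types the city into the search box and returns whatever text the result element contains. On a real site you would first inspect the page to find the right locators.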

Below is the list of cities for which we want to download bond ratings.
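A minimal sketch of looping over such a list might look like the following. The city names are placeholders, not the original list, and collect_ratings is a hypothetical helper: it takes any single-city lookup function (for example, one built on Selenium) so that one failing city does not stop the whole run.

```python
# Placeholder city names for illustration; not the list used in the article.
cities = ["Chicago", "Houston", "Phoenix"]


def collect_ratings(cities, lookup):
    """Call lookup(city) for every city and return a {city: rating} dict.

    A city whose lookup raises an error is recorded as None so that the
    remaining cities are still processed.
    """
    ratings = {}
    for city in cities:
        try:
            ratings[city] = lookup(city)
        except Exception as exc:  # e.g. a timeout or a missing page element
            print(f"Could not fetch rating for {city}: {exc}")
            ratings[city] = None
    return ratings
```

Keeping the loop separate from the browser code also makes it easy to try the loop with a dummy lookup function before pointing it at a real website.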

Below is a snapshot of the browser window with automated data entry during operation.

Browser window with automated data entry

The following is a snapshot of the data that was written to the target file foo.txt.
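For illustration, writing the collected ratings to foo.txt could be sketched as below. The tab-separated layout and the "N/A" marker for missing ratings are assumptions, since the original file format is not shown, and write_ratings is a hypothetical helper name.

```python
import csv


def write_ratings(ratings, path="foo.txt"):
    """Write a {city: rating} dict to `path`, one tab-separated row per city."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for city, rating in ratings.items():
            # Assumed convention: mark cities without a rating as "N/A".
            writer.writerow([city, rating if rating is not None else "N/A"])
```

Using the csv module rather than plain string concatenation means that a city name or rating containing a tab or a quote is still written in a form that can be read back reliably.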

In a nutshell, this process drives a real browser interactively and can be left running in the background. However, because it is interactive in nature, it is slower than other web scraping methods (these will be explained in Part II).
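One common way to keep the browser out of sight while the script runs is Chrome's headless mode. The sketch below assumes Chrome is the browser in use. The window size is an arbitrary choice and make_headless_driver is a hypothetical helper name.

```python
# Chrome command-line switches: run without a visible window, with a fixed
# viewport size (the size itself is an arbitrary choice for this sketch).
HEADLESS_ARGS = ("--headless", "--window-size=1920,1080")


def make_headless_driver():
    """Return a Chrome WebDriver that runs without opening a browser window."""
    # Imported inside the function so the sketch can be loaded and read
    # even on a machine where Selenium is not installed yet.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in HEADLESS_ARGS:
        options.add_argument(arg)
    return webdriver.Chrome(options=options)
```

The returned driver is used like a regular one (get, find_element, quit). Headless mode hides the window, but the page is still loaded and rendered, so the speed limitation mentioned above still applies.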