Web scraping for Data Science: Part I

In my previous article, Life as a Data Scientist, I highlighted the fact that data science is a multidisciplinary field where you may end up playing different roles, especially in the absence of a data engineering team, and being a web scraper is one such role. I would say it’s always nice to have this skill in your armoury, as you might need it even when exploring data for your own purposes.

One of the main challenges in web scraping is that different websites provide data in different formats, and there is no one-size-fits-all solution for scraping them. In this post I’ll talk about a particular tool that helps download data from dynamic pages. By dynamic I mean pages where you have to give some input for every record, or have to enter a password. The tool is a package called Selenium, which has APIs available for different languages, such as RSelenium in R and Selenium for Ruby. It is mostly used for testing in traditional web development, and the same browser automation is useful for our purpose as well.

It can be installed using pip with the command pip install selenium.

Below is sample code that extracts the bond rating for a particular city. This code shows the libraries imported for the purpose.

Below is the list of cities for which we want to download bond ratings.

Below is a snapshot of the browser window with automated data entry during operation.

Browser window with automated data entry

Following is a snapshot of the data that was written to the target file foo.txt.
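As a rough illustration of the flow described above (open the page, type each city into a form, read the result and write it to foo.txt), here is a minimal Python sketch. It is not the original listing: the URL, the form field name city and the element id rating are hypothetical placeholders, and a Firefox driver is assumed to be installed. The locator strings "name" and "id" are the values behind Selenium's By.NAME and By.ID.

```python
# Sketch only. SEARCH_URL, the field name "city" and the element id "rating"
# are hypothetical placeholders; adapt them to the site you are scraping.
SEARCH_URL = "https://example.com/bond-ratings"


def fetch_rating(driver, city):
    """Type a city into the search form and return the rating text shown."""
    driver.get(SEARCH_URL)
    box = driver.find_element("name", "city")  # equivalent to By.NAME
    box.clear()
    box.send_keys(city)
    box.submit()
    return driver.find_element("id", "rating").text  # equivalent to By.ID


def scrape(driver, cities, out_path="foo.txt"):
    """Write one tab-separated 'city, rating' line per city to out_path."""
    with open(out_path, "w") as out:
        for city in cities:
            out.write(f"{city}\t{fetch_rating(driver, city)}\n")


def main():
    # Imported here so the helpers above can be exercised without a browser.
    from selenium import webdriver

    driver = webdriver.Firefox()
    driver.implicitly_wait(10)  # wait up to 10 s for elements to appear
    try:
        scrape(driver, ["Boston", "Chicago"])  # example city list
    finally:
        driver.quit()
```

Calling main() opens a browser window and fills in the form for each city in turn; keeping the driver as a parameter makes it easy to swap in another browser.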

In a nutshell, this process drives a real browser interactively, and it can be left running in the background.
However, because it is interactive, it is slower than other web scraping methods
(to be explained in Part II).

How to adjust the screen size of an Ubuntu guest in vmplayer

I’ve been an Ubuntu fan for a long time, and in the early days the desire to use both Windows and Ubuntu on a single PC/laptop led me to vmplayer.
I started by installing Ubuntu 14.04 as the guest OS inside vmplayer, keeping Windows 7 as the host. This led to some nagging problems, such as sharing folders between host and guest, network sharing, and the guest not detecting the host screen size. Eventually I resolved the other problems, but the screen size issue persisted. After some research I found a way to fix it without using any additional software. Below are the steps.

Following is the first command (xrandr, run with no arguments) and its output.

You can then see which output is connected; here it is Virtual1. Next, pick a suitable resolution for your screen. I chose the following configuration.

E.g.: cvt <horizontal length> <vertical length> <refresh rate>

The output is

Then copy the text after “Modeline” and paste it into the following command after “xrandr --newmode”, as in the example below.

Then take the mode name (the part in quotes) and use it in the next command, xrandr --addmode, together with the name of the connected output. For me the command is

The next command, xrandr --output with --mode, applies the new mode in Ubuntu. For me it is

If the last command doesn’t work for you, go back to the Ubuntu display settings GUI and choose the display mode you have just added. For me the option was 1904 x 1070 (16:9).

If you are happy with the result and want to make the resolution permanent, write the last three commands, starting from the xrandr --newmode command, into your .xprofile file using the following command. Then save, exit and restart your Ubuntu VM, and you will get your desired resolution permanently.
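The three xrandr commands all derive from the Modeline that cvt prints, so they can be generated mechanically. The Python helper below is a sketch of that idea, not part of the original steps: the function name xrandr_commands is made up for illustration, and the sample Modeline is the commonly seen cvt output for 1920x1080 at 60 Hz rather than the 1904 x 1070 mode used above.

```python
def xrandr_commands(cvt_output, output_name):
    """Build the newmode/addmode/output commands from cvt's Modeline."""
    for line in cvt_output.splitlines():
        parts = line.split()
        if parts and parts[0] == "Modeline":
            mode_name = parts[1]             # quoted name, e.g. "1920x1080_60.00"
            mode_def = " ".join(parts[1:])   # name followed by the timings
            return [
                f"xrandr --newmode {mode_def}",
                f"xrandr --addmode {output_name} {mode_name}",
                f"xrandr --output {output_name} --mode {mode_name}",
            ]
    raise ValueError("no Modeline found in cvt output")


# Sample input: typical output of `cvt 1920 1080 60` (an assumed example).
SAMPLE_CVT = (
    "# 1920x1080 59.96 Hz (CVT 2.07M9) hsync: 67.16 kHz; pclk: 173.00 MHz\n"
    'Modeline "1920x1080_60.00"  173.00  1920 2048 2248 2576  '
    "1080 1083 1088 1120 -hsync +vsync\n"
)

for command in xrandr_commands(SAMPLE_CVT, "Virtual1"):
    print(command)
```

The printed lines can be pasted into a terminal or into .xprofile; replace Virtual1 with whichever output xrandr reports as connected.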

Life as a Data Scientist!

Well, I keep getting emails and LinkedIn messages asking what I do in my daily life as a data scientist. Being a lazy writer, I avoid answering in detail, although I know there is actually a lot to tell.

Is it something like this? Not really, but yes, at times!

Is it something like this? Well, quite likely! Just like the Avengers, you may end up becoming a superhero in many different fields, or, just like an actor, you get to live the lives of different characters. You can lead the life of a hacker, a journalist, a scientist, a business analyst, a developer, a miner, a purchase manager, an artist and maybe even a celebrity.

Let me explain why it is so. Read any internet article or white paper and it will tell you that data science is about 70-80% data manipulation and 20-30% machine learning. So it is obvious that you will be doing data munging and machine learning (model building) a lot. I will try to tell you what those articles don’t say.

Data munging depends entirely on where the data comes from. If your data source is external (such as a third-party website for which your company isn’t ready to spend a dime), you have to know web scraping. Web scraping can be of different types. In some cases the website owner will be kind enough to make the data available through a plain HTTP GET request. Others may not be that generous, and those are the cases where you may have to use Python packages like suds (which may break for lack of maintenance and resurface as a different GitHub fork by a fan, but that’s another story). The website might serve the data interactively, in which case packages like Selenium will be your saving grace. Different web APIs and text extraction packages will also be useful at times, depending on the nature of the website. So you really have to be a hacker (white-hat, obviously) at heart for this!

Yes, you get to lead the life of a journo as well. Most analytics projects nowadays involve interviewing people who have done related work. So don’t be astonished if you have to arrange such interviews with eminent professors or researchers, and literally take notes during those hour-long sessions.

You will lead the life of a researcher quite often. This can happen in different situations: when you realize that a minor improvement on a recent research paper on noise removal could be useful in your project, when you come up with a novel NLP approach to text categorization, when you devise a feature selection mechanism that may only be useful for your particular project, or when you create a brand new award-winning classifier/regressor like xgboost. So yes, the Scientist tag in Data Scientist is there for a reason. Although it varies from project to project, 5-15% of your total time should be a fair estimate.

Unfortunately, a large chunk of your time on the job will be spent sitting with your clients and fellow business analysts, virtually playing the role of another business analyst to understand their requirements. They will often have the notion that a data scientist is a superman fortune teller who shouldn’t even need historical data to predict future results. Often you may end up trying really hard to convince a fellow business analyst that he is not actually a data scientist and that it’s your job instead. A strong coffee can be a real help during those long meetings in this role play!

Well, you have to become a heck of a programmer to deal with different types of data at different granularities and put them into a 2D format on which machine learning algorithms (classification, regression or clustering) can actually work. If you aren’t from a computer science background, you’ll generally start with R for its ease of use, but gradually you will make friends with Python for various reasons. After a few months with these languages in your armoury, just when you start feeling safe and secure, one fine morning a colleague will tell you that you must learn Scala for some other work. To irritate you further, some of your friends will tell you about faster languages on the horizon, such as Julia, Go and F#. You may feel clueless at that point, but I’ll leave you to deal with your haplessness yourself. Probably that is when you’ll realize that, just like a pretty woman, a beautiful programming language also comes into your life with her own imperfections.

As if that were not all, big data, with its vastness, gives you the feeling of being a data miner. You keep yourself, your boss and your IT guys happy as long as you stay content with traditional big data mining tools such as Hive and Pig. To make matters worse, however, that colleague and some of your friends will keep telling you how awesome the latest Apache project is and how fast it will disrupt the existing one. You’ll be happy when you learn Apache Spark for its speed, but months later it’ll give you a headache when you learn that you have to unlearn it for another awesome technology. Eventually you’ll forward this demand to your IT setup guys to make their lives equally challenging.

Yes, you get to play the role of a purchase manager as well, when you have to attend regular demo sessions arranged by your company, where salespeople from different big data/machine learning product companies try to impress you with the awesomeness of their tools. It’s a different story altogether when, at the end of the day, you realize that you are doing more bug reporting than actually using the tool successfully.

Needless to say, the hidden artist inside you, who used to draw crappy paintings in school, or even worse-looking likenesses of his teenage girlfriend, will finally get to use his artistic sense in a real commercial setting. Your boss will keep pushing you to create state-of-the-art visualizations using Tableau, Trifacta, Oracle BDD or D3, only to cause havoc and confusion inside you: “What is expected of me? Am I a programmer, a data expert or an artist?!”

Considering all this, the best part of being a data scientist is that you often get treated like a celebrity when you log in to your LinkedIn profile. The sheer number of messages from recruiters trying to pull/place you in a different company will definitely make you feel like one. On top of that, you might end up getting invitations from data science conferences to grace them with your presence as a speaker! So, a typical celeb life after all.

In the end, it would be unfair not to mention the wannabe data scientists and their numerous questions, which inspire you to write blogs like this one, in the hope of being relieved once and for all from answering questions that require detailed answers.

String distance based fuzzy matching in R

Approximate string matching (aka fuzzy string searching) is the technique of finding strings that match a pattern approximately (rather than exactly). The problem is typically divided into two sub-problems: finding approximate substring matches inside a given string, and finding dictionary strings that match the pattern approximately. Here I’ll show an implementation of the second approach.

In data science we often encounter data sets that come not from a standard database but from a file. It is worse if the file was typed by hand, which is what we often see in the insurance industry. There are situations when we have no option other than joining this kind of data set with our existing data set, and not on a numeric field but on a text field, for example joining on company name.


The non-standard data set looks quite similar to this example. I use R for this kind of data standardization, and in my experience I often don’t have much control over the agrep function in the R base package, which works with Levenshtein edit distance. It has one tuning parameter, max (short for max.distance): a value of 1 looks for very close matches, whereas a value of 3 brings in many unrelated values as matches. For example, with max=3 it can pick up ABC Company as a match for ABC Corp, but it will pick up Andrew’s BroadCast Corp as well. Hence we need a second string-distance-based function that works on the result of the first and keeps only the relevant results: the string distance between ABC Corp and ABC Company is much lower than the one between ABC Corp and Andrew’s BroadCast Corp.
My approach here is to standardize the non-standard company name field, so that I can use it for fuzzy joining with the standard company name column.

I used the stringdistmatrix function (from the stringdist package) for this purpose, and below is the code.

As described, the code searches the non-standard company name column in the data frame for matching entries and replaces the non-standard names with the first matching entry. In the example,
the first entry is ABC Corp, so wherever the code finds matches among the other entries in the same column it replaces them with ABC Corp, and as the for loop moves forward the company name column ends up with only standard entries. As the file was not large, I traversed the whole column; otherwise, traversing half of the column should be sufficient, as the column will be standardized by then.
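The loop described above can be sketched in Python as follows. This is an analogue rather than the R code itself: a hand-written Levenshtein distance stands in for stringdist, the names levenshtein and standardize and the threshold max_dist=4 are illustrative choices, and the case-insensitive comparison is an added assumption.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]


def standardize(names, max_dist):
    """Replace each later near-duplicate with the first matching entry."""
    result = list(names)
    for i, canonical in enumerate(result):
        for j in range(i + 1, len(result)):
            if levenshtein(canonical.lower(), result[j].lower()) <= max_dist:
                result[j] = canonical
    return result


# Hypothetical column values in the spirit of the example above.
companies = ["ABC Corp", "ABC Company", "Andrew's BroadCast Corp", "ABC Corp."]
print(standardize(companies, max_dist=4))
```

With this threshold, ABC Company and ABC Corp. collapse to ABC Corp, while Andrew's BroadCast Corp is left alone because its full-string distance is far larger. The threshold needs tuning per data set, since too large a value merges unrelated names.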

Analysis of top 100 chess players

In this analysis I studied the age of the top 100 players and found an interesting pattern. The top 100 positions, among the toughest to attain and held by highly intellectual people from around the world, seem to be dominated by players in the 20-30 age range, with a median of 25 (the present number 1 player being 23 years old). The distribution also appears close to normal, with the shape of a bell curve.


Age based analysis

Another study, based on birthdays, revealed a pattern of astrological signs among the chess players. In the top 100, Sagittarius has the highest number of players (14), followed by the block of Gemini, Cancer and Leo. Interestingly, the three adjacent signs Capricorn, Aquarius and Pisces seem to have the same number of players. However, within the top 50, Cancer had the most players (7), which could also point to some hidden pattern. More research will follow and will be published here.

Age based analysis

Here is the supporting data

Supporting data