Web scraping in R using rvest

I am not well versed in web scraping, but I understand the importance of the technique, given that a lot of very useful data is embedded in HTML pages. So I was very excited when I came across this blog post on the RStudio site, which introduced a new package called rvest for web scraping. The GitHub repository of the package is here.

As an exercise in scraping web pages, I set out to get all the Exchange Traded Funds (ETF) data from the London Stock Exchange web site.

First things first: load up the rvest package and set the base url and a download location where the html will be saved. You can do this without downloading the file, but some proxy settings in the environment I was working in prevented me from doing so. So I opted to download the html, process it and then delete it.

library("rvest")
url <- "http://www.londonstockexchange.com/exchange/prices-and-markets/ETFs/ETFs.html"
download_folder <- "C:/R/"
etf_table <- data.frame()
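
The download, process, delete cycle described above can be wrapped in one small function. This is only a sketch in base R: fetch_via_tempfile and its fetch and parse arguments are hypothetical names that are not part of rvest, and it writes to a temporary file instead of a fixed folder.

```r
# Hypothetical helper (not part of rvest): download a page to a temporary
# file, parse it, and delete the file again even if parsing fails.
fetch_via_tempfile <- function(page_url, parse = readLines, fetch = download.file) {
  tmp <- tempfile(fileext = ".html")
  # on.exit() runs when the function returns or stops with an error
  on.exit(unlink(tmp), add = TRUE)
  fetch(page_url, tmp, quiet = TRUE)
  parse(tmp)
}

# Offline demonstration: a stand-in fetch function that writes a tiny page
# to the destination file instead of going to the network.
fake_fetch <- function(url, destfile, quiet = TRUE) {
  writeLines("<p class='floatsx'>Page 1 of 38</p>", destfile)
  invisible(0)
}
demo_lines <- fetch_via_tempfile("http://example.invalid/", fetch = fake_fetch)
```

With rvest loaded, passing parse = read_html should return a parsed document in place of raw lines, and leaving fetch at its default uses download.file as in the rest of this post.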

The next thing to determine is how many pages there are in the ETF table. If you visit the url, you will find that just above the table where the ETFs are displayed there is a string which tells us how many pages there are. It was 38 when I was writing the script. If you look at the source html, this string appears in a paragraph tag whose class is floatsx.

Time to call html_node to get the paragraph with class floatsx and then run html_text to get the actual string. Then it's a matter of taking a substring of the complete string to get the number of pages.

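#find how many pages there are
download.file(url,paste(download_folder,"ETFs.html",sep=""))
#read_html() replaces html(), which newer rvest releases deprecate
html <- read_html(paste(download_folder,"ETFs.html",sep=""))
pages <- html_text(html_node(html,"p.floatsx"))
#the page count is the last two characters of the string
pages <- as.numeric(substr(pages,nchar(pages)-1,nchar(pages)))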
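
The substr call above relies on the page count being the last two characters of the string, so it would stop working if the count dropped below 10 or went above 99. A slightly more forgiving version could pull out the last run of digits instead. This is a sketch in base R: extract_page_count is a hypothetical helper, and the sample strings are made up for illustration, not copied from the real page.

```r
# Hypothetical helper: return the last run of digits in a pager string.
extract_page_count <- function(pager_text) {
  # gregexpr() finds every run of digits; regmatches() extracts them
  digits <- regmatches(pager_text, gregexpr("[0-9]+", pager_text))[[1]]
  if (length(digits) == 0) stop("no number found in pager text")
  as.numeric(digits[length(digits)])
}

# Made-up sample strings, assumed to resemble the text in p.floatsx
two_digits   <- extract_page_count("Page 1 of 38")
one_digit    <- extract_page_count("Page 1 of 7 ")
three_digits <- extract_page_count("Page 1 of 120")
```

This only holds if the page count really is the last number in the string, which is what the substr call already assumes.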

Now that we know how many pages there are, we want to iterate over each page and get the ETF values from the table. Again we load the html and call html_nodes, but this time we look for all the tables. On this page there is just one table, which displays all the ETF rates, so we are only interested in the first table.

#for each page
for (p in seq_len(pages)) {
 cur_url <- paste(url,"?&page=",p,sep="")
 page_file <- paste(download_folder,p,".html",sep="")
 #download the file
 download.file(cur_url,page_file)
 #create html object (read_html() replaces the deprecated html())
 html <- read_html(page_file)
 #look for tables on the page and get the first one
 table <- html_table(html_nodes(html,"table")[[1]])
 #only first 6 columns contain information that we need
 table <- table[1:6]
 #stick a timestamp at end
 table["Timestamp"] <- Sys.time()
 #add into the final results table
 etf_table <- rbind(etf_table,table)
 #remove the originally downloaded file
 file.remove(page_file)
}

#summary (outside the loop, so it prints once for the complete table)
summary(etf_table)
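
The loop above grows etf_table with rbind on every pass, which is fine for a few dozen pages. An alternative is to collect each page's table in a list and bind them all once at the end. The sketch below shows that pattern in base R, without any scraping: fake_page_table is a made-up stand-in for the html_table step, and its columns and values are invented for illustration.

```r
# Made-up stand-in for the table scraped from page p (invented columns)
fake_page_table <- function(p) {
  data.frame(Name = paste0("ETF_", p, c("a", "b")),
             Price = c(p * 1.5, p * 2.5),
             stringsAsFactors = FALSE)
}

n_pages <- 3
# pre-allocate one list slot per page
page_tables <- vector("list", n_pages)
for (p in seq_len(n_pages)) {
  page_tables[[p]] <- fake_page_table(p)
}
# bind all pages into one data frame in a single call
all_pages <- do.call(rbind, page_tables)
```

In the real script, the body of the loop would still download, parse and delete each page, and only the final rbind would move outside the loop.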

As you can see, rvest makes scraping web data extremely simple, so give it a try. The markdown file and knitted html are available at the GitHub link below if you want to run it in your own environment.
GitHub link
