
Web scraping in R using rvest

I am not very familiar with web scraping, but I understand the importance of the technique, given that a lot of very useful data is embedded in HTML pages. Hence I was very excited when I came across this blog post on the RStudio site, which introduced a new package called rvest for web scraping. The GitHub repository of the package is here.

As an exercise in scraping web pages, I set out to get all the Exchange Traded Funds (ETF) data from the London Stock Exchange website.

First things first: load the rvest package and set the base URL and a download location where the HTML will be saved. You can do this without downloading the file, but some proxy settings in the environment I was working in prevented me from doing so. So I opted to download the HTML, process it and then delete it.

library("rvest")
url <- "http://www.londonstockexchange.com/exchange/prices-and-markets/ETFs/ETFs.html"
download_folder <- "C:/R/"
etf_table <- data.frame()

The next thing to determine is how many pages there are in the ETF table. If you visit the URL, you will find that just above the table where the ETFs are displayed there is a string which tells us how many pages there are. It was 38 when I was writing the script. If you look at the source HTML, this string appears in a paragraph tag whose class is floatsx.

Time to call html_node to get the paragraph with class floatsx and then run html_text to get the actual string. Then it's a matter of taking a substring of the complete string to get the number of pages.

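#find how many pages there are
download.file(url,paste(download_folder,"ETFs.html",sep=""))
#read_html replaces html(), which newer rvest versions deprecate
html <- read_html(paste(download_folder,"ETFs.html",sep=""))
pages <- html_text(html_node(html,"p.floatsx"))
#the last two characters hold the page count (assumes a two-digit count)
pages <- as.numeric(substr(pages,nchar(pages)-1,nchar(pages)))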
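
Taking the last two characters works while the page count has two digits, but a three-digit count would be truncated. A slightly more forgiving sketch is to pull out the last run of digits instead. The helper name last_number and the sample string "Page 1 of 38" are assumptions for illustration; the real text on the page may be worded differently.

```r
# Hypothetical helper: return the last number that appears in a string.
last_number <- function(x) {
  # Find every run of digits in the string.
  digits <- regmatches(x, gregexpr("[0-9]+", x))[[1]]
  # No digits at all: signal that with NA instead of failing.
  if (length(digits) == 0) return(NA_real_)
  # Keep only the final run and convert it to a number.
  as.numeric(digits[length(digits)])
}

# Assumed example of the pagination text; not copied from the real page.
pager_text <- "Page 1 of 38"
pages <- last_number(pager_text)
```

With this, pages would no longer depend on how many digits the count has.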

Now that we know how many pages there are, we want to iterate over each page and get the ETF values from the table. Again we load the HTML and call html_nodes, but this time we are looking for all the tables. On this page there is just one table, which displays all the ETF rates, so we are only interested in the first table.

#for each page
for (p in 1:pages) {
 cur_url <- paste(url,"?&page=",p,sep="")
 #download the file
 download.file(cur_url,paste(download_folder,p,".html",sep=""))
 #create html object (read_html replaces the deprecated html())
 html <- read_html(paste(download_folder,p,".html",sep=""))
 #look for tables on the page and get the first one
 table <- html_table(html_nodes(html,"table")[[1]])
 #only first 6 columns contain information that we need
 table <- table[1:6]
 #stick a timestamp at end
 table["Timestamp"] <- Sys.time()
 #add into the final results table
 etf_table <- rbind(etf_table,table)
 #remove the originally downloaded file
 file.remove(paste(download_folder,p,".html",sep=""))
}

#summary (outside the loop, so it is printed once for the full table)
summary(etf_table)
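
The loop above grows etf_table with rbind on every pass, which copies the whole table each time. For 38 pages that hardly matters, but one alternative is to collect the per-page tables in a list and bind them once at the end. This is only a sketch: fetch_page is a hypothetical stand-in for the download-and-parse step, and it returns made-up rows so that the example runs without network access.

```r
# Hypothetical stand-in for "download page p and extract its table".
# The columns and values are invented purely for illustration.
fetch_page <- function(p) {
  data.frame(Code = paste0("ETF", p), Price = p * 10,
             stringsAsFactors = FALSE)
}

pages <- 3

# Build one data frame per page and keep them all in a list.
page_tables <- lapply(seq_len(pages), fetch_page)

# Bind all pages together in a single step.
etf_table <- do.call(rbind, page_tables)

# Stamp the combined table once instead of once per page.
etf_table$Timestamp <- Sys.time()
```

In the real script the body of fetch_page would hold the download.file, html_table and file.remove calls from the loop.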

As you can see, rvest makes scraping web data extremely simple, so give it a try. The markdown file and knitted HTML are available at the GitHub link below if you want to run it in your own environment.
GitHub link

99 Problems in R

In my Introduction to R post, I introduced R and provided some resources to learn it. I am learning R myself and finding the learning curve a bit steep. Anyway, the best way to learn a new programming language is to practice as much as possible. So, inspired by 99 Problems in various languages (links below), I am creating a ’99 Problems in R’ set. The project is on GitHub. I am new to GitHub but find it easier to share code there than here on the blog. Hopefully I will make more use of GitHub in future.
The files are in *.rmd format, which can be opened in RStudio. I have also added knitted HTML files. The git repo is here.
https://github.com/saysmymind/99-Problems-R
Be warned that the solutions to the problems are written by me, an amateur R programmer, so there might be better ways of solving some of them. I will try to solve more problems and keep adding them to the repo. In the meantime, feel free to send a pull request and peek at the code.

I wish I could say ‘I got 99 problems but R ain’t one’ but alas I am not there yet. 🙂
99 Haskell Problems 

99 Python Problems

99 Prolog Problems

99 LISP Problems

99 Perl 6 Problems

99 OCaml Problems

Introduction to R

In this post, I would like to give a brief intro to R. R has gained tremendous popularity recently, and many people want to know just what R is without getting into too many details. So here it is.

What is R?

R is a programming environment for statistical analysis. It consists of a programming language, graphics capabilities to visualise data, interfaces to other languages and a debugging environment. R is open source, which means it is free to download, install and use under the GNU General Public License. Although R refers to the complete set of tools for data analysis, in this post R refers to the R programming language.

What’s its history?

R originated from the S language and has similarities with it. S was developed in 1975 at Bell Laboratories as a statistical programming language. Other variations of S were developed later on, including New S and S-PLUS, the latter being sold as commercial software. R was created by Robert Gentleman and Ross Ihaka at the University of Auckland in the early 1990s, initially as a tool for teaching statistics, and the R Core Team has been developing it since 1997. New features are continuously being added to R.

Why use R?

There are many statistical analysis packages available in the market. Some notable examples include SAS, SPSS, STATISTICA, Stata, Minitab, Mathematica and MATLAB. Despite these, R has gained popularity among data miners and statisticians because of the following advantages.

• R is free
• Data analysis in R is written like a computer program, hence unlike some of the point-and-click tools, the data analysis performed using R is repeatable and documented.
• Data analysis in R is easier to communicate compared to other statistical packages.
• R has excellent graphics and visualization capabilities built in. At the same time, new data visualization techniques are continuously being added.
• All the standard statistical techniques and models are built right into R.
• R code can be extended and reused by creating packages. There are packages available for seemingly every task, from generating a report of an R data analysis in Word or PDF to cutting-edge machine learning algorithms.
• R has dedicated followers in academia and the general data mining community. The community contributes new packages. There are more than 2000 packages available, catering to almost every need of R users. With such a large and diverse community, help on R is always available.
• Being open source, R can integrate easily with other programming languages and platforms.
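
To make the points about repeatability and built-in models a little more concrete, here is a minimal sketch of a scripted analysis. It uses mtcars, a sample dataset that ships with R; the choice of variables (miles per gallon against weight) is just an assumption for illustration.

```r
# mtcars is a small sample dataset included with R.
data(mtcars)

# Descriptive statistics for fuel consumption (miles per gallon).
fuel_summary <- summary(mtcars$mpg)

# Fit a linear regression of mpg on car weight using the built-in lm().
fit <- lm(mpg ~ wt, data = mtcars)

# Extract the estimated slope for weight from the fitted model.
slope <- coef(fit)[["wt"]]

# Printing the objects shows the results; rerunning the script repeats
# exactly the same analysis.
print(fuel_summary)
print(slope)
```

Because every step is code, the script itself documents what was done and can be shared or rerun on updated data.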

What is R not?
• R is not a GUI based data analysis tool. So for every data analysis, the user must write code and the code must be executed in sequence.
• R is not a general purpose programming language like C, C++, C# or Python. Although many of the constructs of general purpose programming languages, such as loops, functions and variables, are supported in R, it is specifically geared towards statistical analysis.
• R is not meant for guided analysis in the way that OLAP or technologies such as Tableau, QlikView, PowerPivot etc. are. In all these technologies, the user is presented with a dataset which they can slice, dice or filter. With R, the user is forced to think about the analysis beforehand, since they have to write code for each step of the analysis. This has both pros and cons. One advantage is that the approach makes the user understand the dataset a little better beforehand, which makes the analysis more structured. A disadvantage is that the user might miss some obvious patterns in the dataset.

Who uses R, and where?

With recent advances in Big Data technologies, interest in R has increased significantly. While the list below is not exhaustive, some industries which most prominently use R include media and advertising, finance, e-commerce, academia and bioinformatics.
Some of the applications of R include:
1. Data mining
2. Recommender systems
3. Quantitative finance and automated trading
4. Predictive modelling
5. Statistical modelling

I know C#, is it similar?
Not really. R is a functional programming language. R can have a steep learning curve and expects some basic understanding of statistics.
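
As a rough illustration of what "functional" means here, the sketch below shows functions being passed around as values, which is a common style in R. All the names (square, make_adder, add_three) are made up for this example.

```r
# Functions are ordinary values: they can be stored, passed and returned.
square <- function(x) x^2

# sapply applies a function to each element, so no explicit loop is needed.
squares <- sapply(1:5, square)

# Filter and Reduce are higher-order functions from base R.
evens <- Filter(function(x) x %% 2 == 0, 1:10)
total <- Reduce(`+`, 1:10)

# A closure: make_adder returns a new function that remembers n.
make_adder <- function(n) function(x) x + n
add_three <- make_adder(3)
seven <- add_three(4)
```

A C# programmer would probably reach for a for loop and a class in these places, which is part of why the two languages feel different.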

Can you show me some examples of R programs and graphs?
There are many resources on the internet which show various charts generated using R. The links below contain the charts and also the code used to generate them.
http://rgraphgallery.blogspot.co.uk/
http://learnr.wordpress.com/
http://www.sr.bham.ac.uk/~ajrs/R/r-gallery.html

Also check out related links below.

Where should I get started?
The best way to learn R is to download it and play with it. Apart from that, there are numerous beginner tutorials available on the internet to get you started. If you want to try R without installing anything, you can use tryR, where R code can be executed in the browser window.

Related links

1. R language home page, where you can download R binaries and installers. The Comprehensive R Archive Network (CRAN) http://cran.r-project.org/

2. This is the place to ask for help on R. Please read posting guide before sending anything to mailing lists. R Mailing List page http://www.r-project.org/mail.html

3. Quick intro to R from CRAN itself. An introduction to R http://cran.r-project.org/doc/manuals/R-intro.pdf

4. Brief intro and quick reference on R language icebreaker notes http://www.ms.unimelb.edu.au/~andrewpr/r-users/

5. You can try R and learn here without installing anything. The code runs in the browser. tryR http://tryr.codeschool.com/

6. Step by step tutorials. R-bootcamp http://jaredknowles.com/r-bootcamp/

7. Two minute video tutorials on R topics. Fun if you are short on time. Twotorials http://www.twotorials.com/

8. A good collection of videos on R from beginner to advanced users Video tutorial on R http://jeromyanglim.blogspot.co.uk/2010/05/videos-on-data-analysis-with-r.html

9. Contains step by step links for getting started with R. Learning R blog http://jeromyanglim.blogspot.co.uk/2009/06/learning-r-for-researchers-in.html

10. A blog on techniques in R and latest happenings in R world R-bloggers http://www.r-bloggers.com/  Also of interest can be http://blog.revolutionanalytics.com/

11. This discussion on stackexchange contains link to various resources on R Stackexchange resources http://stats.stackexchange.com/questions/138/resources-for-learning-r

12. This is more comprehensive book on R Introduction to Probability and Statistics Using R http://cran.r-project.org/web/packages/IPSUR/vignettes/IPSUR.pdf

13. If you are into more structured learning, this is probably the best option to learn R. Coursera course on Data Science https://www.coursera.org/specialization/#jhudatascience/1?utm_medium=catalogSpec  The previous course videos are here http://www.r-bloggers.com/videos-from-courseras-four-week-course-in-r/

14. Wiki page explaining what functional programming is. Functional Programming http://en.wikipedia.org/wiki/Functional_programming

Sieve of Eratosthenes in R and SQL

I just completed Computing for Data Analysis on Coursera. The course is a brief introduction to the R programming language. R has been around for years but has been gaining a lot of attention recently due to the Big Data and Data Science trends. I had some idea about R, and the course offered a wonderful opportunity to learn it in a systematic manner. So for some more practice and a bit of fun (ok, I admit, more for fun than practice), I decided to implement the ‘Sieve of Eratosthenes’ in R and SQL and see which one is faster (because that’s what you do on a lazy Saturday!!). The sieve is a method for finding all primes up to a given number. The R code looks like this.


getPrimeNumTilln <- function(n) {
  # there are no primes below 2
  if (n < 2) return(integer(0))

  # a holds a sequence of numbers from 2 to n
  a <- 2:n
  # we start from 2 since it is the first prime number,
  # it is also the loop variable
  l <- 2
  # r holds the results
  r <- c()
  # while the square of the loop variable is at most n
  # (<= rather than <, otherwise 9 would survive as a "prime" when n = 9)
  while (l * l <= n) {
    # the prime number is the first element in vector a
    # e.g. in the first iteration it will be 2
    r <- c(r, a[1])

    # remove elements from a which are multiples of l
    # e.g. in the first iteration it will remove 2,4,6,8…
    a <- a[a %% l != 0]

    # the loop variable is the first element in the remaining a
    # e.g. after the first iteration it will be 3, then 5 (since 4 has been removed)…
    l <- a[1]
  }
  # the result is r and all the remaining elements in a
  c(r, a)
}
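
For comparison, the sieve is often written with a logical vector that marks composites in place rather than shrinking the candidate vector. The sketch below is that alternative formulation, not the code that was timed in this post, and the name sieve_primes is made up for the example.

```r
# Alternative sketch: classic sieve using a logical "is prime" flag per number.
sieve_primes <- function(n) {
  # There are no primes below 2.
  if (n < 2) return(integer(0))

  # Position i of is_prime says whether i is still considered prime.
  is_prime <- rep(TRUE, n)
  is_prime[1] <- FALSE

  p <- 2
  while (p * p <= n) {
    if (is_prime[p]) {
      # Mark p^2, p^2 + p, ... as composite; smaller multiples of p
      # were already marked by smaller primes.
      is_prime[seq(p * p, n, by = p)] <- FALSE
    }
    p <- p + 1
  }

  # The positions still flagged TRUE are the primes.
  which(is_prime)
}

primes_up_to_30 <- sieve_primes(30)
```

Wrapping a call in system.time(), for example system.time(sieve_primes(1000000)), gives a rough timing if you want to compare approaches.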

And the SQL code looks like this.

DECLARE @maxnum INT = 1000 /* The number under which you want to find primes*/

DECLARE @l INT = 2 /* Beginning of prime numbers */

DECLARE @table TABLE (a INT NOT NULL PRIMARY KEY) /* Holding table*/

;WITH ct /* Generate the sequence of numbers*/
AS (
SELECT 2 AS id

UNION ALL

SELECT id + 1
FROM ct
WHERE id < @maxnum
)

INSERT INTO @table
SELECT id
FROM ct
OPTION (MAXRECURSION 0)

WHILE (@l * @l <= @maxnum) /* <= so that a perfect-square @maxnum is sieved too */
BEGIN
/*remove records which are divisible by l*/
DELETE
FROM @table
WHERE a != @l
AND (a % @l) = 0

/* the first remaining number is the prime number */
SELECT @l = MIN(a)
FROM @table
WHERE a > @l
END

SELECT COUNT(*)
FROM @table

Now, I am no expert in either maths or algorithms, but that looks neat. To validate that the results are right, I ran both to count the prime numbers under 1000000 and both returned 78498. Wolfram Alpha seems to agree.

For smaller n, up to 10k, the results are comparable; both finish in milliseconds. Above that, however, R seems to have the upper hand.

n          R (s)    SQL (s)
100000     0.02     1.72
1000000    0.56     19.00
10000000   11.25    246.21

Please note that the time is in seconds. The difference is quite stark especially for large n. R is killing SQL.

A few observations on the SQL side:

  1. A significant amount of time is spent generating the sequence. With SQL Server 2012 sequences, we might be able to improve the time.
  2. The delete operation is quite slow, as we all know. I tried replacing it with an update but that made it worse.

I know that there are many improvements that can be made to this, but I am happy with my little test. As usual, comments are welcome.