Month: January 2014

Introduction to R

In this post, I would like to give a brief intro to R. R has gained tremendous popularity in the recent past, and many people want to know just what R is without getting into too many details. So here it is.

What is R?

R is a programming environment for statistical analysis. It consists of a programming language, graphics capabilities for visualising data, interfaces to other languages and a debugging environment. R is open source, which means it is free to download, install and use under the GNU General Public License. Although R refers to the complete set of tools for data analysis, in this post R refers to the R programming language.

What’s its history?

R originated from the S language and has similarities with it. S was developed in 1975 at Bell Laboratories as a statistical programming language. Other variations of S were developed later on, including New S and S-PLUS, which is available today as commercial software. R was developed by Robert Gentleman and Ross Ihaka at the University of Auckland in the 1990s, initially as a teaching tool. Since then, new features have been continuously added to R.

Why use R?

There are many statistical analysis packages available in the market. Some notable examples include SAS, SPSS, STATISTICA, Stata, Minitab, Mathematica and MATLAB. Despite these, R has gained popularity among data miners and statisticians because of the following advantages.

• R is free.
• Data analysis in R is written like a computer program, so, unlike with some of the point-and-click tools, the analysis performed using R is repeatable and documented.
• Data analysis in R is easier to communicate than in other statistical packages, since the code itself can be shared.
• R has excellent graphics and visualization capabilities built in. At the same time, new data visualization techniques are being continuously added.
• All the standard statistical techniques and models are built right into R.
• R code can be extended and reused by creating packages. There are packages available for seemingly every task, from generating a data analysis report in Word or PDF to cutting-edge machine learning algorithms.
• R has dedicated followers in academia and the general data mining community. The community contributes new packages. There are more than 2000 packages available, catering to almost every need of R users. With such a large and diverse community, help on R is always available.
• Being open source, R can integrate easily with other programming languages and platforms.

What is R not?
• R is not a GUI-based data analysis tool. For every data analysis, the user must write code, and the code must be executed in sequence.
• R is not a general-purpose programming language like C, C++, C# or Python. Although many constructs from general-purpose programming languages, such as loops, functions and variables, are supported in R, it is specifically geared towards statistical analysis.
• R is not meant for guided analysis in the way that OLAP or technologies such as Tableau, QlikView and PowerPivot are. In all these technologies, the user is presented with a dataset which they can slice, dice or filter. With R, the user is forced to think about the analysis beforehand, since they have to write code for each step of the analysis. This method has both pros and cons. One advantage is that the approach makes the user understand the dataset a little better beforehand, which makes the analysis more structured. A disadvantage is that the user might miss some obvious patterns in the dataset.

Who uses R, and where?

With recent advances in Big Data technologies, interest in R has increased significantly. While the list is not exhaustive, industries which prominently use R include media and advertising, finance, e-commerce, academia and bioinformatics.
Some of the applications of R include:
1. Data mining
2. Recommender systems
3. Quantitative finance and automated trading
4. Predictive modelling
5. Statistical modelling

I know C#, is it similar?
Not really. R is a functional programming language. R can have a steep learning curve and expects some basic understanding of statistics.

Can you show me some examples of R programs and graphs?
There are many resources on the internet which show various charts generated using R. The links below contain charts along with the code used to generate them.
http://rgraphgallery.blogspot.co.uk/
http://learnr.wordpress.com/
http://www.sr.bham.ac.uk/~ajrs/R/r-gallery.html

Also check out the related links below.

Where should I get started?
The best way to learn R is to download it and play with it. Apart from that, there are numerous beginner tutorials available on the internet to get you started. If you want to try R without installing anything, you can use tryR, where R code can be executed in the browser window.

Related links

1. The R language home page, where you can download R binary packages and installers. The Comprehensive R Archive Network (CRAN) http://cran.r-project.org/

2. This is the place to ask for help on R. Please read the posting guide before sending anything to the mailing lists. R Mailing List page http://www.r-project.org/mail.html

3. A quick intro to R from CRAN itself. An introduction to R http://cran.r-project.org/doc/manuals/R-intro.pdf

4. A brief intro and quick reference on the R language. icebreaker notes http://www.ms.unimelb.edu.au/~andrewpr/r-users/

5. You can try R and learn it here without installing anything. The code runs in the browser. tryR http://tryr.codeschool.com/

6. Step-by-step tutorials. R-bootcamp http://jaredknowles.com/r-bootcamp/

7. Two-minute video tutorials on R topics. Fun if you are short on time. Twotorials http://www.twotorials.com/

8. A good collection of videos on R for beginner to advanced users. Video tutorial on R http://jeromyanglim.blogspot.co.uk/2010/05/videos-on-data-analysis-with-r.html

9. Step-by-step links for getting started with R. Learning R blog http://jeromyanglim.blogspot.co.uk/2009/06/learning-r-for-researchers-in.html

10. A blog on techniques in R and the latest happenings in the R world. R-bloggers http://www.r-bloggers.com/ Also of interest: http://blog.revolutionanalytics.com/

11. This discussion on Stack Exchange contains links to various resources on R. Stackexchange resources http://stats.stackexchange.com/questions/138/resources-for-learning-r

12. A more comprehensive book on R. Introduction to Probability and Statistics Using R http://cran.r-project.org/web/packages/IPSUR/vignettes/IPSUR.pdf

13. If you prefer more structured learning, this is probably the best option for learning R. Coursera course on Data Science https://www.coursera.org/specialization/#jhudatascience/1?utm_medium=catalogSpec The previous course videos are here: http://www.r-bloggers.com/videos-from-courseras-four-week-course-in-r/

14. A wiki page explaining what functional programming is. Functional Programming http://en.wikipedia.org/wiki/Functional_programming

Large date dimensions in SSAS

I would like to share two simple tips for date dimensions where the date range is large and uncertain.

A brief background

I was recently working on a project where I was faced with two situations.

1. The range of dates in the date dimension was large: roughly from the 1970s to 2030 and beyond.

2. The source data could contain dates outside of this range, i.e. before 1970 and after 2030, but we did not know the exact range for certain.

So here are the tips.

1. Add only required dates 

I could have just populated the date dimension with 200 years’ worth of dates, i.e. about 73,000 records, and been done with it. On closer inspection of the source data, however, I found that data in the fact table would be sparse for years outside the main range. There would be a few records with a date from 1955, a few from 1921 and so on. So why add all the rows for 1920 if the only date that is ever going to be used from that year is 01/02/1920? Even for future dates, why bother adding all the dates from 2070 if, for now, the only date I need is, let’s say, 23/09/2070?

To avoid fattening the date dimension, I created a static date range, i.e. the dates which are most often used. For dates outside this range, I created a so-called ‘date sync’ mechanism. In a nutshell, at the end of the dimension load and before the beginning of the fact load, it goes through all the date fields in the source tables (which are in staging or the ODS by now) and makes sure that all dates are present in the date dimension. If a date is missing, it simply creates a row for that particular day. It might seem a slow process, but since the data is in the relational engine by now, it is quite fast. Plus, it guarantees that every date is present in the date dimension, so the ETL won’t fail due to foreign key constraints.
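The ‘date sync’ step described above can be sketched roughly as follows. This is only an illustration and not the code from the project: it uses Python with an in-memory SQLite database so that it runs anywhere, and the table and column names (dim_date, stg_orders, order_date, due_date) are made up for the example. In a real warehouse the same INSERT ... WHERE NOT EXISTS pattern would run against the staging or ODS tables in the relational engine.

```python
import sqlite3


def sync_dates(conn, date_columns):
    """Add to dim_date every date found in the given staging columns
    that is not already present. Returns the number of rows added.

    date_columns is a trusted, hard-coded list of (table, column) pairs,
    so building the SQL with string formatting is acceptable here.
    """
    before = conn.total_changes
    for table, column in date_columns:
        conn.execute(f"""
            INSERT INTO dim_date (full_date, year, month, day)
            SELECT DISTINCT s.{column},
                   CAST(strftime('%Y', s.{column}) AS INTEGER),
                   CAST(strftime('%m', s.{column}) AS INTEGER),
                   CAST(strftime('%d', s.{column}) AS INTEGER)
            FROM {table} AS s
            WHERE s.{column} IS NOT NULL
              AND NOT EXISTS (SELECT 1 FROM dim_date AS d
                              WHERE d.full_date = s.{column})
        """)
    conn.commit()
    return conn.total_changes - before


# Hypothetical setup: a tiny date dimension and one staging table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_date (full_date TEXT PRIMARY KEY, "
             "year INTEGER, month INTEGER, day INTEGER)")
conn.execute("CREATE TABLE stg_orders (order_id INTEGER, "
             "order_date TEXT, due_date TEXT)")

# The 'static' range is assumed to be loaded already (one date shown here).
conn.execute("INSERT INTO dim_date VALUES ('2014-01-15', 2014, 1, 15)")

# Source rows containing dates far outside the static range.
conn.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)", [
    (1, "2014-01-15", "2070-09-23"),
    (2, "1920-02-01", "2014-01-15"),
    (3, "1920-02-01", None),
])

# Run between the dimension load and the fact load.
added = sync_dates(conn, [("stg_orders", "order_date"),
                          ("stg_orders", "due_date")])
print(added, "missing dates added to dim_date")
```

Running the sync a second time adds nothing, because every date referenced by the staging rows is already in the dimension by then.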

2. Categorise dates

As mentioned before, our date range was wide, so slicing and dicing using the date hierarchy was painful because the years ran from 1920 to 2030 and beyond. To make browsing a little less problematic, we introduced a year category field. When we asked the business users, they were most interested in data from the last 5 years to the next 10 years. So we added a derived column which categorized the dates into buckets like Pre-2008, 2008… and Post-2024. We created an attribute based on this field in the date dimension, and our date hierarchy looked like this.

DateCategory–>Year–>Month–>Date

Now, when the users dragged the date hierarchy onto rows in Excel, they would see years before 2008 under Pre-2008, then all the years between 2008 and 2024 (which they were most interested in), and then Post-2024. Nice and clean.
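As a rough illustration of the derived year category column, here is a small Python sketch. It assumes that each year inside the window of interest becomes its own category, while everything earlier or later collapses into a single bucket. The function name and the sort key are my own additions for the example; the sort key is there only because the labels would not sort chronologically as plain text.

```python
def year_category(year, first_year=2008, last_year=2024):
    """Return (sort_key, label) for the DateCategory level.

    Years inside [first_year, last_year] keep their own label; anything
    earlier or later falls into a single 'Pre-' or 'Post-' bucket.
    """
    if year < first_year:
        return (first_year - 1, f"Pre-{first_year}")
    if year > last_year:
        return (last_year + 1, f"Post-{last_year}")
    return (year, str(year))


# A few sample years, ordered by sort key the way the hierarchy would show them.
for key, label in sorted({year_category(y) for y in (1920, 1955, 2008, 2014, 2024, 2070)}):
    print(key, label)
```

In the dimension itself this would simply be a computed column on the date table, with the sort key used to order the attribute members.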

Hopefully these tips will be helpful to you in some way, or might give you a better idea for handling a large date dimension. If you have any suggestions, please feel free to drop me a line.