Zen of data modelling in Hadoop

Zen of data modelling in Hadoop

The Zen of Python is well known tongue-in-cheek guidelines to writing Python code. If you haven’t read it, I would highly recommend reading it here Zen of Python.

There’s an “ester egg” in Python REPL. Type the following import into Python REPL and it will printout the Zen of Python.

import this

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren’t special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-and preferably only one –obvious way to do it.
Although that way may not be obvious at first unless you’re Dutch.
Now is better than never.
Although never is often better than right now.
If the implementation is hard to explain, it’s a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -let’s do more of those!

I have been pondering about how would you model your data in Hadoop. There are no clear guideline or hard and fast rules. As in, if it is an OLTP database use 3NF or if you are building warehouse, use Kimball methodology. I had the same dilemma when SSAS Tabular came around in SQL Server 2012. The conclusion that I reached was that it was somewhere between the two extremes: a fully normalized > = 3NF and flat-wide-deep denormalized mode.
Hadoop ecosystem also throws additional complexities in the mix – what file format to choose (Avro, Paraquet, ORC, Sequence and so on), what compression works best (Snappy, bzip etc), access patterns (is it scan, is it more like search), redundancy, data distribution,partitioning, distributed/parallel processing and so on. Some would give the golden answer “it depends”. However, I think there are some basic guidelines we can follow (while, at the same time, not disagreeing with the fact that it depends on the use case).

So in the same vein as  Zen of Python I would like to present my take on Zen of data modelling in Hadoop. In future posts – hopefully – I will expand further on each one.
So here it is

Zen of Data Modelling in Hadoop

Scans are better than joins
Distributed is better than centralised
Tabular view is better than nested
Redundancy is better than lookups
Compact is better than sparse
End user queries matter
No model is appropriate for all use cases so schema changes be graceful
Multiple views of the data are okay
Access to all data is not always needed, filtering at the source is great
Processing is better done closer to data
Chunk data into large blocks
ETL is not the end, so don’t get hung up on it, ETL  exists to serve data
Good enough is better than none at all, Better is acceptable than Best

So there you have it. As mentioned previously, the model will depend on the situation and lets hope these will be helpful starting point. They come with full disclaimer – Use at your own risk and are always subject to change 🙂

Data Virtualization in Business Intelligence

A long time ago I wrote a blog post where I described three approaches to providing operational reports and compared them against each other. They are 1) traditional DW approach, 2) the abstract views approach and 3) the direct query approach. The blog post can be found here and I keep it handy whenever I am asked about operational reports.

I have been recently looking at Data Virtualization (DV) and started thinking how it can be used in a BI project. This blog is about that. If you are not familiar with DV, this youtube video from one of the DV software vendors provides good introduction. I would highly recommend watching it before proceeding.

This is the wikipedia definition of DV but in a nutshell it is a technique in which multiple data sources are joined together in a logical layer so as to abstract the complexities of the data sources and provide a unified view of the data to the end users. Going back to my original blog post, one of the difficulties of the views approach is in integrating multiple data sources. That’s primarily because views are simply projections on tables in databases. It’s quite difficult to create a view which spans across different databases for e.g. SQL Server view based on data in Oracle unless you some how bring it in the SQL server AKA ETL it. It is even more challenging if the data source is a flat file or XML or JSON. How would you create views on them?

Enter DV. In DV, as mentioned earlier, multiple data sources are joined together in a virtual layer. The type of the data sources can vary from relational databases to files in Hadoop to web services. The ETL is performed on the fly and in-place,if need be, i.e. multiple data sources are integrated in real time. If we can point our reporting tool to this virtual layer, we can provide real time operational reporting. Sweet!! So imagine a quick report you want to knock together which involves SQL Server, a web service and Excel file. DV will gladly connect to three, allow you mash them together and publish this data. You can point the reporting tool to published data and are good to go. And all in real time – that’s added bonus.

This all looks very cool hence a word of caution. It sounds like DV is replacement for ETL and data warehouse but it is NOT. There are far more advantages to having a DW and ETL process in place but they are beyond the scope of this blog. We can certainly think of creative ways of using DV in conjunction with traditional DW and ETL. Providing real time operational reports in a way mentioned above can be one of them.

99 Problems in R

In my Introduction to R post, I introduced R and provided some resources to learn it. I am learning R myself and finding the learning curve a bit steep. Anyway, the best way to learn a new programming language is to practice as much as possible. So inspired by 99 Problems in various languages (links below), I am creating ’99 Problems in R’ set. The project is on github. I am new to github but finding it easy to share code through github rather than here on the blog. Hopefully in future, I would make more use of github.
The files are in *.rmd format which can be opened in R Studio. I have also added knitted HTML files. The git repo is here.
Be warned that the solutions to problems are written by me; an amateur R programmer, so there might be better way of solving some of them. I will try to solve more problems and keep adding them to the repo. In the mean time, feel free to do pull request and peek at code.

I wish I can say ‘I got 99 problems but R ain’t one’ but alas I am not there yet. 🙂
99 Haskell Problems 

99 Python Problems

99 Prolog Problems

99 LISP Problems

99 Perl 6 Problems

99 OCaml Problems

Displaying last cube processed time in SSAS

Recently, while browsing a cube excel, an end-user asked me when was it last processed? I had no quick answer and so had to connect to the SSAS server in management studio and check the properties of the cube. This got me thinking on how to expose the cube’s last processed date and time to end users through Excel or similar client tools. This is very helpful to them in that they can quickly check when was the data last added. Think about it like this. A user builds a pivot table report in Excel with a filter at top which shows when was the cube processed. He then copies the pivot table data and sends to his colleague. His colleague checks the numbers and instantly knows that they are based on data from certain date. If he did not have the last processed date, the assumption can be made that the numbers are based on latest set of data which might not be true. So with that in mind, I decided to implement a simple solution to expose when the cube was last processed and here are the steps.

1. Create table in data warehouse which would have just one column storing date and time when the cube processing finished. This table will always have one record.

2. The record in the table above will be updated with latest timestamp after the cube has been processed. This can be done in the ETL.

3. Import this table in the DSV and create a dimension based on it. This dimension will have just one attribute and one member which will the the last time cube was processed. You might want to be careful with naming of this dimension. It should appear either at the top or bottom in fields list in pivot table so that it does not interfere with the analysis.

4. Set IsAggregatable property of the attribute to False. This is to remove the ‘All’ member because there’s always going to be one member in the dimension attributes,

5. Add this dimension in Dimension Usage matrix. You don’t have to worry about attaching it to any measure group.

6. Add the new dimension to the perspectives if you have any and that’s it.Deploy and process the cube.

To test, connect to the cube in Excel, drag the attribute in report filter and it should show when was the cube last processed. Here’s a little screenshot when I implemented this for Adventure Works 2012 cube. And yes, don’t forget to process the dimension while processing the SSAS database.


Filter shows when the cube was last processed

Hope that was helpful. Any suggestions are always welcome.

Welcome SQL Server 2014!!

I have been away from my blog for a while. Although, I try to write something every now and then, I just couldn’t bring my self to write anything this past month. I am hoping I will get into regular writing habit from now on.

Anyhow, you all must have heard about the big news this week about SQL Server. SQL Server 2014 was announced at Microsoft TechEd 2013 keynote. I am super excited and I am sure you all will be after looking at the new features. I am just about getting the hang of SQL Server 2012 and we already have new and exciting stuff to look forward to in SQL Server 2014. Things are really moving fast. SQL Server 2014 will introduce a number of new features. Oh!! and did I mention how relieved I was to see that nothing major has changed in terms of SSAS, SSIS and SSRS!!

At the end of the post you will find links to some good blogs and whitepapers about SQL Server 2014. For now, I want to quickly note features which I am excited about.

Project Hekaton Or SQL Server In-Memory OLTP

Microsoft is introducing In-Memory processing with SQL Server 2014. If I understand correctly, what this entails is keeping high frequency or most accessed tables in memory thereby saving the trip to the disk. The new mechanism also saves enormous amount of effort SQL Server has to spend on locking and latching. Part of SQL Server relational engine have been changed to take into consideration the In-Memory tables. For e.g. Query Optimiser can now take into consideration that the table is now completely in memory and can choose a plan based on this cost rather than making an explicit assumption that all the data is on disk.

There are some restrictions though like you cannot have row size bigger than 8060 Bytes, foreign key constraints are not supported, LOB data types are not supported, the newly introduced natively compiled stored procedures can only access In-Memory tables etc. Despite these, I think, this feature will be quite powerful. Not to mention as it gets new improvements over next version on SQL Server this feature can be really useful.

Although, In-Memory is being touted as OLTP feature, I think it will be useful in DW/ETL scenarios as well. I guess we would have to wait and watch how we can use it.

Updatable Columnstore Indexes

One of the growing the complaints with SQL Server 2012 Columnstore Indexes have been that they cannot be updated. They need to be rebuilt after the data in the underlying table has changed. Well no more!! SQL Server 2014 introduces updatable columnstore indexes.

This is probably going to be the best feature related to BI in SQL Server 2014. Updatable columnstore index with Tabular model, direct query in tabular model or ROLAP in  multidimensional model can be the killer  combination.

SSD as Cache Store

With SQL Server 2014, you can specify SSD as an extension to memory. This way SQL Server will cache frequently accessed data. Since SSD’s are way faster than disks, the performance will be improved. It’s almost as if the SSD will act as added buffer pool memory for your SQL Server. I think the In-memory and SSD feature will go hand in hand.

Online indexing at partition level and Partition level stats

As the name suggest index can now be built online at partition level and stats are maintained at partition level. I suppose the second part would lead to better execution plan and the first one to lesser downtime when indexing a large table.

Some of the other improvements include

– Unlike  SQL Server 2012, SQL Server 2014 supports 8 secondaries in Always On.

– There are some improvements/changes in Resource Governor which, to be honest, I didn’t really grasp

– SQL Server backups can directly be taken to Azure data store and SQL Server files can also be directly stored in Azure storage.

Please find below links to some of the interesting blogs and whitepapers on SQL Server 2014.

SQL Server 2014 Home Page : You can sign up here to get notified when the trial edition is available.

SQL Server 2014: Unlocking Real-Time Insights

SQL Server 2014 : A Closer Look 

(Almost) Everything You Need to Know About SQL Server 2014

My 2 cents on SQL Server 2014

Geek City: SQL Server 2014 In-Memory OLTP (“Hekaton”) Whitepaper: This blog has link to the Hekaton whitepaper. It’s an excellant read to get an idea about how in-memory is going to work. I will highly recommend it.

SQL Server 2014 Faster Insights Whitepaper

SQL Server Mission Critical Performance White Paper

Microsoft Data Explorer for Excel 2010/2013

Chris Webb posted about Data Explorer being available for public preview and I thought I will give it a quick spin. You can download it here . The installation is pretty straightforward. After the installation, when you open Excel (I am using Excel 2010), you would see a new Data Explorer tab like below.

Data Explorer Tab in Excel 2010

Data Explorer Tab in Excel 2010

Data Explorer can import data from a number of sources including Facebook (How cool is that!!!). The complete list of sources and how-to is here. For the purpose of this small exercise, I am using data from Lending Club. It is available to download from their website.
Once the data source is et up and connected, new query screen shows up and this is where many of the transformations can be done. I am going to list few here which can be of use to general users.

1. Change Type of column
By default the data type of the column is set to Auto-Detect. You can change it to your specific data type by right-clicking on the column and selecting the type from Change Type option.

Change Data Type of Column in Data Explorer

Change Data Type of Column in Data Explorer

2. Transformations
Similar to changing the data type, you can do some simple transformations on the columns. To do this, right click on the column in the query window and select Transform. The options in Transform menu will depend on the data type of the column. For numeric columns, this option is greyed out.

Data Transform Options in Data Explorer

Data Transform Options in Data Explorer

3. Adding a new column
New columns can be very easily added and formulas can be used. The complete formula reference is listed on Office help page.
For example, in the below screenshot I am adding a new column which will take the lower range of FICO Score (FICO score is like credit score in US).

Add a new column in Data Explorer

Add a new column in Data Explorer

4. Undo transform
One of the nice features was the ability to Undo a transform (oh!! how I miss the undo button in SSIS). On the left hand side, you would see Steps section. If you expand that, it will list all the steps you have taken so far to transform data. Clicking the X against it will undo the transform you did in that step. If you try to remove a step other than the last step, you get a warning.
Some other handy transforms are Split, Replace Values and Group By and I think they are self explanatory.

I did face an error though while doing some transformations. The file mentioned above contained 11428 records. After doing some transformations, I was trying to save the Query and get the data in a workbook, at which point I was persistently getting the following error. Not to mention, I sent a lot of frowns to MS. What worked was, using a sample of rows to do transformations and then import all in the workbook.

Error in Data Explorer After saving query

Error in Data Explorer After saving query

All in all, it’s a nice utility to do some simple transformations and combine data from different sources in excel. Soon I am planning to take look at how it works with Facebook data.

Tableau from a SQL Server guy – Part 2

As I mentioned in my previous post, I have been playing with Tableau Desktop in my spare time. I am just experimenting with it and there are no structured instructions that I am following. In this post, I intend to connect to Excel data source and do some analysis. I chose the sample excel provided by Tableau which is mostly located in “My Documents\My Tableau Repository\Datasources\Sample – Superstore Sales (Excel).xls”. Once connected, Tableau will try to create Dimensions and Measures for you. It seems like, all the numeric field are converted to measures and string fields are converted to dimensions. To test my theory, I added two columns in the source excel. One called Sales Category with string values and other Weight with numeric values. Sure enough Sales Category became a dimension and Weight became a measure.

Strings are converted to dimensions and numeric values are converted to measures

Strings are converted to dimensions and numeric values are converted to measures

You will also observe a small icon beside each field. The icon kind of represents the data type. For e.g. ‘Abc’ means it’s a character data, a # represents numeric field. There are certain special icons. For e.g. a little globe means the field has something to do with geography data, small square with clock is date/time field, that chart like thing is a bin(more on this later) and pin is Group (again more on that later). You can change data type by right clicking on the field  and selecting Change Data Type. You can move stuff from measures to dimensions and vice a versa (of course if it makes sense!). For e.g. Order Id in measures does not make much sense, you can drag it to dimension panel and it will become a dimension. My gut feeling is that measure is treated just like another dimension, something similar to what SQL Server 2012 Tabular does. Creating a hierarchy is simple enough. Right click on a field from dimension panel and select create hierarchy. Drag the fields under the hierarchy name.



Coming to the bins and groups part mentioned above. I was a bit puzzled and struggled to get my head around the concept. Once I got hang of it, its is one of my favourite features. Bins are like buckets of data or data groups. For e.g you have a persons age as measure, you can group it in bins of 0-10,10-20,20-30 and so on or as ‘Kids’,’Adults’ etc. This tutorial is a good starting point.

That’s it for this post. In the next post, I am going to try to answer some questions based on some interesting datasets I have found using Tableau. So stay tuned!!