Zen of data modelling in Hadoop

The Zen of Python is a well-known set of tongue-in-cheek guidelines for writing Python code. If you haven’t read it, I would highly recommend reading it here: Zen of Python.

There’s an “Easter egg” in the Python REPL: type the following import and it will print out the Zen of Python.

import this

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren’t special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you’re Dutch.
Now is better than never.
Although never is often better than right now.
If the implementation is hard to explain, it’s a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let’s do more of those!

I have been pondering how you would model your data in Hadoop. There are no clear guidelines or hard and fast rules, as there are elsewhere: if it is an OLTP database, use 3NF; if you are building a warehouse, use the Kimball methodology. I had the same dilemma when SSAS Tabular came around in SQL Server 2012. The conclusion I reached then was that the answer lies somewhere between the two extremes: a fully normalized (3NF or higher) schema and a flat-wide-deep denormalized model.
The Hadoop ecosystem also throws additional complexities into the mix: what file format to choose (Avro, Parquet, ORC, Sequence and so on), what compression works best (Snappy, bzip2, etc.), access patterns (is it a scan, or is it more like a search), redundancy, data distribution, partitioning, distributed/parallel processing and so on. Some would give the golden answer “it depends”. However, I think there are some basic guidelines we can follow (while, at the same time, not disagreeing with the fact that it depends on the use case).
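To make these trade-offs concrete, here is a minimal PySpark sketch of the kind of decisions involved: a columnar file format (Parquet), a compression codec (Snappy) and partitioning by a scan-friendly column. The paths and column names are purely illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("modelling-sketch").getOrCreate()

# Hypothetical raw source; any DataFrame would do.
events = spark.read.json("/data/raw/events")

(events
    .repartition("event_date")            # chunk data into large blocks per partition
    .write
    .mode("overwrite")
    .option("compression", "snappy")      # cheap to decompress; good for full scans
    .partitionBy("event_date")            # enables filtering at the source (partition pruning)
    .parquet("/data/modelled/events"))
```

Which format and codec win will, of course, depend on the access pattern; the point is that in Hadoop these choices are explicit in a way they never were in a relational warehouse.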

So, in the same vein as the Zen of Python, I would like to present my take on the Zen of data modelling in Hadoop. In future posts, hopefully, I will expand further on each one.
So here it is:

Zen of Data Modelling in Hadoop

Scans are better than joins
Distributed is better than centralised
Tabular view is better than nested
Redundancy is better than lookups
Compact is better than sparse
End user queries matter
No model is appropriate for all use cases, so schema changes should be graceful
Multiple views of the data are okay
Access to all data is not always needed, filtering at the source is great
Processing is better done closer to data
Chunk data into large blocks
ETL is not the end, so don’t get hung up on it; ETL exists to serve data
Good enough is better than none at all; better is acceptable, best can wait

So there you have it. As mentioned previously, the model will depend on the situation, and let’s hope these will be a helpful starting point. They come with a full disclaimer: use at your own risk, and always subject to change 🙂

Changing role of IT in the BI world

BI is Dead, Long live BI

Timo Elliott (blog|twitter) recently published a post, BI is Dead, which draws from Gartner’s Magic Quadrant report and a more detailed accompanying report.

The main takeaways from the post (which includes references to the Gartner research, so not all are Timo’s points) are:

  • Self-service analytics tools have matured enough that the involvement of IT in BI and analytics projects is deemed optional. IT no longer needs to model the data upfront (which required gathering all or at least some of the analytics requirements first); analysts can prototype and test it themselves.
  • The balance of power for BI and analytics platform buying decisions has gradually shifted from IT to the business.
  • BI and analytics tools which do require intervention from IT are not considered BI anymore; they are Enterprise Reporting Based Platforms. Admittedly, they still take the largest share of the BI market. In other words, Gartner has updated the definition of BI.
  • The bold headline – “BI is Dead” – is there to draw attention to the changing landscape of BI tools.
  • Organizations that do not embrace the new definition of BI run the danger of turning into BI-nosaurs.
  • Having a single view of data through an EDW is pointless, or at least extremely hard. This and this from Curt Monash also support this line of thinking.
  • Most organizations believe that IT has a role to play in BI, although the majority want the responsibility for authoring content to move to end users. This has always been the holy grail of BI.

So is it really dead?

Well, not necessarily. It’s dead in the sense that PCs as we knew them from the 1980s are dead. PCs have clearly evolved: they don’t look like they used to, and many aspects of the PC have been democratized. Trivial things which needed a programmer in those days can now be done by users themselves. Heck, you can even upgrade the hardware yourself if you are into that sort of thing.

I think that’s exactly what is happening in the BI and analytics world. Self-service analytics tools have evolved to the point that many of the trivial data munging tasks can be done by the analysts themselves, and what they can do with these new tools looks nothing like how they used to do it. Does that mean BI is dead? No: the fundamental analytics is still the same. It has evolved just like the PC, and the role of IT has changed with it.

Role of IT in the new BI world

The changing role of IT in BI and analytics is best described by the picture below.

Changing role of IT in the BI world

IT is now viewed, and rightly so, as a data facilitator: it makes data available to the analysts in a palatable form. The responsibility for modelling the data, authoring presentation components and performing the analysis lies with the analysts.

Does that mean the job of IT got easier?

Most definitely not; on the contrary, it has become even more complicated. With the influx of new tools, the demand for data has dramatically increased. Analysts want to analyse all sorts of data, from all sorts of unlikely sources, with all sorts of analytical methods. They expect IT to facilitate this process, and that is where IT’s job has become a lot harder: we have to deal with ever-increasing data sizes from ever-expanding sources, and support ever-changing analytical tools.

What else is keeping BI alive?

If we assume the following definition of BI, then here are more reasons why BI is definitely not dead in its current incarnation:

an umbrella term to describe “concepts and methods to improve business decision making by using fact-based support systems”

1. Not everybody is a data scientist

Combining data from multiple sources, creating a data model and authoring reports from it all need a special skillset. Many times, analysts just want Excel connected to an OLAP cube to do their analysis. Traditional BI has a place in this space.

2. Robust, production-ready solutions

The analysis done in an analyst’s R code or IPython notebook is not usually production-ready. BI will be very much needed when it has to be productionised.

3. If ETL is part of BI, then it’s here to stay

How else would you provide quality data to analysts? And if the definition of BI includes methods to improve the process of decision making, then we need BI.

4. Operational reporting is a fact of life

As much as we harp on about ad-hoc analysis, data science and self-service BI, plain old operational reporting is a big part of any organization. For that, we need classic BI.

SQL Server 2016

SQL Server 2016 was announced in summer 2015, and a CTP has been available since then. It comes with some really cool features: JSON support, PolyBase and temporal tables are some of my favorites. SSRS has also seen a much-needed face-lift; in all honesty, the SSRS UI was getting a bit dated. HTML rendering and Datazen integration are also welcome improvements. The ability to deploy Power BI Desktop reports to SSRS 2016 is on the roadmap too: very welcome news for organizations that aren’t using Power BI (as in, the cloud version) yet.
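As a quick taste of the JSON support, here is a sketch of querying a 2016 instance from Python. The connection string is illustrative; FOR JSON PATH is the new T-SQL construct that returns a result set as JSON text.

```python
import json
import pyodbc

# Illustrative connection string; point it at any SQL Server 2016 instance.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 13 for SQL Server};SERVER=.;DATABASE=master;"
    "Trusted_Connection=yes;"
)

# FOR JSON PATH (new in SQL Server 2016) serialises the rows to JSON.
row = conn.cursor().execute(
    "SELECT TOP (3) name, database_id FROM sys.databases FOR JSON PATH"
).fetchone()

print(json.loads(row[0]))  # e.g. [{"name": "master", "database_id": 1}, ...]
```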

With that in mind, I have created the following presentation (or shall I call it a Sway?) which shows all the features I am looking forward to, with MSDN links for more information. Hope you find it useful too.

Please note that this is not an exhaustive list of features; these are just the features I am interested in.

Creating a new SSIS package? Have you thought about these things?

Creating a package in SSIS is easy, but creating a “good” SSIS package is a different story. As developers, we tend to jump right into building that wonderfully simple package and often overlook the nitty-gritty. Being an avid developer myself, I must confess that I have fallen prey to this from time to time. When I find myself in a rush to create “that” package, I take a step back and ask myself: “have you thought about these things?”. The list below is not comprehensive, but it’s a good starting point, and I will hopefully keep it updated as I come across more issues. Please note that this is not a list of SSIS best practices or SSIS optimization tips; these are more the high-level things I keep at the back of my mind whenever I create a new SSIS package.

1. SSIS Configuration
Configuration has to be the No. 1 thing to think about when you start creating any SSIS package. It governs how the package will run in multiple environments, so it is absolutely necessary to pay particular attention to it. Some of the questions to ask yourself are:

  • Where is the configuration stored? Is it an XML (.dtsConfig) file or SQL Server?
  • How easy is it to change configured values?
  • What values are you going to store in configuration? Is it just connection managers, or should you be storing some variable values as well?
  • What will happen if the package does not find a particular configured item? Would it fail? Would it do something it should not be doing?
  • How are you storing passwords, if there are any?
  • and so on…

2. SSIS Logging

There is a certain minimum of information each SSIS package must log. Not only is this good practice, it will make life much easier when a package fails in production and you want to know where and why it failed. As a rule of thumb, I think the following should be logged:

  • The start and end of the package
  • The start and end of each task
  • Any errors on tasks. You should log as much information as possible about the error, such as the error message, the variable values at the time of the error, server names, the files being processed, etc.
  • Row counts in the data flow

Which log provider to use is entirely up to you, although I tend to create the log in a SQL Server database because it is easier to query that way.
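One nice side effect of the SQL Server log provider is that a quick health check becomes a query. Here is a sketch; the connection string and database name are illustrative, and dbo.sysssislog is the table the built-in provider writes to.

```python
import pyodbc

conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=.;DATABASE=ETL_Logging;Trusted_Connection=yes;"
)

# Pull the most recent errors captured by the SQL Server log provider.
for row in conn.cursor().execute(
    """
    SELECT TOP (20) source, starttime, message
    FROM   dbo.sysssislog
    WHERE  event = 'OnError'
    ORDER  BY starttime DESC
    """
):
    print(row.source, row.starttime, row.message)
```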

3. Package restartability

Can you re-run the package as many times as you want? What if the package fails mid-operation? Would it restart from where it failed? If it starts from the beginning, what would it do?

4. Is your package atomic?

By “atomic”, I mean: is it doing just one operation, like “load date” or “load customer”, or is it doing multiple operations, like “load date and update fact”? It is always a good idea to keep packages atomic. This helps with restartability, besides helping when debugging the ETL. If you think your package is doing multiple operations in one go, split it into multiple packages.

5. Are you using the correct SSIS tasks?

There are tasks which SSIS is good at and tasks which databases are good at. For example, databases are good at JOIN operations, whereas SSIS can connect to an FTP site with ease. Are you using the optimal task? Can your current operation be done in pure T-SQL? If yes, push it to the database.

6. Are you using event handlers?

Event handlers are great if you want to take alternative actions on certain events. For example, if the package fails, the OnError event handler can be used to reset tables or notify somebody.

7. Have you thought about the data source?

How are you getting data from the data source? Is it the best way? Can you add a layer of abstraction between the data source and your SSIS package? If you are reading from a relational database, can you create views rather than hard-coding SQL queries? If you are reading from flat files, have you set the data types correctly?

8. Naming convention

Is your package aptly named? Does it do what it says on the tin? Does it convey meaningful information about what the package is doing?

The same rules also apply to the variables in the package.

9. SSIS Task Names

Have you renamed the SSIS tasks, and are the names descriptive enough to convey the meaning of the operation they perform?

10. Documentation/Annotations

Is your package well documented? Does the documentation describe WHY it is doing something rather than WHAT it is doing? The former is considered good documentation, although in the case of SSIS I find even the latter very helpful, because a new person doesn’t have to go through the package to understand what it is doing. SSIS annotations are great for in-package documentation and can be used effectively.

11. Is your package well structured both operationally and visually?

Can you box tasks into a series of sequence containers? Does your package flow nicely, either top-to-bottom or left-to-right? Are there any tasks hidden beneath others? Can a person looking at the package for the first time grasp what’s happening without digging into each task?

12. Are you using an ETL framework?

Having a generalized ETL framework will save a significant amount of time, because many of the repetitive pieces such as logging, event handlers and configuration can be created in one “template” package, and all other packages can be based on this template.

Please leave a comment about what you think, and whether there is anything you always keep in mind when developing in SSIS. An older post of mine about things to be aware of while developing SSAS cubes is one of my favorites, and I can see this becoming one too!

A warning about SSIS Foreach Loop container to process files from folder

The SSIS Foreach Loop container is frequently used to process files from a specific folder. If the names of the files are not known beforehand, the usual way is to process all the files with a specific extension, e.g. all CSV files or all XLSX files.

Until recently I was under the impression that if you put “*.csv” in the Files textbox of the Collection tab in the Foreach Loop editor, SSIS would look only for CSV files. However, this is not true. It appears that when SSIS looks for specific file types in a folder, the search is pattern-based. So if you put *.CSV, it will also process files with extensions like .CSV_New or .CSVOriginal.

To test this, I created a folder containing the following files, with variations on the CSV extension:

1. File1.CSV
5. File5.CSV.ARCHIEVED.201411232359
6. File6.CSV20141129
7. File7.A.CSV
8. File8.csv
9. File9.cSv
10. File10 (no extension)

I then created a simple SSIS package with a Foreach Loop container to iterate over these files, and a Script Task inside it to show a message box with the current filename. The Files textbox of the Collection tab in the Foreach Loop editor contained *.CSV. On running the package, the following files were picked up by the container:
1. File1.CSV
4. File6.CSV20141129
5. File7.A.CSV
6. File8.csv
7. File9.cSv

It is not a normal scenario to have files with all these variations of an extension in the same folder, but who knows. 🙂
If you are not sure which files would be processed by the SSIS Foreach Loop container, the easiest check is to navigate to the folder in Windows Explorer and put the file extension in the search box; the files it returns are the files SSIS would pick up.
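Another option is to enforce the extension yourself, either in a Script Task inside the loop or in a pre-processing step. The idea, sketched here in Python with an illustrative folder path, is to compare only the final extension, case-insensitively:

```python
from pathlib import Path

def files_with_extension(folder, ext=".csv"):
    """Return only files whose final extension matches exactly (case-insensitive)."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ext.lower()
    )

# Against the test folder above this keeps File1.CSV, File7.A.CSV, File8.csv
# and File9.cSv, but rejects File5.CSV.ARCHIEVED.201411232359,
# File6.CSV20141129 and File10.
print(files_with_extension(r"C:\Temp\csv_test"))
```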

Large date dimensions in SSAS

I would like to share two simple tips for when the date range in your date dimension is large and uncertain.

A brief background

I was recently working on a project where I was faced with two situations.

1. The range of dates in the date dimension was large: something from the 1970s to 2030 and beyond.

2. The source data could contain dates outside this range, i.e. before 1970 and after 2030, but we did not know the exact range for certain.

So here are the tips.

1. Add only required dates 

I could have just populated the date dimension with 200 years’ worth of dates, i.e. around 73,000 records, and been done with it. On closer inspection of the source data, however, I found that the data in the fact table would be sparse for years outside the commonly used range: there would be a few records with a date from 1955, a few from 1921, and so on. So why add all those extra rows for the year 1920 if the only date ever used from that year is 01/02/1920? Even for future dates, why bother adding every date in 2070 if, for now, the only date I need is, let’s say, 23/09/2070?

To avoid fattening the date dimension, I populated it with a static date range, i.e. the dates which are used most often. For dates outside this range, I created a so-called ‘date sync’ mechanism. In a nutshell, at the end of the dimension load and before the beginning of the fact load, it goes through all the date fields in the source tables (which are in staging or the ODS by then) and makes sure every date is present in the date dimension; if one is not, it simply creates a row for that particular day. It might seem like a slow process, but since the data is in the relational engine by then, it is quite fast. Plus, it guarantees the date will always be present in the date dimension, so the ETL won’t fail due to foreign key constraints.
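Here is a sketch of what the sync could look like for one source date column. The table and column names are hypothetical, and a real implementation would loop over every date field and populate all of the dimension’s attributes, but the shape is the same:

```python
import pyodbc

conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=.;DATABASE=DW;Trusted_Connection=yes;"
)

# Insert any staging dates missing from DimDate so the fact load
# never violates the foreign key. All names here are hypothetical.
sync_sql = """
INSERT INTO dbo.DimDate (DateKey, FullDate, CalendarYear)
SELECT DISTINCT
       CONVERT(int, CONVERT(char(8), s.OrderDate, 112)),  -- yyyymmdd surrogate key
       s.OrderDate,
       YEAR(s.OrderDate)
FROM   staging.Orders AS s
WHERE  s.OrderDate IS NOT NULL
  AND  NOT EXISTS (SELECT 1 FROM dbo.DimDate AS d
                   WHERE d.FullDate = s.OrderDate);
"""

with conn:  # pyodbc commits the transaction on a clean exit
    conn.cursor().execute(sync_sql)
```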

2. Categorise dates

So, as mentioned before, our date range was wide, and slicing and dicing using the date hierarchy was painful because the years ran from the 1920s to 2030 and beyond. To make browsing a little less problematic, we introduced a year category field. When we asked the business users, they were most interested in data from the last 5 years to the next 10 years. So we added a derived column which categorized the dates into buckets such as Pre-2008, 2008, … and Post-2024. We created an attribute based on this field in the date dimension and placed it at the top of our date hierarchy.


Now, when the users dragged the date hierarchy onto rows in Excel, they would see the years before 2008 under Pre-2008, then all the years between 2008 and 2024 (which they were most interested in), and then Post-2024. Nice and clean.
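The bucketing logic itself is trivial; in Python terms it is just the following (boundaries taken from this example):

```python
def year_category(year, start=2008, end=2024):
    """Bucket a calendar year the way the derived column does."""
    if year < start:
        return f"Pre-{start}"
    if year > end:
        return f"Post-{end}"
    return str(year)

assert year_category(1955) == "Pre-2008"
assert year_category(2015) == "2015"
assert year_category(2070) == "Post-2024"
```

In the actual dimension load this was a simple CASE expression producing the year-category column.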

Hopefully these tips will be helpful to you in some way, or might give you a better idea for handling a large date dimension. If you have any suggestions, please feel free to drop me a line.

Building data model for SSAS Tabular

I have started preparing for the Microsoft certifications on SQL Server 2012, and I am spending a fair amount of time with SSAS Tabular, since it’s the newest addition to the MS BI stack and the one I am least accustomed to. As I work with it more and more, I am starting to think about how you would go about designing the database on which the tabular model is based. In the multidimensional world, this had an easy answer: you would almost always design the DW as a star/snowflake schema, or at least expose it as one through SQL Server views or the Data Source View (DSV). In SSAS Tabular, it appears to me that this is not really necessary.

Do I really need to segregate data into facts and dimensions?
In SSAS Tabular I can create a measure out of any column in any table. It doesn’t have to be from a fact table, so why separate out facts and dimensions? Assume that I have a ‘House’ dimension with ‘NumberOfRooms’ as an attribute. In SSAS Multidimensional, if the users wanted to slice and dice ‘NumberOfRooms’ by other attributes of the house dimension, or by any other dimension, you would have to somehow surface it as a fact table. SSAS Tabular will happily let you create a measure out of this column even though it’s in a dimension table. In theory, you can point the tabular model at the normalized OLTP database and still give users the ability to slice and dice the data just like SSAS Multidimensional.

Normalized or denormalized, that is the question…
There would be obvious issues with building the tabular model on a normalized schema. One, since the database is normalized, the sheer number of tables will be very large. Two, maintaining relationships between the tables would be painful. Three, the model itself can be confusing for end users. On the other hand, building a fully denormalized schema when you don’t really need it adds complexity to the project, especially in the ETL. I don’t think I need to reiterate that the complex part of a DW project is the ETL, with around 60-70% of the effort spent on this phase. So it appears we need some middle ground.

Enter entity-based design…
I have always perceived the DW as a collection of entities, like product, sales and employee, which have relationships with each other. The star schema is similar, but the dimensions are not related to each other directly; they are related through the fact table. Plus, the dimensions are denormalized, so as much information as possible is crammed into them. I guess what I am trying to say is that we don’t need a fully denormalized structure, and at the same time we don’t want the data warehouse in 3NF. We need to find a middle ground where the model is just about correctly denormalized and normalized at the same time. In my opinion, the way to get there is to think of the DW in terms of entities and how they interact with each other, rather than classifying the data into dimensions and facts.

Of course, these are all my preliminary thoughts and are very much open to issues. I guess we will have to wait for widespread adoption of the tabular model and learn from the people who did it.