Zen of data modelling in Hadoop
The Zen of Python is well known tongue-in-cheek guidelines to writing Python code. If you haven’t read it, I would highly recommend reading it here Zen of Python.
There’s an “ester egg” in Python REPL. Type the following import into Python REPL and it will printout the Zen of Python.
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Special cases aren’t special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-and preferably only one –obvious way to do it.
Although that way may not be obvious at first unless you’re Dutch.
Now is better than never.
Although never is often better than right now.
If the implementation is hard to explain, it’s a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -let’s do more of those!
I have been pondering about how would you model your data in Hadoop. There are no clear guideline or hard and fast rules. As in, if it is an OLTP database use 3NF or if you are building warehouse, use Kimball methodology. I had the same dilemma when SSAS Tabular came around in SQL Server 2012. The conclusion that I reached was that it was somewhere between the two extremes: a fully normalized > = 3NF and flat-wide-deep denormalized mode.
Hadoop ecosystem also throws additional complexities in the mix – what file format to choose (Avro, Paraquet, ORC, Sequence and so on), what compression works best (Snappy, bzip etc), access patterns (is it scan, is it more like search), redundancy, data distribution,partitioning, distributed/parallel processing and so on. Some would give the golden answer “it depends”. However, I think there are some basic guidelines we can follow (while, at the same time, not disagreeing with the fact that it depends on the use case).
So in the same vein as Zen of Python I would like to present my take on Zen of data modelling in Hadoop. In future posts – hopefully – I will expand further on each one.
So here it is
Zen of Data Modelling in Hadoop
Scans are better than joins
Distributed is better than centralised
Tabular view is better than nested
Redundancy is better than lookups
Compact is better than sparse
End user queries matter
No model is appropriate for all use cases so schema changes be graceful
Multiple views of the data are okay
Access to all data is not always needed, filtering at the source is great
Processing is better done closer to data
Chunk data into large blocks
ETL is not the end, so don’t get hung up on it, ETL exists to serve data
Good enough is better than none at all, Better is acceptable than Best
So there you have it. As mentioned previously, the model will depend on the situation and lets hope these will be helpful starting point. They come with full disclaimer – Use at your own risk and are always subject to change 🙂