Tips

Zen of data modelling in Hadoop

The Zen of Python is a well-known set of tongue-in-cheek guidelines for writing Python code. If you haven’t read it, I would highly recommend reading it here: Zen of Python.

There’s an “Easter egg” in the Python REPL. Type the following import into the Python REPL and it will print out the Zen of Python.

import this

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren’t special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you’re Dutch.
Now is better than never.
Although never is often better than right now.
If the implementation is hard to explain, it’s a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let’s do more of those!

I have been pondering how you would model your data in Hadoop. There are no clear guidelines or hard and fast rules, as in: if it is an OLTP database, use 3NF; if you are building a warehouse, use the Kimball methodology. I had the same dilemma when SSAS Tabular came around in SQL Server 2012. The conclusion I reached was that it sits somewhere between the two extremes: a fully normalized (3NF or higher) model and a flat-wide-deep denormalized model.
The Hadoop ecosystem also throws additional complexities into the mix: what file format to choose (Avro, Parquet, ORC, Sequence and so on), what compression works best (Snappy, bzip2, etc.), access patterns (is it a scan, or more like a search), redundancy, data distribution, partitioning, distributed/parallel processing and so on. Some would give the golden answer “it depends”. However, I think there are some basic guidelines we can follow (while, at the same time, not disagreeing with the fact that it depends on the use case).

So in the same vein as the Zen of Python, I would like to present my take on the Zen of data modelling in Hadoop. In future posts, hopefully, I will expand further on each one.
So here it is:

Zen of Data Modelling in Hadoop

Scans are better than joins
Distributed is better than centralised
Tabular view is better than nested
Redundancy is better than lookups
Compact is better than sparse
End user queries matter
No model is appropriate for all use cases, so schema changes should be graceful
Multiple views of the data are okay
Access to all data is not always needed, filtering at the source is great
Processing is better done closer to data
Chunk data into large blocks
ETL is not the end, so don’t get hung up on it; ETL exists to serve data
Good enough is better than none at all, and better is acceptable even if it is not the best

So there you have it. As mentioned previously, the model will depend on the situation, and let’s hope these will be a helpful starting point. They come with a full disclaimer: use at your own risk, and they are always subject to change 🙂


Knight’s Move Puzzle in TSQL

A colleague at work set us an interesting puzzle. It is a generic programming puzzle, but we were required to solve it with TSQL. The person who comes up with the fastest solution wins. Here’s the puzzle in question.

Puzzle


Imagine a keypad like below.

[Image: keypad layout]

We want to find all of the unique 10 key sequences that can be entered into the keypad.

  • The initial key press can be any of the keys.
  • Each subsequent key press must be a knight move from the previous key press.
  • A key can be used any number of times.
  • There can be at most two vowels in each sequence.

A valid knight move is either

  • One step vertically and two steps horizontally
  • One step horizontally and two steps vertically

Examples of valid moves are below (from black to green).

[Image: examples of valid knight moves (black to green)]

The SQL Code must

  • Display the total number of possible sequences
  • Store all of the sequences in a Temp table named #Results
  • Not require any permissions other than Public and Connect on the database that the query window is connected to

The person whose code runs fastest wins.

Full disclaimer: I did not win the challenge. A couple of co-workers developed really innovative solutions, but I thought it was worth putting my approach on the blog so that I can come back to it later.
I took a (probably) very simplistic approach and was fairly happy with the results. But that’s not the end of it. When I ran the same solution script on various SQL Server versions, the timings varied widely, getting better with each newer version of SQL Server.

Solution


The complete code below is available on my GitHub account.

Convert to table

Let’s get the obvious out of the way: this was meant to be solved using TSQL, so a procedural approach was out of the question. TSQL, as we all know, excels at relational operations, and we will get the fastest solution if we exploit that. Next up, we want to use the other good part about RDBMSs in general: the ability to index and do fast lookups.

With that in mind, let’s model the keypad as a table. Each character is unique, so that’s our primary key, and each has a set of properties: row location, column location and whether it is a vowel or not. This vowel column comes in handy later, as you will see. The keypad table looks like below.

[Image: keypad table]
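As a rough sketch, the keypad could be modelled as a temp table like the one below. The column names val, rownum and colnum match the queries later in the post; the name of the vowel flag (isvowel) is my own assumption, and only the positions explicitly mentioned in the post are inserted, not the full keypad.

-- A minimal sketch of the keypad table (column name isvowel is assumed)
CREATE TABLE #keypad
(
    val     CHAR(1) PRIMARY KEY, -- the key itself
    rownum  INT NOT NULL,        -- row position on the keypad
    colnum  INT NOT NULL,        -- column position on the keypad
    isvowel TINYINT NOT NULL     -- 1 if the key is a vowel, 0 otherwise
);

INSERT INTO #keypad (val, rownum, colnum, isvowel)
VALUES ('A', 1, 1, 1),  -- A(1,1)
       ('C', 1, 3, 0),  -- C(1,3), one step above H
       ('E', 1, 5, 1),  -- E(1,5)
       ('H', 2, 3, 0),  -- H(2,3)
       ('1', 4, 2, 0);  -- 1(4,2)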

Knight’s moves as SQL query

When a knight moves, it arrives at a new location on the keypad. This new location has a row number and a column number, and they can be derived from the row and column number of the original location and the direction of the move. There are essentially eight cases (or moves) from each character.

One step vertically and two steps horizontally
(1) One step up and two steps to the left
(2) One step up and two steps to the right
(3) One step down and two steps to the left
(4) One step down and two steps to the right

One step horizontally and two steps vertically
(5) One step left and two steps up
(6) One step left and two steps down
(7) One step right and two steps up
(8) One step right and two steps down

For each case above, we can join the keypad table to itself on row and column number to find all valid combinations. Let’s take case 1 above and apply it to the position of H. H is at row 2 and column 3; let’s denote it as H(2,3). One step above H is C, and then two steps to the left is A, which is A(1,1). So if we reduce the row number by 1 and the column number by 2 from H, we arrive at A using the first valid knight’s move. That, essentially, is our self-join condition. In the same fashion, for case 2 above, to go from H(2,3) to E(1,5) we would reduce the row number but increase the column number by 2, since we are moving to the right. Here’s the sample query for the first case.

-- Case 1: one step up (rownum - 1) and two steps to the left (colnum - 2)
SELECT d1.val,
       d2.val AS c1_v1up_h2left,
       CASE
           WHEN d2.val IN ('A', 'E', 'I', 'O') THEN 1
           ELSE 0
       END AS isnextvowel
FROM   #keypad d1
       INNER JOIN #keypad d2
               ON d1.rownum - 1 = d2.rownum
              AND d2.colnum = d1.colnum - 2

We can apply the same reasoning to all eight cases, and that gives us eight different queries. The union of all eight queries will be our set of all combinations of valid moves from any character on the keypad. Also note that the inner join automatically eliminates positions which do not exist. For example, for 1(4,2) there is no case 3 from above, since there isn’t any row below it. Had we used a left join, we would have received a record for this case, but the next value would have been empty. Below are the valid moves from 1.

[Image: valid moves from 1]
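To illustrate the union described above, here is a hedged sketch of how the eight case queries might be stitched together into a #validcombination table. The column names val, nextval and isnextvowel come from the queries elsewhere in the post; only the first two cases are written out, and the remaining six follow the same pattern with different row/column offsets.

-- A sketch of building #validcombination from the eight case queries.
-- SELECT ... INTO in the first query applies to the whole union.
SELECT d1.val,
       d2.val AS nextval,
       CASE WHEN d2.val IN ('A', 'E', 'I', 'O') THEN 1 ELSE 0 END AS isnextvowel
INTO   #validcombination
FROM   #keypad d1
       INNER JOIN #keypad d2
               ON d2.rownum = d1.rownum - 1   -- case 1: one step up
              AND d2.colnum = d1.colnum - 2   -- two steps to the left
UNION ALL
SELECT d1.val,
       d2.val,
       CASE WHEN d2.val IN ('A', 'E', 'I', 'O') THEN 1 ELSE 0 END
FROM   #keypad d1
       INNER JOIN #keypad d2
               ON d2.rownum = d1.rownum - 1   -- case 2: one step up
              AND d2.colnum = d1.colnum + 2;  -- two steps to the right
-- ...UNION ALL the remaining six cases in the same way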

Ten sequence chain

Next, we want to find sequences of ten moves which contain at most two vowels. This is where the vowel column from the keypad table comes in handy. But first, let’s look at how we find the sequence.
We have a table at hand which contains every possible valid move from every character on the keypad. Let’s assume that we want to find only a three-character sequence. We start with our keypad table and join it to the valid combination table on the current value, so the next value column from the valid combination table becomes our second key in the sequence. If we now join the valid combination table to the result from above, we get the third value in the sequence; the join condition is the next valid character from the first result matched to the original value column of the second. We can repeat the process to get a 10-key sequence.

Here’s the sample query which finds the first three characters

SELECT d.val      AS s1,
       r1.nextval AS s2,
       r2.nextval AS s3
FROM   #keypad d
       INNER JOIN #validcombination r1
               ON d.val = r1.val
       INNER JOIN #validcombination r2
               ON r1.nextval = r2.val

Removing vowels

In the ten-join query, like the one above, each instance of the table has the vowel column available. If we sum the values of this column in the result set, it tells us how many vowels are in the sequence. So to filter, we simply put in a WHERE clause which checks that this sum does not exceed two. Using a mathematical operation is better than string comparison, so the SUM is more efficient than checking whether more than two characters are vowels. Here are a few records from the results; you can verify them yourself.

[Image: sample records from the results]
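For completeness, here is a sketch of what the full ten-key query with the vowel filter could look like. It simply extends the three-character query above with more joins; the isvowel column name on the keypad table is my assumption, the limit follows the puzzle’s “at most two vowels” rule, and #Results plus the final count come from the puzzle requirements. The actual code on GitHub may differ.

-- Sketch: chain of nine joins to #validcombination, one column per key,
-- with the vowel rule applied in the WHERE clause
SELECT d.val      AS s1,
       r1.nextval AS s2,
       r2.nextval AS s3,
       r3.nextval AS s4,
       r4.nextval AS s5,
       r5.nextval AS s6,
       r6.nextval AS s7,
       r7.nextval AS s8,
       r8.nextval AS s9,
       r9.nextval AS s10
INTO   #Results
FROM   #keypad d
       INNER JOIN #validcombination r1 ON d.val      = r1.val
       INNER JOIN #validcombination r2 ON r1.nextval = r2.val
       INNER JOIN #validcombination r3 ON r2.nextval = r3.val
       INNER JOIN #validcombination r4 ON r3.nextval = r4.val
       INNER JOIN #validcombination r5 ON r4.nextval = r5.val
       INNER JOIN #validcombination r6 ON r5.nextval = r6.val
       INNER JOIN #validcombination r7 ON r6.nextval = r7.val
       INNER JOIN #validcombination r8 ON r7.nextval = r8.val
       INNER JOIN #validcombination r9 ON r8.nextval = r9.val
-- at most two vowels in the whole ten-key sequence
WHERE  d.isvowel
       + r1.isnextvowel + r2.isnextvowel + r3.isnextvowel
       + r4.isnextvowel + r5.isnextvowel + r6.isnextvowel
       + r7.isnextvowel + r8.isnextvowel + r9.isnextvowel <= 2;

-- total number of possible sequences, as required by the puzzle
SELECT COUNT(*) AS total_sequences FROM #Results;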

Timings

This is where it gets interesting. I ran the same script on four different versions of SQL Server: SQL Server 2008 R2 SP3, SQL Server 2012 SP1, SQL Server 2014 and SQL Server 2016 CTP 3.2. Admittedly, they ran on different hardware, and SQL Server 2014 and SQL Server 2016 CTP were running in virtual machines, but the hardware was similar; each had four processor cores, 8 GB of memory and an SSD. In the case of the VMs, the configuration was exactly the same and the VM itself was hosted on an SSD. I did expect the timings to change, but the change was pretty stark. The results are given below. The first run was on a cold cache and subsequent ones on a warm cache. The average column at the end is the average of all five runs.

SQL Version          Run 1   Run 2   Run 3   Run 4   Run 5   Average
SQL Server 2008 R2   1316    1310    1313    1333    1303    1315
SQL Server 2012      1026    1066    1106    1120    1056    1074.8
SQL Server 2014      403     980     396     373     393     509
SQL Server 2016      470     453     400     390     370     416.6

My theory on the improvement in execution timings is that (a) the way temporary tables are handled has improved with each version of SQL Server, and (b) the SQL Server query optimizer has been improving with each version. The second point is probably the more relevant one. I did not check the query plans for each version, but maybe the answer lies there.

Further improvements to the solution

The code can certainly be improved, although I believe that unless the approach is radically altered, the gains will be marginal. It will follow the law of diminishing marginal performance improvements (I completely made that up), whereby the amount of effort needed to improve performance grows exponentially compared to the marginal benefit we get; in other words, we would have to put in a significant amount of effort to improve the performance by a little. Having said that:

  • Putting normal clustered and non-clustered indexes on the tables could have an impact (see the sketch after this list)
  • How would actual tables, instead of temporary tables, perform?
  • In SQL Server 2012 and onward, we can use columnstore indexes; would they have any impact? Which would be better, a nonclustered columnstore or a clustered one?
  • What if we created the temporary tables as in-memory tables in SQL Server 2014 and 2016; would that improve performance?
  • Any combination of the above?
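As a small example of the first idea, a plain nonclustered index on the valid-combination table (using the column names from the earlier queries) might look like the sketch below; whether it actually helps would need to be measured.

-- Sketch: index the join key of #validcombination and cover the other columns
CREATE NONCLUSTERED INDEX ix_validcombination_val
    ON #validcombination (val)
    INCLUDE (nextval, isnextvowel);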

Web scraping in R using rvest

I am not very conversant with web scraping, but I understand the importance of the technique given that a lot of very useful data is embedded in HTML pages. Hence I was very excited when I came across this blog post on the RStudio site which introduced a new package called rvest for web scraping. The GitHub repository of the package is here.

As an exercise in scraping web pages, I set out to get all the Exchange Traded Funds (ETF) data from the London Stock Exchange website.

First things first: load up the rvest package and set out the base URL and a download location where the HTML will be saved. You can do this without having to download the file, but there were some proxy settings in the environment I was working in which prevented me from doing that, so I opted to download the HTML, process it and then delete it.

library("rvest")
url <- "http://www.londonstockexchange.com/exchange/prices-and-markets/ETFs/ETFs.html"
download_folder <- "C:/R/"
etf_table <- data.frame()

The next thing to determine is how many pages there are in the ETF table. If you visit the URL, you will find that, just above the table where the ETFs are displayed, there is a string which tells us how many pages there are. It was 38 when I was writing the script. If you look at the source HTML, this string appears in a paragraph tag whose class is floatsx.

Time to call html_nodes to get the part of the HTML with a paragraph of class floatsx, and then run html_text to get the actual string. Then it’s a matter of taking a substring of the complete string to get the number of pages.

#find how many pages are there
download.file(url,paste(download_folder,"ETFs.html",sep=""))
html <- html(paste(download_folder,"ETFs.html",sep=""))
pages <- html_text(html_node(html,"p.floatsx"))
pages <- as.numeric(substr(pages,nchar(pages)-1,nchar(pages)))

Now that we know how many pages there are, we want to iterate over each page and get the ETF values from the table. Again we load up the HTML and call html_nodes, but this time we are looking at all the tables. On this page there is just one table, which displays all the ETF rates, so we are only interested in the first table.

#for each page
for (p in 1:pages) {
 cur_url <- paste(url,"?&page=",p,sep="")
 #download the file
 download.file(cur_url,paste(download_folder,p,".html",sep=""))
 #create html object
 html <- html(paste(download_folder,p,".html",sep=""))
 #look for tables on the page and get the first one
 table <- html_table(html_nodes(html,"table")[[1]])
 #only first 6 columns contain information that we need
 table <- table[1:6]
 #stick a timestamp at end
 table["Timestamp"] <- Sys.time()
 #add into the final results table
 etf_table <- rbind(etf_table,table)
 #remove the originally downloaded file
 file.remove(paste(download_folder,p,".html",sep=""))
}

#summary of the scraped ETF table
summary(etf_table)

As you can see, rvest makes scraping web data extremely simple, so give it a try. The markdown file and knitted HTML are available at the GitHub link below if you want to run it in your own environment.
Github link

A warning about SSIS Foreach Loop container to process files from folder

The SSIS Foreach Loop container is frequently used to process files from a specific folder. If the names of the files are not known beforehand, the usual way is to process all the files with a specific extension, for example all CSV files, or all XLSX files, etc.

Until recently I was under the impression that if you put “*.csv” in the Files textbox of the Collection tab in the Foreach Loop editor, SSIS would look for only CSV files. However, this is not true. It appears that when SSIS looks for specific file types in a folder, the search is pattern based. So if you put *.CSV, it will also process files with extensions like *.CSV_New or *.CSVOriginal.

To test this, I created a folder containing the following files, with variations on the CSV extension.

1. File1.CSV
2. File2.CSV_ARCHIEVED
3. File3.CSVARCHIEVED
4. File4.CSV.ARCHIEVED
5. File5.CSV.ARCHIEVED.201411232359
6. File6.CSV20141129
7. File7.A.CSV
8. File8.csv
9. File9.cSv
10. File10 i.e. no extension

I then created a simple SSIS package with a Foreach Loop container that iterates over these files and a Script Task within it that shows a message box with the current file name. The Files textbox of the Collection tab in the Foreach Loop editor contained *.CSV. On running the package, the following files were picked up by the container:
1. File1.CSV
2. File2.CSV_ARCHIEVED
3. File3.CSVARCHIEVED
4. File6.CSV20141129
5. File7.A.CSV
6. File8.csv
7. File9.cSv

Not a normal scenario to have files with various extensions in the same folder, but who knows 🙂
If you are not sure which files would be processed by the SSIS Foreach Loop container, the best thing to do is navigate to the folder in Windows Explorer and put the file extension in the search box. The files it returns are the ones SSIS would pick up.

Are you setting the ‘Slice’ property in SSAS partitions?

Are you setting the ‘Slice’ property on your SSAS partitions? If not, you should. Here’s what happened to me. We have a fairly large SSAS cube. Initially this cube had only two partitions: current data and historic data. The performance was degrading, so we partitioned the cube by month, since the majority of the queries are by month or below. This also helped us improve processing performance, because we can now process only those partitions for which data has changed, two or three at the most. On the downside, though, there are a large number of partitions in the cube, a fact we decided we could live with.
So I created the partitions and got the cube processed. I wrote a simple MDX query to observe how the cube responds in Profiler. Here is the query (slightly changed for anonymity).

SELECT
[Measures].[Measure Dummy] ON COLUMNS
FROM [Playbox]
WHERE [Period].[YearMonthDaySettlementPeriod].[Calendar Year].&[2010].&[4]

And here’s what I was seeing in profiler.

[Image: Without Slice, all the partitions are scanned]

Hang on! I am querying only April 2010 data, so why are there scans on all the other partitions? This defeated the whole purpose of partitioning by month, because queries were scanning all the partitions. As mentioned earlier, we had a large number of partitions and this was bound to have an impact. Even after creating the partitions, I was still not getting the benefit I wanted.

Now to the crux of the matter. As it turns out (after spending hours on Google and MSDN), the automatic slice detection in SSAS 2008 is not so effective. Hence, even though the partitions are clearly for a month, SSAS still has to scan all of them to get the data. There are a number of blogs which elaborate on the issue. I will just point to this, this and this.

So I set the Slice property on each partition, ran the same query and, guess what, it hit only the required partition.

[Image: With Slice defined, only the required partition is scanned]

The performance benefit was quite apparent as well. I selected a very common real-life query and noted the query timings using the AS Performance Workbench, which I mentioned in a previous blog post. As you can see in the images below, the response time improved dramatically.

[Image: Query duration without Slice]

[Image: Query duration with Slice]

So the bottom line is: “Always define the slice when you create an SSAS partition.” This has to go on my list of things to look for when building an SSAS cube.