SQL Server

SQL Server 2016

SQL Server 2016 was announced in the summer of 2015 and CTP builds have been available since then. It comes with some really cool features; JSON support, PolyBase and Temporal Tables are some of my favorites. SSRS has had a much needed face-lift; in all honesty, the SSRS UI was getting a bit dated. HTML rendering and Datazen integration are also welcome improvements. The ability to deploy Power BI Desktop reports to SSRS 2016 is also on the roadmap, which is very welcome news for organizations that aren't using Power BI (as in the cloud version) yet.

With that in mind, I have created the following presentation (or shall I call it a Sway) which shows all the features I am looking forward to, with an MSDN link for more information on each. Hope you find it useful too.

Please note that this is not an exhaustive list of features; these are just the features I am interested in.

https://sway.com/s/VwzKzCxF1UJTmNhX/embed

Knight’s Move Puzzle in TSQL

A colleague at work set us an interesting puzzle. It is a generic programming puzzle, but we were required to solve it with TSQL. The person who came up with the fastest solution would win. Here's the puzzle in question.

Puzzle


Imagine a keypad like below.

[Image: the keypad]

We want to find all of the unique 10-key sequences that can be entered into the keypad.

  • The initial key press can be any of the keys.
  • Each subsequent key press must be a knight move from the previous key press.
  • A key can be used any number of times.
  • There can be at most two vowels in each sequence.

A valid knight move is either

  • One step vertically and two steps horizontally
  • One step horizontally and two steps vertically

Examples of valid moves are shown below (from black to green).

[Image: examples of valid knight moves]

The SQL Code must

  • Display the total number of possible sequences
  • Store all of the sequences in a Temp table named #Results
  • Not require any permissions other than Public and Connect on the database that the query window is connected to

The person whose code runs fastest wins.

Full disclaimer: I did not win the challenge. A couple of co-workers developed really innovative solutions, but I thought it was worth putting my approach on the blog so that I can come back to it later.
I took what is probably a very simplistic approach and was fairly happy with the results. But that's not the end of it. When I ran the same solution script on various SQL Server versions, the timings varied widely, and they kept getting better with newer versions of SQL Server.

Solution


The complete code from below is available on my GitHub account.

Convert to table

Let's get the obvious out of the way: this was meant to be solved using TSQL, so a procedural approach was out of the question. TSQL, as we all know, excels at relational operations, and we will get the fastest solution if we exploit that. Next up, we want to use the other good part about RDBMSs in general: the ability to index and do fast lookups.

With that in mind, let's model the keypad as a table. Each letter is unique, so that's our primary key, and each has a set of properties: row location, column location and whether or not it is a vowel. This vowel column comes in handy later, as you will see. The keypad table looks like below.

[Image: the keypad table]
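
The full script is on GitHub as mentioned above, but as a minimal sketch the keypad table might be created like this. The column names are inferred from the queries later in the post, and the key layout is an assumption based on the coordinates mentioned in the text (A at row 1 column 1, H at row 2 column 3, 1 at row 4 column 2).

CREATE TABLE #keypad
(
    val     CHAR(1) NOT NULL PRIMARY KEY,  -- the key itself
    rownum  TINYINT NOT NULL,              -- row position on the keypad
    colnum  TINYINT NOT NULL,              -- column position on the keypad
    isvowel BIT     NOT NULL               -- 1 if the key is a vowel
);

-- Assumed layout: A-E on row 1, F-J on row 2, K-O on row 3, 1-3 on row 4.
INSERT INTO #keypad (val, rownum, colnum, isvowel)
VALUES ('A',1,1,1),('B',1,2,0),('C',1,3,0),('D',1,4,0),('E',1,5,1),
       ('F',2,1,0),('G',2,2,0),('H',2,3,0),('I',2,4,1),('J',2,5,0),
       ('K',3,1,0),('L',3,2,0),('M',3,3,0),('N',3,4,0),('O',3,5,1),
       ('1',4,2,0),('2',4,3,0),('3',4,4,0);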

Knight’s moves as SQL query

When a knight moves, it arrives at a new location on the keypad. This new location has a row number and a column number, and they can be derived from the row and column numbers of the original location and the direction of the move. There are essentially eight cases (or moves) from each character.

One step vertically and two steps horizontally
(1) One step up and two steps to the left
(2) One step up and two steps to the right
(3) One step down and two steps to the left
(4) One step down and two steps to the right

One step horizontally and two steps vertically
(5) One step left and two steps up
(6) One step left and two steps down
(7) One step right and two steps up
(8) One step right and two steps down

For each case above, we can join the keypad table to itself on row and column number to find all valid combinations. Let's take case 1 above and apply it to the position of H. H is at row 2 and column 3; let's denote it as H(2,3). One step above H is C, and then two steps to the left is A, which is A(1,1). So if we reduce the row number by 1 and the column number by 2 from H, we arrive at A using the first valid knight's move. That, essentially, is our self-join condition. In the same fashion, for case 2 above, to go from H(2,3) to E(1,5) we would reduce the row number but increase the column number by 2, since we are moving to the right. Here's the sample query for the first case.

SELECT d1.val,
       d2.val AS c1_v1up_h2left,
       CASE
           WHEN d2.val IN ('A', 'E', 'I', 'O') THEN 1
           ELSE 0
       END AS isnextvowel
FROM #keypad d1
INNER JOIN #keypad d2
    ON d2.rownum = d1.rownum - 1
   AND d2.colnum = d1.colnum - 2

We can apply the same reasoning to all eight cases, which gives us eight different queries. The union of all eight queries is our set of all valid moves from any character on the keypad. Also note that the inner join automatically eliminates positions that do not exist. For example, for 1(4,2) there is no case 3 move, since there isn't a row below the one that 1 is on. Had we used a left join, we would have received a record for this case, but the next value would have been NULL. Below are the valid moves from 1.

[Image: valid moves from 1]
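
The post builds this set as a UNION of the eight case queries. As a more compact, equivalent sketch, the whole set of valid moves can also be produced with a single self-join whose predicate covers all eight cases at once; the table and column names below (#validcombination, nextval, isnextvowel) are assumptions chosen to match the queries that follow.

-- Sketch: every valid knight move in one self-join instead of eight UNIONed queries.
-- A move is valid when the row changes by 1 and the column by 2, or vice versa.
SELECT d1.val,
       d2.val                  AS nextval,
       CAST(d2.isvowel AS INT) AS isnextvowel
INTO   #validcombination
FROM   #keypad d1
INNER JOIN #keypad d2
        ON (ABS(d1.rownum - d2.rownum) = 1 AND ABS(d1.colnum - d2.colnum) = 2)
        OR (ABS(d1.rownum - d2.rownum) = 2 AND ABS(d1.colnum - d2.colnum) = 1);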

Ten-key sequence chain

Next, we want to find sequences of ten moves that contain at most two vowels. This is where the vowel column from the keypad table comes in handy. But first, let's look at how we find the sequence.
We now have a table that contains every possible valid move from every character on the keypad. Let's assume that we want to find only three-character sequences. We start with the keypad table and join it to the valid combination table on the current value; the next value column from the valid combination table is then the second key of the sequence. If we now join the valid combination table to that result, we get the third value in the sequence; the join condition is the next value column from the first result against the original value column of the second. We can repeat the process to get a 10-key sequence.

Here's a sample query that finds the first three keys of a sequence.

SELECT d.val      AS s1,
       r1.nextval AS s2,
       r2.nextval AS s3
FROM #keypad d
INNER JOIN #validcombination r1
    ON d.val = r1.val
INNER JOIN #validcombination r2
    ON r1.nextval = r2.val

Removing vowels

In the ten-join query built like the one above, each instance of the table has the vowel column available. If we sum the values of this column in the result set, that tells us how many vowels are in the sequence. So to filter out the invalid sequences, we simply put in a WHERE clause that checks that this sum is at most two. Using a mathematical operation is better here than string comparison, so the SUM is more efficient than checking character by character whether more than two keys are vowels. Here are a few records from the results; you can verify them yourself.

[Image: sample records from the results]
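
For completeness, here is a sketch of what the final query might look like, chaining nine joins onto the keypad table, filtering on the vowel sum and storing the sequences in #Results as the puzzle requires. It assumes the #keypad and #validcombination shapes from the sketches above; the actual script on GitHub may differ in detail.

-- Sketch: build the ten-key sequences, keep those with at most two vowels,
-- store them in #Results and report the total.
SELECT d.val      AS s1,
       r1.nextval AS s2,  r2.nextval AS s3,  r3.nextval AS s4,
       r4.nextval AS s5,  r5.nextval AS s6,  r6.nextval AS s7,
       r7.nextval AS s8,  r8.nextval AS s9,  r9.nextval AS s10
INTO   #Results
FROM   #keypad d
INNER JOIN #validcombination r1 ON d.val      = r1.val
INNER JOIN #validcombination r2 ON r1.nextval = r2.val
INNER JOIN #validcombination r3 ON r2.nextval = r3.val
INNER JOIN #validcombination r4 ON r3.nextval = r4.val
INNER JOIN #validcombination r5 ON r4.nextval = r5.val
INNER JOIN #validcombination r6 ON r5.nextval = r6.val
INNER JOIN #validcombination r7 ON r6.nextval = r7.val
INNER JOIN #validcombination r8 ON r7.nextval = r8.val
INNER JOIN #validcombination r9 ON r8.nextval = r9.val
WHERE  CAST(d.isvowel AS INT)
       + r1.isnextvowel + r2.isnextvowel + r3.isnextvowel
       + r4.isnextvowel + r5.isnextvowel + r6.isnextvowel
       + r7.isnextvowel + r8.isnextvowel + r9.isnextvowel <= 2;

SELECT COUNT(*) AS TotalSequences FROM #Results;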

Timings

This is where it gets interesting. I ran the same script on four different versions of SQL Server: SQL Server 2008 R2 SP3, SQL Server 2012 SP1, SQL Server 2014 and SQL Server 2016 CTP 3.2. Admittedly, they ran on different hardware, and SQL Server 2014 and SQL Server 2016 CTP were running in virtual machines, but the hardware was similar; each had four processor cores, 8 GB of memory and an SSD. In the case of the VMs, the configuration was exactly the same and the VMs themselves were hosted on SSDs. I did expect the timings to change, but the change was pretty stark. The results are given below. The first run was on a cold cache and the subsequent ones on a warm cache. The Average column at the end is the average of all five runs.

SQL Version        | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Average
SQL Server 2008 R2 |  1316 |  1310 |  1313 |  1333 |  1303 | 1315
SQL Server 2012    |  1026 |  1066 |  1106 |  1120 |  1056 | 1074.8
SQL Server 2014    |   403 |   980 |   396 |   373 |   393 | 509
SQL Server 2016    |   470 |   453 |   400 |   390 |   370 | 416.6

My theory on the improvements in execution timings is that (a) the way temporary tables are handled has been improved in each version of SQL Server, and (b) the SQL Server query optimizer has been improving with each version. The second point is probably the more relevant one, since we know the optimizer is improved with each version of SQL Server. I did not check the query plans for each version, but maybe the answer lies there.

Further improvements to the solution

The code can certainly be improved, although I believe that unless the approach is radically altered, the gains will be marginal. It will follow the law of diminishing marginal performance improvements (I completely made that up), whereby the amount of effort to improve performance grows exponentially compared to the marginal benefit we get; in other words, we would have to put in a significant amount of effort to improve the performance by a little. Having said that, here are a few things worth trying:

  • Adding normal clustered and nonclustered indexes could have an impact (see the sketch after this list)
  • How would actual tables, instead of temporary tables, perform?
  • In SQL Server 2012 and onwards we can use columnstore indexes; would they have any impact? Which would be better, a nonclustered columnstore or a clustered one?
  • What if we created the temporary tables as in-memory tables in SQL Server 2014 and 2016; would that improve performance?
  • Any combination of the above?
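
On the first bullet, the ten-key query repeatedly looks up #validcombination on its val column, so an index on that column would be the obvious first thing to test. A sketch, not part of the original script:

-- Sketch: a clustered index on the join column of the valid-moves table.
CREATE CLUSTERED INDEX ix_validcombination_val
    ON #validcombination (val);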

Faster SSAS Processing by using Shared Memory Protocol

I came across this very handy tip to increase SSAS processing speed in the SQL Server 2008 R2 Analysis Services Operations Guide and thought it was worth sharing.
The guide recommends using the Shared Memory protocol when getting data from the relational data source, if the source is SQL Server and both SSAS and SQL Server are on the same physical box. Exchanging data over shared memory is much faster than over TCP/IP, as you probably already know. You need to perform two steps to force the SSAS data source to use shared memory when querying the underlying SQL Server.
1. Check that the Shared Memory protocol is enabled in SQL Server Configuration Manager.
2. Prefix the server name in the SSAS data source connection string with lpc: as shown below.

Provider=SQLNCLI10.1;Data Source=lpc:ThisServer\INST1;Integrated Security=SSPI;Initial Catalog=WordMatch

The guide claims to have achieved 112,000 rows per second using TCP/IP versus 180,000 rows per second using shared memory, which is impressive. In my own test, a slightly modified Adventure Works cube took 56 seconds to process using TCP/IP and 47 seconds using shared memory; an improvement of about 16%.

Calculated distinct count measures in SSAS

Distinct count measures are a fact of life for an SSAS developer. No matter how much we try to avoid them, they are always asked for and we have to deal with them somehow. Another painful fact is that, as you might already know, we cannot create multiple distinct count measures in the same measure group. So each distinct count measure has to sit in its own measure group, which, IMO, does not look right when browsing the cube. In this post, I want to show a method of creating distinct count calculated measures which I found on Richard Lees' blog (linked below), with a slight modification.
http://richardlees.blogspot.co.uk/2008/10/alternative-to-physical-distinct-count.html

Using Adventure Works, let's say the end users want to know the distinct count of customers who have placed orders on the internet. I can add a calculation like this to the cube:

CREATE MEMBER CURRENTCUBE.[Measures].[Unique Customers]
AS COUNT(NONEMPTY([Customer].[Customer].[Customer].members,
[Measures].[Internet Order Count])),
VISIBLE = 1 , ASSOCIATED_MEASURE_GROUP = 'Internet Orders' ;

This is all fine and dandy; however, as soon as I added any attribute from the customer dimension on rows or filters, the results showed incorrect values, i.e. the same count of customers was repeated for all records.

The solution is to count the number of unique customers in the context of the current attributes of the customer dimension. For example's sake, let's take the Customer City attribute. I tweaked the calculation as below to count customers only in the context of the current member of the City attribute, and it started working as expected when the Customer City attribute is used on rows, columns or filters.

CREATE MEMBER CURRENTCUBE.[Measures].[Unique Customers]
AS COUNT(NONEMPTY(
CROSSJOIN(
[Customer].[Customer].[Customer].members,
[Customer].[City].CurrentMember
),[Measures].[Internet Order Count])),
VISIBLE = 1 , ASSOCIATED_MEASURE_GROUP = 'Internet Orders' ;

Of course, you will have to add all the dimension attributes to the CROSSJOIN, but ultimately a calculated, though complex, distinct count measure is better than having a number of physical distinct count measures, IMHO.

 

Displaying last cube processed time in SSAS

Recently, while browsing a cube in Excel, an end user asked me when it was last processed. I had no quick answer, so I had to connect to the SSAS server in Management Studio and check the properties of the cube. This got me thinking about how to expose the cube's last processed date and time to end users through Excel or similar client tools. It is very helpful to them in that they can quickly check when the data was last added. Think about it like this: a user builds a pivot table report in Excel with a filter at the top which shows when the cube was processed. He then copies the pivot table data and sends it to a colleague. The colleague checks the numbers and instantly knows that they are based on data from a certain date. Without the last processed date, the assumption could be made that the numbers are based on the latest set of data, which might not be true. So with that in mind, I decided to implement a simple solution to expose when the cube was last processed. Here are the steps.

1. Create a table in the data warehouse with just one column storing the date and time when the cube processing finished. This table will always have one record.

2. Update the record in this table with the latest timestamp after the cube has been processed. This can be done in the ETL (a T-SQL sketch of steps 1 and 2 follows the list of steps).

3. Import this table into the DSV and create a dimension based on it. This dimension will have just one attribute and one member, which will be the last time the cube was processed. You might want to be careful with the naming of this dimension: it should appear either at the top or the bottom of the field list in the pivot table so that it does not interfere with the analysis.

4. Set the IsAggregatable property of the attribute to False. This removes the 'All' member, because there is only ever going to be one member in this attribute.

5. Add this dimension to the Dimension Usage matrix. You don't have to worry about attaching it to any measure group.

6. Add the new dimension to the perspectives, if you have any, and that's it. Deploy and process the cube.
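
As mentioned in step 2, steps 1 and 2 are plain T-SQL; a minimal sketch might look like this (the table and column names are made up for illustration):

-- Step 1: a one-row table holding the time the last cube processing finished.
CREATE TABLE dbo.CubeLastProcessed
(
    LastProcessedDateTime DATETIME NOT NULL
);
INSERT INTO dbo.CubeLastProcessed (LastProcessedDateTime) VALUES (GETDATE());

-- Step 2: run from the ETL once cube processing has completed.
UPDATE dbo.CubeLastProcessed
SET    LastProcessedDateTime = GETDATE();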

To test, connect to the cube in Excel, drag the attribute into the report filter, and it should show when the cube was last processed. Here's a little screenshot from when I implemented this for the Adventure Works 2012 cube. And yes, don't forget to process the dimension while processing the SSAS database.

[Image: CubeLastProcessedDateTime]

Filter shows when the cube was last processed

Hope that was helpful. Any suggestions are always welcome.

Large date dimensions in SSAS

I would like to share two simple tips for when the date range in your date dimension is large and uncertain.

A brief background

I was recently working on a project where I was faced with two situations.

1. The range of dates in the date dimension was large; something from the 1970s to 2030 and beyond.

2. The source data could contain dates outside of this range, i.e. before 1970 and after 2030, but we did not know the exact range for certain.

So here are the tips.

1. Add only required dates 

I could have just populated the date dimension with 200 years' worth of dates, i.e. around 73,000 records, and been done with it. On closer inspection of the source data, however, I found that the fact data would be sparse for years outside the main range. There would be a few records with a date from 1955, a few from 1921 and so on. So why add all those extra rows for the year 1920 if the only date that is ever going to be used from that year is 01/02/1920? Even for future dates, why bother adding all the dates from 2070 if, for now, the only date I need is, let's say, 23/09/2070?

To avoid bloating the date dimension, I populated it with a static date range, i.e. the dates which are most often used. For dates outside of this range, I created a so-called 'date sync' mechanism. In a nutshell, at the end of the dimension load and before the beginning of the fact load, it goes through all the date fields in the source tables (which are in staging or the ODS by now) and makes sure that all dates are present in the date dimension. If a date is not there, it simply creates a row for that particular day. It might seem a slow process, but since the data is in the relational engine by then, it is quite fast. Plus, it makes sure that the date will always be present in the date dimension, so the ETL won't fail due to foreign key constraints.
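
As a rough sketch of what the 'date sync' might look like for a single date column (all object and column names here are hypothetical; the real implementation would cover every date column in every source table):

-- Sketch: add any order dates that are missing from the date dimension.
INSERT INTO dbo.DimDate (DateKey, FullDate)
SELECT DISTINCT CONVERT(INT, CONVERT(CHAR(8), s.OrderDate, 112)) AS DateKey,  -- yyyymmdd surrogate key
                CONVERT(DATE, s.OrderDate)                       AS FullDate
FROM   staging.Orders AS s
WHERE  s.OrderDate IS NOT NULL
  AND  NOT EXISTS (SELECT 1
                   FROM   dbo.DimDate AS d
                   WHERE  d.FullDate = CONVERT(DATE, s.OrderDate));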

2. Categorize dates

As mentioned before, our date range was wide, so slicing and dicing using the date hierarchy was painful because the years ran from the 1920s to 2030 and beyond. To make browsing a little less problematic, we introduced a year category field. When we asked the business users, they were most interested in data from the last 5 years to the next 10 years. So we added a derived column which categorized the dates into buckets like Pre-2008, 2008, ... and Post-2024. We created an attribute based on this field in the date dimension, and our date hierarchy looked like this:

DateCategory -> Year -> Month -> Date

Now, when the users dragged the date hierarchy onto rows in Excel, they would see the years before 2008 under Pre-2008, then all the years between 2008 and 2024 (which they were most interested in), and then Post-2024. Nice and clean.
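
The bucketing itself can be a simple derived column; a sketch along these lines, with the boundaries taken from the Pre-2008/Post-2024 buckets above and the column names assumed:

-- Sketch: derive the date category from the calendar year.
SELECT d.FullDate,
       d.CalendarYear,
       CASE
           WHEN d.CalendarYear < 2008 THEN 'Pre-2008'
           WHEN d.CalendarYear > 2024 THEN 'Post-2024'
           ELSE CAST(d.CalendarYear AS CHAR(4))
       END AS DateCategory
FROM   dbo.DimDate AS d;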

Hopefully these will be helpful to you in some way, or might give you a better idea for handling a large date dimension. If you have any suggestions, please feel free to drop me a line.

Select distinct values from multiple columns in same table

Selecting distinct values from a column is no rocket science, but what if you want distinct values from two, three or more columns of a table? I don't want a distinct combination of values from multiple columns (which is what DISTINCT col1, col2, col3, etc. would do); I want distinct values from the first column, then the second, then the third and so on.
The simplest way would be to do a SELECT DISTINCT on each column and then UNION them all. But I was looking for something a bit more concise; my table had ten date columns in which I wanted to find all the unique dates. Enter the PIVOT/UNPIVOT twins. If you think about it, all we need to do is UNPIVOT the columns we are interested in and then select DISTINCT on them. Simples! Of course, you would have to cast them to the same data type, but I suppose that's fairly straightforward. So here is an example.

Consider a table like this

Order Number | Order Date | Payment Date | Shipped Date | Feedback Date
A123321      | 01/01/2013 | 02/01/2013   | 03/01/2013   | 10/01/2013
B890098      | 01/02/2013 | 06/02/2013   | 08/02/2013   | 11/02/2013
C678876      | 01/01/2013 | 08/02/2013   | 11/02/2013   | 03/03/2013
D342243      | 07/09/2013 | 09/09/2013   | 10/09/2013   | 11/09/2013

Let's say we want to find all the unique dates in this table. I repeat: we don't want to find combinations of unique dates, which is what DISTINCT would do; we want to find the unique dates across all the columns. As mentioned before, we can do it using DISTINCT & UNION as follows

SELECT DISTINCT [Order Date] FROM [dbo].[Order]


UNION

SELECT DISTINCT [Payment Date] FROM [dbo].[Order]

UNION

and so on….

This can be done concisely using UNPIVOT as follows.

SELECT DISTINCT Dates
FROM  (SELECT [Order Date]
             ,[Payment Date]
             ,[Shipped Date]
             ,[Feedback Date]
       FROM [MyDb].[dbo].[Order]) p
UNPIVOT (Dates FOR UniqueDates IN
            ([Order Date]
            ,[Payment Date]
            ,[Shipped Date]
            ,[Feedback Date])
        ) AS unpvt

If the columns had different data types, you would have to cast them, but the approach essentially remains the same.
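
For instance, if one of the columns were DATETIME and the rest DATE, the casts would go inside the subquery; a quick sketch using the same hypothetical table as above:

-- Sketch: align the data types before UNPIVOTing.
SELECT DISTINCT Dates
FROM  (SELECT CAST([Order Date]    AS DATE) AS [Order Date],
              CAST([Payment Date]  AS DATE) AS [Payment Date],
              CAST([Shipped Date]  AS DATE) AS [Shipped Date],
              CAST([Feedback Date] AS DATE) AS [Feedback Date]
       FROM   [MyDb].[dbo].[Order]) p
UNPIVOT (Dates FOR UniqueDates IN
            ([Order Date], [Payment Date], [Shipped Date], [Feedback Date])
        ) AS unpvt;

Hope that was helpful, and comments are always welcome.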