SQL Server

SQL Server 2016

SQL Server 2016 was announced in Summer 2015 and CTPs have been available since then. It comes with some really cool features; JSON support, PolyBase and Temporal Tables are some of my favorites. SSRS has seen a much-needed face-lift; in all honesty, the UI for SSRS was getting a bit dated. HTML rendering and Datazen integration are also welcome improvements. The ability to deploy Power BI Desktop reports to SSRS 2016 is also on the roadmap, which is very welcome news for organizations that aren't using Power BI (as in the cloud version) yet.

With that in mind, I have created the following presentation (or shall I call it a Sway?) which shows all the features I am looking forward to, with MSDN links for more information. I hope you find it useful too.

Please note that this is not an exhaustive list of features; these are just the features I am interested in.

https://sway.com/s/VwzKzCxF1UJTmNhX/embed

Knight’s Move Puzzle in TSQL

A colleague at work set us an interesting puzzle. It is a generic programming puzzle, but we were required to solve it with TSQL. The person who comes up with the fastest solution wins. Here's the puzzle in question.

Puzzle


Imagine a keypad like below.

[Image: keypad layout]

We want to find all of the unique 10 key sequences that can be entered into the keypad.

  • The initial key press can be any of the keys.
  • Each subsequent key press must be a knight move from the previous key press.
  • A key can be used any number of times.
  • There can be at most two vowels in each sequence.

A valid knight move is either

  • One step vertically and two steps horizontally
  • One step horizontally and two steps vertically

Below are examples of valid moves (from black to green).

[Image: examples of valid knight moves on the keypad]

The SQL Code must

  • Display the total number of possible sequences
  • Store all of the sequences in a Temp table named #Results
  • Not require any permissions other than public and connect on the database that the query window is connected to

The person whose code runs fastest wins.

Full disclaimer: I did not win the challenge. A couple of co-workers developed really innovative solutions, but I thought it was worth putting my approach on the blog so that I can come back to it later.
I took (probably) a very simplistic approach and was fairly happy with the results. But that's not the end of it. When I ran the same solution script on various SQL Server versions, the timings varied widely, and they got better with newer versions of SQL Server.

Solution


The complete code from below is available on my GitHub account.

Convert to table

Let's get the obvious out of the way: this was meant to be solved using TSQL, so a procedural approach was out of the question. TSQL, as we all know, excels at relational operations, and we will get the fastest solution if we exploit that. Next up, we want to use the other good part about RDBMSs in general: the ability to index and do fast lookups.

With that in mind, let's model the keypad as a table. Each key is unique, so that's our primary key, and each has a set of properties: row location, column location and whether or not it's a vowel. This vowel column comes in handy later, as you will see. The keypad table looks like below.

[Image: the #keypad table]
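As a reference, here is a minimal sketch of how that table could be created. The layout (A to E on row 1, F to J on row 2, K to O on row 3, and 1 to 3 on row 4, columns 2 to 4) is inferred from the worked examples later in this post, and the isvowel column name is my own; the actual script on GitHub may differ.

-- A minimal sketch of the keypad table (layout inferred from the examples below;
-- the isvowel column name is an assumption, not taken from the original script).
IF OBJECT_ID('tempdb..#keypad') IS NOT NULL
	DROP TABLE #keypad

CREATE TABLE #keypad (
	val CHAR(1) NOT NULL PRIMARY KEY
	,rownum INT NOT NULL
	,colnum INT NOT NULL
	,isvowel INT NOT NULL
	)

INSERT INTO #keypad (val, rownum, colnum, isvowel)
VALUES ('A',1,1,1),('B',1,2,0),('C',1,3,0),('D',1,4,0),('E',1,5,1),
       ('F',2,1,0),('G',2,2,0),('H',2,3,0),('I',2,4,1),('J',2,5,0),
       ('K',3,1,0),('L',3,2,0),('M',3,3,0),('N',3,4,0),('O',3,5,1),
       ('1',4,2,0),('2',4,3,0),('3',4,4,0)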

Knight’s moves as SQL query

When a knight moves, it arrives at a new location on the keypad. This new location has a row number and a column number, and they can be derived from the row and column numbers of the original location and the direction of the move. There are essentially eight cases (or moves) from each character.

One step vertically and two steps horizontally
(1) One step up and two steps on left
(2) One step up and two steps on right
(3) One step down and two steps on left
(4) One step down and two steps on right

One step horizontally and two steps vertically
(5) One step left and two steps up
(6) One step left and two steps down
(7) One step right and two steps up
(8) One step right and two steps down

For each case above, we can join the keypad table to itself on row and column number to find all valid combinations. Let's take case 1 above and apply it to the position of H. H is at row 2 and column 3; let's denote it as H(2,3). One step above H is C, and two steps to the left of that is A, which is A(1,1). So if we reduce the row number by 1 and the column number by 2 from H, we arrive at A using the first valid knight's move. That, essentially, is our self-join condition. In the same fashion, for case 2 above, to go from H(2,3) to E(1,5) we would reduce the row number but increase the column number by 2, since we are moving to the right. Here's the sample query for the first case:

SELECT d1.val
	,d2.val AS c1_v1up_h2left
	,CASE
		WHEN d2.val IN ('A', 'E', 'I', 'O') THEN 1
		ELSE 0
		END AS isnextvowel
FROM #keypad d1
INNER JOIN #keypad d2
	ON d2.rownum = d1.rownum - 1
		AND d2.colnum = d1.colnum - 2

We can apply the same reasoning to all eight cases, and that gives us eight different queries. The union of all eight queries is our set of all valid moves from any character on the keypad. Also note that the inner join automatically eliminates positions which do not exist. For example, for 1(4,2) there is no case 3 from above, since there isn't any row below 1. Had we used a left join, we would have received a record for this case, but the next value would have been NULL. Below are the valid moves from 1.

[Image: valid moves from the key 1]
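For completeness, here is a hedged sketch of how the eight case queries could be unioned into the #validcombination table used in the next section. Only the first two cases are written out; the remaining six follow the same pattern with different row and column offsets, and the column names carry over from the sample query above.

-- Sketch: union the eight knight-move queries into #validcombination.
-- Cases 3 to 8 are analogous and only change the +/- offsets.
SELECT d1.val
	,d2.val AS nextval
	,CASE WHEN d2.val IN ('A', 'E', 'I', 'O') THEN 1 ELSE 0 END AS isnextvowel
INTO #validcombination
FROM #keypad d1
INNER JOIN #keypad d2
	ON d2.rownum = d1.rownum - 1 -- case 1: one step up, two steps left
		AND d2.colnum = d1.colnum - 2

UNION ALL

SELECT d1.val
	,d2.val
	,CASE WHEN d2.val IN ('A', 'E', 'I', 'O') THEN 1 ELSE 0 END
FROM #keypad d1
INNER JOIN #keypad d2
	ON d2.rownum = d1.rownum - 1 -- case 2: one step up, two steps right
		AND d2.colnum = d1.colnum + 2
-- ...UNION ALL the remaining six cases here in the same way.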

Ten key sequence chain

Next, we want to find sequences of ten keys which contain at most two vowels. This is where the vowel column from the keypad table comes in handy. But first, let's look at how we find the sequence.
We have a table at hand which contains every possible valid move from every character on the keypad. Let's assume that we want to find only three-character sequences. We start with our keypad table and join it to the valid combination table on the current value, so the next value column from the valid combination table gives us the second key in the sequence. If we now join the valid combination table to the result from above, we get the third value in the sequence; the join condition is the next valid character from the first result against the original value column of the second. We can repeat the process to get a 10 key sequence.

Here’s the sample query which finds the first three characters

SELECT d.val AS s1
	,r1.nextval AS s2
	,r2.nextval AS s3
FROM #keypad d
INNER JOIN #validcombination r1
	ON d.val = r1.val
INNER JOIN #validcombination r2
	ON r1.nextval = r2.val

Removing vowels

In the full ten-join query, each instance of the table has its vowel column available. If we sum the values of this column across the result set, it tells us how many vowels there are in the sequence. So to filter out invalid sequences, we simply add a where clause which checks that this sum is at most two. Using a mathematical operation is better than string comparison, so the SUM is more efficient than checking whether more than two of the characters are vowels. Here are a few records from the results; you can verify them yourself.

[Image: sample records from the result set]
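To tie the pieces together, below is a hedged sketch of what the final query could look like end to end; the isvowel and isnextvowel column names carry over from the sketches above and are assumptions, so the actual script on GitHub may differ in detail. It stores the sequences in #Results and then displays the total count, as the puzzle requires.

-- Sketch: chain nine self-joins to build the 10 key sequences, keep only
-- those with at most two vowels, store them in #Results and count them.
SELECT d.val AS s1
	,r1.nextval AS s2
	,r2.nextval AS s3
	,r3.nextval AS s4
	,r4.nextval AS s5
	,r5.nextval AS s6
	,r6.nextval AS s7
	,r7.nextval AS s8
	,r8.nextval AS s9
	,r9.nextval AS s10
INTO #Results
FROM #keypad d
INNER JOIN #validcombination r1 ON d.val = r1.val
INNER JOIN #validcombination r2 ON r1.nextval = r2.val
INNER JOIN #validcombination r3 ON r2.nextval = r3.val
INNER JOIN #validcombination r4 ON r3.nextval = r4.val
INNER JOIN #validcombination r5 ON r4.nextval = r5.val
INNER JOIN #validcombination r6 ON r5.nextval = r6.val
INNER JOIN #validcombination r7 ON r6.nextval = r7.val
INNER JOIN #validcombination r8 ON r7.nextval = r8.val
INNER JOIN #validcombination r9 ON r8.nextval = r9.val
WHERE d.isvowel + r1.isnextvowel + r2.isnextvowel + r3.isnextvowel
	+ r4.isnextvowel + r5.isnextvowel + r6.isnextvowel + r7.isnextvowel
	+ r8.isnextvowel + r9.isnextvowel <= 2

SELECT COUNT(*) AS total_sequences
FROM #Results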

Timings

This is where it gets interesting. I ran the same script on four different versions of SQL Server: SQL Server 2008 R2 SP3, SQL Server 2012 SP1, SQL Server 2014 and SQL Server 2016 CTP 3.2. Admittedly, they ran on different hardware, and SQL Server 2014 and SQL Server 2016 CTP were running in virtual machines, but the hardware was similar; each had four processor cores, 8 GB of memory and an SSD. In the case of the VMs, the configuration was exactly the same and the VMs themselves were hosted on SSDs. I did expect the timings to change, but the change was pretty stark. The results are given below. The first run was on a cold cache and the subsequent ones were on a warm cache. The average column at the end is the average of all five runs.

SQL Version Run 1 Run 2 Run 3 Run 4 Run 5 Average
SQL Server 2008R2 1316 1310 1313 1333 1303 1315
SQL Server 2012 1026 1066 1106 1120 1056 1074.8
SQL Server 2014 403 980 396 373 393 509
SQL Server 2016 470 453 400 390 370 416.6

My theory on the improvement in execution timings is that (a) the way temporary tables are handled has been improved in each version of SQL Server, and (b) the SQL Server query optimizer has been improving with each version. The second point is probably the more relevant one. I did not check the query plans for each version, but maybe the answer lies there.

Further improvements to the solution

The code can certainly be improved, although I believe that unless the approach is radically altered, the gains will be marginal. It will follow the law of diminishing marginal performance improvements (I completely made that up), whereby the amount of effort needed to improve performance grows exponentially compared to the marginal performance benefit we get; in other words, we would have to put in a significant amount of effort to improve the performance by a little. Having said that, here are some avenues worth exploring:

  • Adding normal clustered and non-clustered indexes could have an impact (a small sketch follows this list)
  • How would actual tables, instead of temporary tables, perform?
  • In SQL Server 2012 and onwards we can use columnstore indexes; would they have any impact? Which would be better, a non-clustered columnstore or a clustered one?
  • What if we created the temporary tables as in-memory tables in SQL Server 2014 and 2016; would that improve performance?
  • Any combination of the above?
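For the first bullet, a minimal example of what such an experiment could look like, assuming the intermediate table from the sketches above (whether it actually helps would need to be measured):

-- Index the lookup columns of the valid-combination table before the ten-way join.
CREATE CLUSTERED INDEX cix_validcombination ON #validcombination (val, nextval)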

Creating a new SSIS package? Have you thought about these things?

Creating a package in SSIS is easy, but creating a “good” SSIS package is a different story. As developers, we tend to jump right into building that wonderfully simple package and often overlook the nitty-gritty. Being an avid developer myself, I must confess that I have fallen prey to this from time to time. So when I find myself in a rush to create “that” package, I take a step back and ask myself “have you thought about these things?”. The list below is not comprehensive, but it's a good starting point, and I will hopefully keep it updated as I come across more issues. Please note that this is not a list of SSIS best practices or SSIS optimization tips; these are more the high-level things I keep at the back of my mind whenever I am creating a new SSIS package.

1. SSIS Configuration
Configuration has to be the No. 1 thing to think about when you start creating any SSIS package. It governs how the package will run in multiple environments, so it is absolutely necessary to pay particular attention to it. Some of the questions to ask yourself are:

  • Where is the configuration stored? Is it an XML dtsConfig file or SQL Server? (See the sketch after this list.)
  • How easy is it to change configured values?
  • What values are you going to store in configuration? Is it just connection managers, or should you be storing some variable values as well?
  • What will happen if the package does not find a particular configured item? Would it fail? Would it do something it should not be doing?
  • How are you storing the passwords if there are any?
  • and so on..
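If SQL Server is the chosen store, the package configuration wizard can generate the configuration table for you. Its shape is roughly the sketch below; the default name and columns are reproduced from memory, so treat this as an approximation.

-- Approximate shape of the wizard-generated SSIS configuration table.
CREATE TABLE dbo.[SSIS Configurations] (
	ConfigurationFilter NVARCHAR(255) NOT NULL
	,ConfiguredValue NVARCHAR(255) NULL
	,PackagePath NVARCHAR(255) NOT NULL
	,ConfiguredValueType NVARCHAR(20) NOT NULL
	)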

2. SSIS Logging

There is certain minimum information each SSIS package must log. It is not only good practice, but it will make life much easier when a package fails in production and you want to know where and why it failed. As a rule of thumb, I think the following should be logged:

  • The start and end of the package
  • The start and end of each task
  • Any errors on tasks. You should log as much information as possible about the error such as the error message, variable values when the error occurred, server names, file names under processing etc.
  • Row counts in the data flow

Which log provider to use is entirely up to you, although I tend to create the log in a SQL Server database because it is easier to query that way.
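As an illustration only (the names below are hypothetical, not from any particular framework), a custom log table capturing the items above could look something like this:

-- Hypothetical custom SSIS log table; one row per logged event.
CREATE TABLE dbo.PackageExecutionLog (
	LogId INT IDENTITY(1, 1) PRIMARY KEY
	,PackageName NVARCHAR(255) NOT NULL
	,TaskName NVARCHAR(255) NULL
	,EventType NVARCHAR(50) NOT NULL -- e.g. PackageStart, TaskEnd, OnError
	,EventMessage NVARCHAR(MAX) NULL -- error text, variable values, file names etc.
	,RowsProcessed INT NULL -- row counts from the data flow
	,LoggedAt DATETIME2(3) NOT NULL DEFAULT SYSDATETIME()
	)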

3. Package restartability

Can you re-run the package as many times as you want? What if the package fails in the middle of an operation? Would it restart from where it failed? If it starts from the beginning, what would it do?

4. Is your package atomic?

By “atomic”, I mean: is it doing just one operation, like “load date” or “load customer”, or is it doing multiple operations, like “load date and update fact”? It is always a good idea to keep packages atomic. This helps with restartability, besides helping while debugging the ETL. If you think your package is doing multiple operations in one go, split it into multiple packages.

5. Are you using the correct SSIS tasks?

There are tasks that SSIS is good at and tasks that databases are good at. For example, databases are good at JOIN operations, whereas SSIS can connect to an FTP site with ease. Are you using the optimal task? Can your current operation be done in pure TSQL? If yes, push it to the database.

6. Are you using event handlers?

Event handlers are great if you want to take alternative actions on certain events. For example, if the package fails, OnError event handler can be used to reset tables or notify somebody.

7. Have you thought about data source?

How are you getting data from the data source? Is it the best way? Can you add a layer of abstraction between the data source and your SSIS package? If you are reading from a relational database, can you create views rather than hard-coding SQL queries? If you are reading from flat files, have you set the data types correctly?

8. Naming convention

Is your package aptly named? Does it do what it says on the tin? Does it convey meaningful information about what the package is doing?

Same rules also apply to variables in the package.

9. SSIS Task Names

Have you renamed the SSIS tasks, and are the names descriptive enough to convey the meaning of the operation they perform?

10. Documentation/Annotations

Is your package well documented? Does the documentation describe WHY it is doing something rather than WHAT it is doing? The former is considered good documentation, although in the case of SSIS I find that even the latter is very helpful, because a new person doesn't have to go through the package to understand what it is doing. SSIS annotations are great for in-package documentation and can be used effectively.

11. Is your package well structured both operationally and visually?

Can you box tasks into a series of sequence containers? Does your package look like a nice flow, either top-to-bottom or left-to-right? Are there any tasks hidden beneath others? Can a person looking at the package for the first time grasp what's happening without digging into each task?

12. Are you using an ETL framework?

Having a generalized ETL framework will save a significant amount of time, because many repetitive tasks such as logging, event handlers and configuration can be created in one “template” package, and all other packages can be based on this template.

Please leave a comment about what you think and whether there is anything that you always keep in mind when developing in SSIS. An older post of mine about things to be aware of while developing SSAS cubes is one of my favorites, and I can see this one becoming a favorite too!!

A warning about SSIS Foreach Loop container to process files from folder

The SSIS Foreach Loop Container is frequently used to process files from a specific folder. If the names of the files are not known beforehand, then the usual way is to process all the files with a specific extension, for example all CSV files or all XLSX files.

Until recently I was under the impression that if you put “*.csv” in the Files textbox of the Collection tab of the Foreach Loop editor, SSIS would look for only CSV files. However, this is not true. It appears that when SSIS looks for specific file types in a folder, the search is a pattern-based search. So if you put *.CSV, it will also process files with extensions like .CSV_New or .CSVOriginal.

To test this, I created a folder which contained following files with variations on extension CSV.

1. File1.CSV
2. File2.CSV_ARCHIEVED
3. File3.CSVARCHIEVED
4. File4.CSV.ARCHIEVED
5. File5.CSV.ARCHIEVED.201411232359
6. File6.CSV20141129
7. File7.A.CSV
8. File8.csv
9. File9.cSv
10. File10 i.e. no extension

I then created a simple SSIS package with a Foreach Loop container which iterates over these files, and a Script Task within it which shows a message box with the current filename. The Files textbox of the Collection tab of the Foreach Loop editor contained *.CSV. On running the package, the following files were picked up by the container:
1. File1.CSV
2. File2.CSV_ARCHIEVED
3. File3.CSVARCHIEVED
4. File6.CSV20141129
5. File7.A.CSV
6. File8.csv
7. File9.cSv

It is not a normal scenario to have files with such varied extensions in the same folder, but who knows. 🙂
If you are not sure which files would be processed by the SSIS Foreach Loop container, the best thing to do is navigate to the folder in Windows Explorer and put the file extension in the search box. Those are the files which SSIS would pick up.

Faster SSAS Processing by using Shared Memory Protocol

I came across this very handy tip to increase SSAS processing speed in the SQL Server 2008 R2 Analysis Services Operations Guide and thought it was worth sharing.
The guide recommends using the Shared Memory protocol when getting data from the relational data source, if that source is SQL Server and both SSAS and SQL Server are on the same physical box. Exchanging data over shared memory is much faster than over TCP/IP, as you probably already know. You will need to perform two steps to force the SSAS data source to use shared memory while querying the underlying SQL Server.
1. Check that the Shared Memory protocol is enabled in SQL Server Configuration Manager.
2. Prefix the server name in the SSAS data source connection string with lpc: like below.

Provider=SQLNCLI10.1;Data Source=lpc:ThisServer\INST1;Integrated Security=SSPI;Initial Catalog=WordMatch

The guide claims to have achieved 112,000 rows per second using TCP/IP versus 180,000 rows per second using shared memory, which is impressive. In my own test, a slightly modified Adventure Works cube took 56 seconds to process using TCP/IP versus 47 seconds using shared memory; an improvement of 16%.
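As a side note, one way to confirm which protocol a connection is actually using is to check sys.dm_exec_connections on the SQL Server side; for example, the query below (run from a query window, or with the WHERE clause adapted to the session used by the SSAS data source) reports 'Shared memory' or 'TCP'.

-- Check the network transport of the current connection.
SELECT session_id
	,net_transport
FROM sys.dm_exec_connections
WHERE session_id = @@SPID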

Calculated distinct count measures in SSAS

Distinct count measures are a fact of life for an SSAS developer. No matter how much we try to avoid them, they are always asked for and we have to deal with them somehow. Another painful fact is, as you might already know, that we cannot create multiple distinct count measures in the same measure group. So each distinct count measure has to sit in its own measure group which, IMO, does not look right when browsing the cube. In this post, I want to show a method of creating distinct count calculated measures, which I found on Richard Lees' blog (linked below) with a slight modification.
http://richardlees.blogspot.co.uk/2008/10/alternative-to-physical-distinct-count.html

Using Adventure Works, let’s say the end users want to know the distinct count of customers who have placed orders on the internet. I can add a calculation like this in the cube

CREATE MEMBER CURRENTCUBE.[Measures].[Unique Customers]
AS COUNT(NONEMPTY([Customer].[Customer].[Customer].members,
[Measures].[Internet Order Count])),
VISIBLE = 1 , ASSOCIATED_MEASURE_GROUP = 'Internet Orders' ;

This is all fine and dandy; however, as soon as I added any attribute from the customer dimension on the rows or filters, the results showed incorrect values, i.e. the same count of customers was repeated for all records.

The solution is to count the number of unique customers in the context of the current attributes of the customer dimension. For example's sake, let's take the Customer City attribute. I tweaked the calculation as below to count customers only in the context of the current members of the City attribute, and it started working as expected when the City attribute is used in rows, columns or filters.

CREATE MEMBER CURRENTCUBE.[Measures].[Unique Customers]
AS COUNT(NONEMPTY(
CROSSJOIN(
[Customer].[Customer].[Customer].members,
[Customer].[City].CurrentMember
),[Measures].[Internet Order Count])),
VISIBLE = 1 , ASSOCIATED_MEASURE_GROUP = 'Internet Orders' ;

Of course, you will have to add all the relevant dimension attributes to the CROSSJOIN, but ultimately a calculated, though complex, distinct count measure is better than having a number of physical distinct count measures, IMHO.

 

Table variables are fast but #Temp tables are faster!!

This blog post is sort of a note to myself: use-temp-tables-instead-of-table-variables-when-the-number-of-records-is-large. There it is, and I promise to remember it every time I write a DECLARE @something TABLE (… statement.
And now a bit of explanation. I recently worked on a project where we added a new fact table to our data warehouse. An SSIS package was created to get the data from the source, do the dimension lookups and populate the fact table. Simple stuff, except that I had to do some data manipulation while selecting data from the source. Now, I could have just written a stored procedure and used that as the data source, but we are all aware of the pitfalls of that, so I decided to use a table-valued function (TVF). TVFs have predictable metadata output and SSIS plays nicely with them. However, you cannot use temp tables in a TVF, so invariably I ended up using a table variable. The package worked perfectly for smaller workloads, but as we increased the number of records in the table variable, the performance problem became more and more severe. A casual Google search came up with a raft of blog posts suggesting that the culprit might be the table variable. So I changed the function to a stored procedure with a temp table and used the stored procedure in SSIS. Sure enough, our performance issue was resolved like it never existed. Now, there are many blog posts by very intelligent people on why temp tables perform better than table variables, but I wanted to check first hand, and here's my little investigation.
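For context, a stripped-down, hypothetical version of the kind of TVF I am describing is below; because temp tables are not allowed inside functions, a table variable is the only option (the function and source table names are made up for illustration).

-- Hypothetical multi-statement TVF: internally it must use a table variable.
CREATE FUNCTION dbo.fn_GetSourceData ()
RETURNS @result TABLE (
	id INT
	,pdate DATETIME
	,sdate DATETIME
	)
AS
BEGIN
	INSERT INTO @result (id, pdate, sdate)
	SELECT id
		,pdate
		,sdate
	FROM dbo.SourceTable -- plus whatever data manipulation is needed

	RETURN
END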

1. Create two temp tables and populate them with random data. These have nothing to do with the actual results; they are just used for data generation.

/* Create base tables which will contain random generated data */
IF OBJECT_ID('tempdb..#basetable') IS NOT NULL
	DROP TABLE #basetable

IF OBJECT_ID('tempdb..#basetable2') IS NOT NULL
	DROP TABLE #basetable2

CREATE TABLE #basetable (
	id INT
	,pdate DATETIME
	,sdate DATETIME
	)

CREATE TABLE #basetable2 (
	id INT
	,pdate DATETIME
	);

WITH id
AS (
	SELECT id = 1
	
	UNION ALL
	
	SELECT id + 1
	FROM id
	WHERE id <= 1000
	)
INSERT INTO #basetable
SELECT id
	,cast(cast(RAND() * 100000 AS INT) AS DATETIME)
	,cast(cast(RAND() * 100000 AS INT) AS DATETIME)
FROM id
OPTION (MAXRECURSION 0)

;WITH id
AS (
	SELECT id = 1
	
	UNION ALL
	
	SELECT id + 1
	FROM id
	WHERE id <= 1000
	)
INSERT INTO #basetable2
SELECT id
	,cast(cast(RAND() * 100000 AS INT) AS DATETIME)
FROM id
OPTION (MAXRECURSION 0)

2. Get the records generated in the temp tables above into two table variables, join them, and update the first one with values from the second one.

/* Get the records generated above in two table variables. 
Join the table variables on id column and update pdate column */
DECLARE @table1 TABLE (
	id INT
	,pdate DATETIME
	,sdate DATETIME
	)
DECLARE @table2 TABLE (
	id INT
	,pdate DATETIME
	)

INSERT INTO @table1
SELECT id
	,pdate
	,sdate
FROM #basetable

INSERT INTO @table2
SELECT id
	,pdate
FROM #basetable2

UPDATE @table1
SET pdate = t2.pdate
FROM @table1 t1
LEFT JOIN @table2 t2
	ON t1.id = t2.id

If you look at the estimated and actual execution plans below, they are exactly the same. The SQL Server optimizer decides to use a nested loop join because there are no statistics on table variables.
Not to mention this is very slow, especially when I increased the number of records to over a million.

[Image: estimated query execution plan when using table variables]

[Image: actual query execution plan when using table variables]

3. Get the records generated in step 1 into another set of temp tables, join them, and update the first one with values from the second one.

 
/* Get the records generated above in two temp tables. 
Join the two temp tables on id column and update pdate column */

IF OBJECT_ID('tempdb..#table1') IS NOT NULL
	DROP TABLE #table1

IF OBJECT_ID('tempdb..#table2') IS NOT NULL
	DROP TABLE #table2

CREATE TABLE #table1 (
	id INT
	,pdate DATETIME
	,sdate DATETIME
	)

CREATE TABLE #table2 (
	id INT
	,pdate DATETIME
	)

INSERT INTO #table1
SELECT id
	,pdate
	,sdate
FROM #basetable

INSERT INTO #table2
SELECT id
	,pdate
FROM #basetable2

UPDATE #table1
SET pdate = t2.pdate
FROM #table1 t1
LEFT JOIN #table2 t2
	ON t1.id = t2.id

Now look at the estimated and actual execution plans. SQL Server estimated a nested loop join but actually used a hash match, which is faster in this situation. SQL Server was able to do this because statistics are available on temporary tables.

[Image: estimated query execution plan when using temp tables]

[Image: actual query execution plan when using temp tables]

This does not mean that table variables are not useful in any situation. They are brilliant when the number of rows in them is small (usually fewer than 1,000 rows as a rule of thumb); creating a temp table in that situation would not give much benefit, if any. They are light on the transaction log as well, since their scope is limited to the current execution context, and they also lead to fewer stored procedure recompilations when used in stored procedures.
So next time you decide to use a table variable, have a second thought. Think a bit about the effect on the query execution plan, and definitely don't use them if you are moving a large amount of data.