Tips

Creating a new SSIS package? Have you thought about these things?

Creating a package in SSIS is easy, but creating a “good” SSIS package is a different story. As developers, we tend to jump right into building that wonderfully simple package and often overlook the nitty-gritty. Being an avid developer myself, I must confess that I have fallen prey to this from time to time. When I find myself in a rush to create “that” package, I take a step back and ask myself, “have you thought about these things?”. The list below is not comprehensive, but it is a good starting point, and I will hopefully keep it updated as I come across more issues. Please note that this is not a list of SSIS best practices or SSIS optimization tips; these are more the high-level things I keep at the back of my mind whenever I create a new SSIS package.

1. SSIS Configuration
Configuration has to be the No. 1 thing to think about when you start creating any SSIS package. It governs how the package will run in multiple environments, so it is absolutely necessary to pay particular attention to it. Some of the questions to ask yourself are listed below; a sketch of a SQL Server configuration table follows the list.

  • Where is the configuration stored? Is it an XML dtsConfig file or a SQL Server table?
  • How easy is it to change configured values?
  • What values are you going to store in configuration? Is it just connection managers, or should you be storing variable values as well?
  • What will happen if the package does not find a particular configured item? Will it fail? Will it do something it should not be doing?
  • How are you storing passwords, if there are any?
  • and so on..
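If you go down the SQL Server route, it helps to know what sits behind it. The table below is roughly what the package configuration wizard generates (the name and column sizes may differ in your environment), and changing a configured value per environment is then a one-line UPDATE. The filter name, connection manager name and server are made up for illustration.

-- Sketch of the default table created by the SSIS package configuration wizard (SQL Server type).
CREATE TABLE [dbo].[SSIS Configurations] (
	ConfigurationFilter NVARCHAR(255) NOT NULL -- logical group, e.g. 'SalesETL'
	,ConfiguredValue NVARCHAR(255) NULL -- the actual value, e.g. a connection string
	,PackagePath NVARCHAR(255) NOT NULL -- property path inside the package
	,ConfiguredValueType NVARCHAR(20) NOT NULL -- e.g. 'String'
	)

-- Pointing the package at a different environment is just an UPDATE.
UPDATE [dbo].[SSIS Configurations]
SET ConfiguredValue = 'Data Source=PRODSQL01;Initial Catalog=DW;Integrated Security=SSPI;'
WHERE ConfigurationFilter = 'SalesETL'
	AND PackagePath = '\Package.Connections[DW].Properties[ConnectionString]'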

2. SSIS Logging

There is certain minimum information each SSIS package must log. It is not only good practice but will make life much easier when a package fails in production and you want to know where and why it failed. As a rule of thumb, I think the following should be logged:

  • The start and end of the package
  • The start and end of each task
  • Any errors on tasks. You should log as much information as possible about the error, such as the error message, variable values when the error occurred, server names, the names of files being processed, etc.
  • Row counts in the data flow

Which log provider to use is entirely up to you, although I tend to log to a SQL Server database because it is easier to query that way.
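To give one hedged example: if you use the SQL Server log provider, SSIS writes to dbo.sysssislog (dbo.sysdtslog90 on SQL Server 2005), and a query along these lines answers most “where and why did it fail” questions. The package name 'MyPackage' is just a placeholder.

-- Sketch of a query against the SQL Server log provider table.
SELECT source -- package or task name
	,[event] -- OnPreExecute, OnPostExecute, OnError etc.
	,starttime
	,endtime
	,[message] -- full error text for OnError events
FROM dbo.sysssislog
WHERE executionid = (
		-- latest execution of the package being troubleshot
		SELECT TOP 1 executionid
		FROM dbo.sysssislog
		WHERE source = 'MyPackage'
		ORDER BY starttime DESC
		)
ORDER BY starttime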

3. Package restartability

Can you re-run the package as many times as you want? What if the package fails midway through an operation? Will it restart from where it failed? If it starts from the beginning, what will it do?

4. Is your package atomic?

By “atomic” I mean: is the package doing just one operation, like “load date” or “load customer”, or is it doing multiple operations, like “load date and update fact”? It is always a good idea to keep packages atomic. This helps with restartability as well as with debugging the ETL. If you think your package is doing multiple operations in one go, split it into multiple packages.

5. Are you using the correct SSIS tasks?

There are tasks that SSIS is good at and tasks that databases are good at. For example, databases are good at JOIN operations, whereas SSIS can connect to an FTP site with ease. Are you using the optimum task? Can your current operation be done in pure T-SQL? If yes, push it to the database.

6. Are you using event handlers?

Event handlers are great if you want to take alternative actions on certain events. For example, if the package fails, an OnError event handler can be used to reset tables or notify somebody.

7. Have you thought about data source?

How are you getting data from the data source? Is it the best way? Can you add a layer of abstraction between the data source and your SSIS package? If you are reading from a relational database, can you create views rather than hard-coding SQL queries? If you are reading from flat files, have you set the data types correctly?
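As a simple illustration of that abstraction layer, a thin view in the source (or staging) database means column renames or join changes are absorbed in one place instead of inside the package. The object names below are made up purely for the example.

-- Hypothetical abstraction layer: the SSIS data flow selects from the view,
-- so schema changes happen here, not inside the package.
CREATE VIEW dbo.vw_CustomerExtract
AS
SELECT c.CustomerID
	,c.CustomerName
	,a.City
	,a.Country
FROM dbo.Customer c
LEFT JOIN dbo.Address a
	ON a.CustomerID = c.CustomerID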

8. Naming convention

Is your package aptly named? Does it do what it says on the tin? Does the name convey meaningful information about what the package is doing?

The same rules apply to the variables in the package.

9. SSIS Task Names

Have you renamed the SSIS tasks, and are the names descriptive enough to convey the operation they are performing?

10. Documentation/Annotations

Is your package well documented? Does the documentation describe WHY it is doing something rather than WHAT it is doing? The former is considered good documentation, although in the case of SSIS I find even the latter very helpful, because a new person doesn’t have to dig through the package to understand what it is doing. SSIS annotations are great for in-package documentation and can be used effectively.

11. Is your package well structured both operationally and visually?

Can you box tasks into a series of Sequence Containers? Does your package flow nicely, either top-to-bottom or left-to-right? Are there any tasks hidden beneath others? Can a person looking at the package for the first time grasp what’s happening without digging into each task?

12. Are you using an ETL framework?

Having a generalized ETL framework will save a significant amount of time, because many of the repetitive tasks, such as logging, event handlers and configuration, can be set up in one “template” package, and all other packages can be based on this template.

Please leave a comment about what you think and whether there is anything you always keep in your mind when developing in SSIS. An older post of mine about things to be aware of while developing SSAS cubes is one of my favorites, and I can see this one becoming a favorite too!!


Web scraping in R using rVest

I am not very conversant with web scraping, but I understand the importance of the technique given that a lot of very useful data is embedded in HTML pages. Hence I was very excited when I came across this blog post on the RStudio site which introduced a new package called rvest for web scraping. The GitHub repository of the package is here.

As an exercise in scraping web pages, I set out to get all the Exchange Traded Funds (ETF) data from the London Stock Exchange web site.

First things first, load up the rvest package and set the base URL and a download location where the HTML will be saved. You can do this without downloading the file, but there were some proxy settings in the environment I was working in which prevented me from doing so. So I opted to download the HTML, process it and then delete it.

library("rvest")
url <- "http://www.londonstockexchange.com/exchange/prices-and-markets/ETFs/ETFs.html"
download_folder <- "C:/R/"
etf_table <- data.frame()

The next thing to determine is how many pages there are in the ETF table. If you visit the URL, you will find that just above the table where the ETFs are displayed there is a string which tells us how many pages there are. It was 38 when I was writing the script. If you look at the source HTML, this string appears in a paragraph tag whose class is floatsx.

Time to call html_node to get the part of the HTML with the paragraph of class floatsx, and then run html_text to get the actual string. Then it’s a matter of taking a substring of the complete string to get the number of pages.

#find how many pages are there
download.file(url,paste(download_folder,"ETFs.html",sep=""))
html <- html(paste(download_folder,"ETFs.html",sep=""))
pages <- html_text(html_node(html,"p.floatsx"))
pages <- as.numeric(substr(pages,nchar(pages)-1,nchar(pages)))

Now that we know how many pages there are, we want to iterate over each page and get the ETF values from the table. Again we load up the HTML and call html_nodes, but this time we are looking at all the tables. On this page there is just one table, which displays all the ETF rates, so we are only interested in the first one.

#for each page
for (p in 1:pages) {
 cur_url <- paste(url,"?&page=",p,sep="")
 #download the file
 download.file(cur_url,paste(download_folder,p,".html",sep=""))
 #create html object
 html <- html(paste(download_folder,p,".html",sep=""))
 #look for tables on the page and get the first one
 table <- html_table(html_nodes(html,"table")[[1]])
 #only first 6 columns contain information that we need
 table <- table[1:6]
 #stick a timestamp at end
 table["Timestamp"] <- Sys.time()
 #add into the final results table
 etf_table <- rbind(etf_table,table)
 #remove the originally downloaded file
 file.remove(paste(download_folder,p,".html",sep=""))

}

#summary of the scraped data
summary(etf_table)

As you can see, rvest makes scraping web data extremely simple, so give it a try. The R markdown file and knitted HTML are available on the GitHub link below if you want to run it in your own environment.
Github link

A warning about SSIS Foreach Loop container to process files from folder

The SSIS Foreach Loop Container is frequently used to process files from a specific folder. If the names of the files are not known beforehand, the usual way is to process all files with a specific extension, for example all CSV files or all XLSX files.

Until recently I was under the impression that if you put “*.csv” in the Files textbox of the Collection tab of the Foreach Loop editor, SSIS would look only for CSV files. However, this is not true. It appears that when SSIS looks for specific file types in a folder, the search is pattern based. So if you put *.CSV, it will also process files with extensions like .CSV_New or .CSVOriginal.

To test this, I created a folder containing the following files, with variations on the CSV extension.

1. File1.CSV
2. File2.CSV_ARCHIEVED
3. File3.CSVARCHIEVED
4. File4.CSV.ARCHIEVED
5. File5.CSV.ARCHIEVED.201411232359
6. File6.CSV20141129
7. File7.A.CSV
8. File8.csv
9. File9.cSv
10. File10 i.e. no extension

I then created a simple SSIS package with a Foreach Loop container to iterate over these files and a Script Task inside it to show a message box with the current filename. The Files textbox of the Collection tab of the Foreach Loop editor contained *.CSV. On running the package, the following files were picked up by the container.
1. File1.CSV
2. File2.CSV_ARCHIEVED
3. File3.CSVARCHIEVED
4. File6.CSV20141129
5. File7.A.CSV
6. File8.csv
7. File9.cSv

Not a normal scenario to have files with all these variations of extension in the same folder, but who knows. 🙂
If you are not sure which files would be processed by the SSIS Foreach Loop container, the best thing to do is to navigate to the folder in Windows Explorer and put the file extension in the search box. The files it finds are the files SSIS will pick up.

Table variables are fast but #Temp tables are faster!!

This blog post is sort of a note to myself: use-temp-tables-instead-of-table-variables-when-the-number-of-records-is-large. There it is, and I promise to remember it every time I write a DECLARE @something TABLE (… statement.
And now a bit of explanation. I recently worked on a project where we added a new fact table to our data warehouse. An SSIS package was created to get the data from source, do the dimension lookups and populate the fact table. Simple stuff, except that I had to do some data manipulation while selecting data from the source. Now, I could have just written a stored procedure and used that as the data source, but we are all aware of the pitfalls of that, so I decided to use a table-valued function (TVF). TVFs have predictable metadata output and SSIS plays nicely with them. However, you cannot use temp tables in a TVF, so invariably I ended up using a table variable. The package worked perfectly for smaller workloads, but as we increased the number of records in the table variable, performance got worse and worse. A casual Google search came up with a raft of blog posts suggesting that the culprit might be the table variable. So I changed the function to a stored procedure with a temp table and used the stored procedure in SSIS. Sure enough, our performance issue was resolved as if it had never existed. Now, there are many blog posts by very intelligent people on why temp tables perform better than table variables, but I wanted to check first hand, and here’s my little investigation.

1. Create two temp tables and populate them with random data. These have nothing to do with the actual results; they are just used for data generation.

/* Create base tables which will contain random generated data */
IF OBJECT_ID('tempdb..#basetable') IS NOT NULL
	DROP TABLE #basetable

IF OBJECT_ID('tempdb..#basetable2') IS NOT NULL
	DROP TABLE #basetable2

CREATE TABLE #basetable (
	id INT
	,pdate DATETIME
	,sdate DATETIME
	)

CREATE TABLE #basetable2 (
	id INT
	,pdate DATETIME
	);

WITH id
AS (
	SELECT id = 1
	
	UNION ALL
	
	SELECT id + 1
	FROM id
	WHERE id <= 1000
	)
INSERT INTO #basetable
SELECT id
	,cast(cast(RAND() * 100000 AS INT) AS DATETIME)
	,cast(cast(RAND() * 100000 AS INT) AS DATETIME)
FROM id
OPTION (MAXRECURSION 0)

;WITH id
AS (
	SELECT id = 1
	
	UNION ALL
	
	SELECT id + 1
	FROM id
	WHERE id <= 1000
	)
INSERT INTO #basetable2
SELECT id
	,cast(cast(RAND() * 100000 AS INT) AS DATETIME)
FROM id
OPTION (MAXRECURSION 0)

2. Get the records generated in the temp tables above into two table variables, join them, and update the first one with values from the second.

/* Get the records generated above into two table variables.
Join the table variables on the id column and update the pdate column */
DECLARE @table1 TABLE (
	id INT
	,pdate DATETIME
	,sdate DATETIME
	)
DECLARE @table2 TABLE (
	id INT
	,pdate DATETIME
	)

INSERT INTO @table1
SELECT id
	,pdate
	,sdate
FROM #basetable

INSERT INTO @table2
SELECT id
	,pdate
FROM #basetable2

UPDATE @table1
SET pdate = t2.pdate
FROM @table1 t1
LEFT JOIN @table2 t2
	ON t1.id = t2.id

If you look at the estimated and actual execution plans below, they are exactly the same. The SQL Server optimizer decides to use a nested loop join because there are no statistics on table variables.
Not to mention this is very slow, especially when I increased the number of records to over a million.

Estimated query execution plan when using table variables

Actual query execution plan when using table variables

3. Get the records generated in the temp tables from step 1 into another pair of temp tables, join them, and update the first one with values from the second.

 
/* Get the records generated above in two temp tables. 
Join the two temp tables on the id column and update the pdate column */

IF OBJECT_ID('tempdb..#table1') IS NOT NULL
	DROP TABLE #table1

IF OBJECT_ID('tempdb..#table2') IS NOT NULL
	DROP TABLE #table2

CREATE TABLE #table1 (
	id INT
	,pdate DATETIME
	,sdate DATETIME
	)

CREATE TABLE #table2 (
	id INT
	,pdate DATETIME
	)

INSERT INTO #table1
SELECT id
	,pdate
	,sdate
FROM #basetable

INSERT INTO #table2
SELECT id
	,pdate
FROM #basetable2

UPDATE #table1
SET pdate = t2.pdate
FROM #table1 t1
LEFT JOIN #table2 t2
	ON t1.id = t2.id

Now look at the estimated and actual execution plans. SQL Server estimated a nested loop join but actually used a hash match join, which is faster in this situation. SQL Server was able to do this because statistics are available on temporary tables.

Estimated query execution plan when using temp tables

Actual query execution plan when using temp tables

This does not mean that table variables are not useful in any situation. They are brilliant when the number of rows in them is small (usually fewer than 1,000 rows as a rule of thumb); creating a temp table in that situation would not give much benefit, if any. They are also light on the transaction log, since their scope is limited to the current execution context, and they lead to fewer stored procedure recompilations when used inside a stored procedure.
So next time you decide to use a table variable, have a second thought. Think a bit about the effect on the query execution plan, and definitely don’t use one if you are moving a large amount of data.

Displaying last cube processed time in SSAS

Recently, while browsing a cube in Excel, an end user asked me when it was last processed. I had no quick answer, so I had to connect to the SSAS server in Management Studio and check the properties of the cube. This got me thinking about how to expose the cube’s last processed date and time to end users through Excel or similar client tools. It is very helpful to them in that they can quickly check when the data was last added. Think about it like this: a user builds a pivot table report in Excel with a filter at the top which shows when the cube was processed. He then copies the pivot table data and sends it to a colleague. The colleague checks the numbers and instantly knows they are based on data from a certain date. Without the last processed date, the assumption could be made that the numbers are based on the latest set of data, which might not be true. So with that in mind, I decided to implement a simple solution to expose when the cube was last processed, and here are the steps.

1. Create a table in the data warehouse with just one column storing the date and time the cube processing finished. This table will always have one record (a T-SQL sketch of steps 1 and 2 follows this list).

2. Update the record in this table with the latest timestamp after the cube has been processed. This can be done in the ETL.

3. Import this table into the DSV and create a dimension based on it. This dimension will have just one attribute and one member, which will be the last time the cube was processed. You might want to be careful with the naming of this dimension; it should appear either at the top or the bottom of the field list in the pivot table so that it does not interfere with the analysis.

4. Set the IsAggregatable property of the attribute to False. This removes the ‘All’ member, because there is only ever going to be one member in the attribute.

5. Add this dimension to the Dimension Usage matrix. You don’t have to worry about attaching it to any measure group.

6. Add the new dimension to the perspectives if you have any, and that’s it. Deploy and process the cube.
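Here is a minimal T-SQL sketch of steps 1 and 2. The table and column names are just illustrative, and the UPDATE is meant to run in the ETL right after the cube has been processed.

-- Step 1: one-row table holding the cube's last processed timestamp (names are illustrative).
CREATE TABLE dbo.CubeLastProcessed (
	LastProcessedDateTime DATETIME NOT NULL
	)

INSERT INTO dbo.CubeLastProcessed (LastProcessedDateTime)
VALUES (GETDATE())

-- Step 2: run this from the ETL immediately after the cube has been processed.
UPDATE dbo.CubeLastProcessed
SET LastProcessedDateTime = GETDATE()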

To test it, connect to the cube in Excel, drag the attribute into the report filter, and it should show when the cube was last processed. Here’s a little screenshot from when I implemented this for the Adventure Works 2012 cube. And yes, don’t forget to process the dimension while processing the SSAS database.

Filter shows when the cube was last processed

Hope that was helpful. Any suggestions are always welcome.

Large date dimensions in SSAS

I would like to share two simple tips on the date dimension for when the date range in your date dimension is large and uncertain.

A brief background

I was recently working on a project where I was faced with two situations.

1. The range of dates in the date dimension was large, something from the 1970s to 2030 and beyond.

2. The source data could contain dates outside this range, i.e. before 1970 and after 2030, but we did not know the exact range for certain.

So here are the tips.

1. Add only required dates 

I could have just populated the date dimension with 200 years’ worth of dates, i.e. around 73,000 records, and been done with it. On closer inspection of the source data, however, I found that the data in the fact table would be sparse for years outside the main range: a few records with a date from 1955, a few from 1921, and so on. So why add all the extra rows for 1920 if the only date that is ever going to be used from that year is 01/02/1920? Even for future dates, why bother adding every date in 2070 if, for now, the only date I need is, let’s say, 23/09/2070?

To avoid fattening the date dimension, I created a static date range, i.e. the dates which are most often used. For dates outside this range I created a so-called ‘date sync’ mechanism. In a nutshell, at the end of the dimension load and before the beginning of the fact load, it goes through all the date fields in the source tables (which are in staging or the ODS by now) and makes sure all those dates are present in the date dimension. If a date is not, it simply creates a row for that particular day. It might seem a slow process, but since the data is in the relational engine by now it is quite fast. Plus, it makes sure that every date will always be present in the date dimension, so the ETL won’t fail due to foreign key constraints.
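A stripped-down sketch of that ‘date sync’ step is below. The staging table, its date column and the date dimension columns are all placeholders for whatever your own schema looks like.

-- Hypothetical 'date sync': add any dates seen in staging that are missing from the date dimension.
-- Runs after the dimension load and before the fact load.
INSERT INTO dbo.DimDate (DateKey, FullDate, [Year], [Month], [Day])
SELECT DISTINCT CONVERT(INT, CONVERT(CHAR(8), s.OrderDate, 112)) AS DateKey -- yyyymmdd surrogate key
	,s.OrderDate
	,YEAR(s.OrderDate)
	,MONTH(s.OrderDate)
	,DAY(s.OrderDate)
FROM staging.SalesOrder s
WHERE s.OrderDate IS NOT NULL
	AND NOT EXISTS (
		SELECT 1
		FROM dbo.DimDate d
		WHERE d.FullDate = s.OrderDate
		)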

2. Categorize dates

As mentioned before, our date range was wide, so slicing and dicing using the date hierarchy was painful because the years ran from the 1920s to 2030 and beyond. To make browsing a little less problematic, we introduced a year category field. When we asked the business users, they were most interested in data from the last 5 years to the next 10 years. So we added a derived column which categorized the dates into buckets: Pre-2008, the individual years 2008 to 2024, and Post-2024. We created an attribute based on this field in the date dimension, and our date hierarchy looked like this.

DateCategory -> Year -> Month -> Date

Now, when the users drag the date hierarchy onto rows in Excel, they see the years before 2008 under Pre-2008, then all the years between 2008 and 2024 (which they are most interested in), and then Post-2024. Nice and clean.
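The derived column behind the DateCategory attribute is nothing more than a CASE expression along these lines; the boundary years are the ones from our project and the DimDate names are illustrative.

-- Illustrative derived column for bucketing years outside the range users care about.
SELECT d.DateKey
	,d.FullDate
	,CASE 
		WHEN YEAR(d.FullDate) < 2008 THEN 'Pre-2008'
		WHEN YEAR(d.FullDate) > 2024 THEN 'Post-2024'
		ELSE CAST(YEAR(d.FullDate) AS CHAR(4))
		END AS DateCategory
FROM dbo.DimDate d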

Hopefully these tips will be helpful to you in some way or will give you a better idea for handling a large date dimension. If you have any suggestions, please feel free to drop me a line.

Select distinct values from multiple columns in same table

Selecting distinct values from a column is no rocket science, but what if you want distinct values from two, three or more columns of a table? I don’t want a distinct combination of values from multiple columns (which is what DISTINCT col1, col2, col3 etc. would give); I want the distinct values from the first column, then the second, then the third, and so on.
The simplest way would be to do SELECT DISTINCT on each column and then UNION them all. But I was looking for something a bit more concise; my table had ten date columns in which I wanted to find all the unique dates. Enter the PIVOT/UNPIVOT twins. If you think about it, all we need to do is UNPIVOT the columns we are interested in and then SELECT DISTINCT on the result. Simples! Of course, you would have to cast them to the same data type, but that’s fairly straightforward. So here is an example.

Consider a table like this

Order Number | Order Date | Payment Date | Shipped Date | Feedback Date
A123321      | 01/01/2013 | 02/01/2013   | 03/01/2013   | 10/01/2013
B890098      | 01/02/2013 | 06/02/2013   | 08/02/2013   | 11/02/2013
C678876      | 01/01/2013 | 08/02/2013   | 11/02/2013   | 03/03/2013
D342243      | 07/09/2013 | 09/09/1023   | 10/09/2013   | 11/09/2013

Let’s say we want to find all the unique dates in this table. I repeat, we don’t want to find combinations of unique dates, which is what DISTINCT would do; we want to find the unique dates across all the columns. As mentioned before, we can do it using DISTINCT & UNION as follows.

SELECT DISTINCT [Order Date] FROM [dbo].[Order]


UNION

SELECT DISTINCT [Payment Date] FROM [dbo].[Order]

UNION

and so on….

This can be done concisely using UNPIVOT as follows.

SELECT DISTINCT Dates from
(SELECT [Order Date]
,[Payment Date]
,[Shipped Date]
,[Feedback Date]
FROM [MyDb].[dbo].[Order] ) p
UNPIVOT (Dates FOR UniqueDates IN
([Order Date]
,[Payment Date]
,[Shipped Date]
,[Feedback Date])
) AS unpvt

If the columns had different data types, you would have to cast them to a common type in the inner SELECT, but the approach essentially remains the same, as sketched below. Hope that was helpful, and comments are always welcome.
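For completeness, here is what that cast variant might look like, assuming the four date columns all needed converting to DATETIME; the table and column names are reused from the example above.

-- Same idea, with explicit casts so all unpivoted columns share one data type.
SELECT DISTINCT Dates from
(SELECT CAST([Order Date] AS DATETIME) AS [Order Date]
,CAST([Payment Date] AS DATETIME) AS [Payment Date]
,CAST([Shipped Date] AS DATETIME) AS [Shipped Date]
,CAST([Feedback Date] AS DATETIME) AS [Feedback Date]
FROM [MyDb].[dbo].[Order] ) p
UNPIVOT (Dates FOR UniqueDates IN
([Order Date]
,[Payment Date]
,[Shipped Date]
,[Feedback Date])
) AS unpvt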