ETL

Creating a new SSIS package? Have you thought about these things?

Creating a package in SSIS is easy, but creating a “good” SSIS package is a different story. As developers, we tend to jump right into building that wonderfully simple package and often overlook the nitty-gritty. Being an avid developer myself, I must confess that I have fallen prey to this from time to time. When I find myself in a rush to create “that” package, I take a step back and ask myself, “have you thought about these things?”. The list below is not comprehensive, but it’s a good starting point, and I will hopefully keep it updated as I come across more issues. Please note that this is not a list of SSIS best practices or SSIS optimization tips; these are higher-level things I keep at the back of my mind whenever I create a new SSIS package.

1. SSIS Configuration
Configuration has to be the No. 1 thing to think about when you start creating any SSIS package. It governs how the package runs in multiple environments, so it is absolutely necessary to pay particular attention to it. Some of the questions to ask yourself are:

  • Where is the configuration stored? Is it an XML (.dtsConfig) file or a SQL Server table?
  • How easy is it to change configured values? (See the sketch after this list.)
  • What values are you going to store in configuration? Is it just connection managers, or should you be storing variable values as well?
  • What will happen if the package does not find a particular configured item? Will it fail? Will it do something it should not be doing?
  • How are you storing passwords, if there are any?
  • …and so on.
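
If you opt for SQL Server configurations, it is worth knowing exactly where the values live. Below is a minimal sketch, assuming the default [SSIS Configurations] table that the SQL Server configuration type creates; the filter and package path are hypothetical examples.

    -- Inspect what is configured under a (hypothetical) filter
    SELECT ConfigurationFilter, PackagePath, ConfiguredValue, ConfiguredValueType
    FROM   dbo.[SSIS Configurations]
    WHERE  ConfigurationFilter = 'MyPackage';

    -- Changing a configured value is then a simple UPDATE
    UPDATE dbo.[SSIS Configurations]
    SET    ConfiguredValue = 'Data Source=PRODSQL01;Initial Catalog=Sales;Integrated Security=SSPI;'
    WHERE  ConfigurationFilter = 'MyPackage'
      AND  PackagePath = '\Package.Connections[Source].Properties[ConnectionString]';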

2. SSIS Logging

There is certain minimum information each SSIS package must log. It is not only good practice, but it will make life much easier when a package fails in production and you want to know where and why it failed. As a rule of thumb, I think the following should be logged:

  • The start and end of the package
  • The start and end of each task
  • Any errors on tasks. You should log as much information as possible about the error, such as the error message, variable values when the error occurred, server names, the names of files being processed etc.
  • Row counts in the data flow

Which log provider to use is entirely up to you, although I tend to create the log in a SQL Server database because it is easier to query that way.
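
As a quick illustration, this is the kind of query I run when a package fails, assuming the SQL Server log provider's default dbo.sysssislog table (dbo.sysdtslog90 on SQL Server 2005):

    -- Find where and why the most recent execution failed
    SELECT   source, starttime, endtime, message
    FROM     dbo.sysssislog
    WHERE    event IN ('OnError', 'OnTaskFailed')
             AND executionid = (SELECT TOP 1 executionid
                                FROM dbo.sysssislog
                                ORDER BY starttime DESC)
    ORDER BY starttime;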

3. Package restartability

Can you re-run the package as many times as you want? What if the package fails partway through an operation? Will it restart from where it failed? If it starts from the beginning, what will it do?
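
One simple pattern that makes re-runs safe is to delete the slice you are about to load before inserting it, so running the package twice cannot duplicate data. A sketch with hypothetical table and variable names:

    DECLARE @LoadDate date = CAST(GETDATE() AS date);  -- the slice being loaded

    -- Idempotent: remove the slice first so a re-run cannot duplicate it
    DELETE FROM dbo.FactSales
    WHERE  LoadDate = @LoadDate;

    INSERT INTO dbo.FactSales (OrderId, CustomerKey, Amount, LoadDate)
    SELECT OrderId, CustomerKey, Amount, @LoadDate
    FROM   staging.Sales;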

4. Is your package atomic?

By “atomic”, I mean: is it doing just one operation, like “load date” or “load customer”, or is it doing multiple operations, like “load date and update fact”? It is always a good idea to keep packages atomic. This helps with restartability, besides helping while debugging the ETL. If you think your package is doing multiple operations in one go, split it into multiple packages.

5. Are you using the correct SSIS tasks?

There are tasks that SSIS is good at and tasks that databases are good at. For example, databases are good at JOIN operations, whereas SSIS can connect to an FTP site with ease. Are you using the optimal task? Can your current operation be done in pure T-SQL? If yes, push it to the database.
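
For instance, a Merge Join of two sources inside the data flow can often be replaced by a join in the source query itself, done once by the database engine rather than row by row in the pipeline. A sketch with hypothetical tables:

    SELECT o.OrderId, o.OrderDate, c.CustomerName, o.Amount
    FROM   dbo.Orders    AS o
    JOIN   dbo.Customers AS c ON c.CustomerId = o.CustomerId;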

6. Are you using event handlers?

Event handlers are great if you want to take alternative actions on certain events. For example, if the package fails, an OnError event handler can be used to reset tables or notify somebody.
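
A typical OnError handler contains an Execute SQL Task along these lines, with the parameters mapped to the System::PackageName, System::ErrorCode and System::ErrorDescription variables in the task editor (the error-log table here is hypothetical):

    -- The three ? placeholders are mapped from SSIS system variables
    INSERT INTO dbo.PackageErrorLog (PackageName, ErrorCode, ErrorDescription, LoggedAt)
    VALUES (?, ?, ?, GETDATE());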

7. Have you thought about the data source?

How are you getting data from the data source? Is it the best way? Can you add a layer of abstraction between the data source and your SSIS package? If you are reading from a relational database, can you read from views rather than hard-coded SQL queries? If you are reading from flat files, have you set the data types correctly?
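
A view keeps the contract between the source and the package stable; source-side renames or schema changes are then absorbed in one place instead of inside every package. A minimal sketch with hypothetical objects:

    -- The package selects from this view, never from the base table directly
    CREATE VIEW dbo.vw_CustomerSource
    AS
    SELECT CustomerId, CustomerName, CreatedDate, ModifiedDate
    FROM   dbo.Customers;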

8. Naming convention

Is your package aptly named? Does it do what it says on the tin? Does it convey meaningful information about what the package is doing?

The same rules apply to the variables in the package.

9. SSIS Task Names

Have you renamed the SSIS tasks, and are the names descriptive enough to convey the meaning of the operation each task performs?

10. Documentation/Annotations

Is your package well documented? Does the documentation describe WHY it is doing something rather than just WHAT it is doing? The former is considered good documentation, although in the case of SSIS I find that even the latter is very helpful, because a new person doesn’t have to go through the package to understand what it is doing. SSIS annotations are great for in-package documentation and can be used effectively.

11. Is your package well structured both operationally and visually?

Can you box tasks into a series of sequence containers? Does your package flow nicely, either top-to-bottom or left-to-right? Are there any tasks hidden beneath others? Can a person looking at the package for the first time grasp what’s happening without digging into each task?

12. Are you using an ETL framework?

Having a generalized ETL framework will save a significant amount of time, because many of the repetitive pieces such as logging, event handlers and configuration can be created in one “template” package, and all other packages can be based on this template.

Please leave a comment about what you think and whether there is anything you always keep in mind when developing in SSIS. An older post of mine about things to be aware of while developing SSAS cubes is one of my favorites, and I can see this becoming one too!!


A warning about the SSIS Foreach Loop container when processing files from a folder

The SSIS Foreach Loop container is frequently used to process files from a specific folder. If the names of the files are not known beforehand, the usual approach is to process all the files with a specific extension, e.g. all CSV files or all XLSX files.

Until recently I was under the impression that if you put “*.csv” in the Files textbox of the Collection tab of the Foreach Loop editor, SSIS would look only for CSV files. However, this is not true. It appears that when SSIS looks for specific file types in a folder, the search is pattern based, so if you put *.CSV it will also process files with extensions like .CSV_New or .CSVOriginal. This seems consistent with the documented behaviour of the underlying Windows file-search APIs, where a pattern ending in a three-character extension also matches any extension that merely begins with those three characters.

To test this, I created a folder containing the following files, with variations on the CSV extension.

1. File1.CSV
2. File2.CSV_ARCHIEVED
3. File3.CSVARCHIEVED
4. File4.CSV.ARCHIEVED
5. File5.CSV.ARCHIEVED.201411232359
6. File6.CSV20141129
7. File7.A.CSV
8. File8.csv
9. File9.cSv
10. File10 (no extension)

I then created a simple SSIS package with a Foreach Loop container to iterate over these files, and a Script Task within it to show a message box with the current file name. The Files textbox of the Collection tab of the Foreach Loop editor contained *.CSV. On running the package, the following files were picked up by the container.
1. File1.CSV
2. File2.CSV_ARCHIEVED
3. File3.CSVARCHIEVED
4. File6.CSV20141129
5. File7.A.CSV
6. File8.csv
7. File9.cSv

Not a normal scenario to have files with such varied extensions in the same folder, but who knows. 🙂
If you are not sure which files will be processed by the SSIS Foreach Loop container, the best thing to do is to navigate to the folder in Windows Explorer and type the file extension into the search box. The files returned are the ones SSIS will pick up.

To Truncate/Reload or Not to Truncate/Reload, that is the question.

I have seen many a data warehouse which is completely emptied every night and reloaded again. This is a perfectly fine ETL strategy. Many would argue that those are not data warehouses in the strict sense, but I will leave that debate out of this blog.

The other strategy is, obviously, to do an incremental or merge load, where data is incrementally modified in the data warehouse (I will stick to calling it the Merge strategy). So which one is better? Well, like the answers to most questions in IT, “it depends”. Here are a few counter-questions I ask before deciding on the strategy. They are in no particular order, and certainly not ordered by importance; before making any decision, each one needs to be considered carefully.

How much data are you expecting from the source?

With smaller data sizes, Truncate/Reload generally performs better, plus the ETL tends to be simpler. For larger data sizes it’s difficult to say which one will perform better, but I have found that in the longer run the Merge strategy pays off in terms of performance.

How many data sources are there?

With multiple data sources, I lean towards Merge. I can import raw data from all the sources into one central place (variously called the imports database, staging database, holding database, raw datastore etc.), baseline it and then continue with loading the data warehouse. If for some reason one of the data sources is not available, I can handle that gracefully in the ETL, and when the data does become available I can retrospectively amend it. With Truncate/Reload, it becomes more or less mandatory that all the data sources are available during the ETL and that the data across them is synchronised.

What are the types of the data sources?

If the data source is a set of flat files, doing Truncate/Reload may not be such a good idea. What if a file becomes unavailable during the ETL? Besides, re-reading the full set of files every day doesn’t sound right.

Is the ETL data source a complete data set or a delta load?

Obviously, if the data source is a delta load, then Merge is the only option.
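
A minimal T-SQL sketch of the Merge strategy, applying a delta set to a dimension (table and column names are hypothetical):

    MERGE dbo.DimCustomer AS tgt
    USING staging.CustomerDelta AS src
          ON tgt.CustomerId = src.CustomerId
    WHEN MATCHED AND tgt.CustomerName <> src.CustomerName THEN
        UPDATE SET tgt.CustomerName = src.CustomerName
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (CustomerId, CustomerName)
        VALUES (src.CustomerId, src.CustomerName);
    -- No WHEN NOT MATCHED BY SOURCE clause: the source is a delta, not a full set.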

Do you need to maintain history?

If there are Type II and Type III attributes in the data warehouse and the data sources don’t store audit information, then Merge is the only option. A different take on the same question is, “Would there be a future need to show historical data?”, in which case I use Merge even though the upfront ETL effort is greater.

How complex are the ETL rules?

If the ETL only consists of moving data from the source into a star schema structure to ease reporting, Truncate/Reload can be a suitable option. However, for complex ETL, I prefer Merge.

Does the source have data modification timestamps?

Having these can be very helpful in the Merge strategy.
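
With a modification timestamp, the extract itself can be incremental. A sketch assuming a ModifiedDate column and a watermark saved from the previous run (all names hypothetical):

    DECLARE @LastWatermark datetime = '20141101';  -- stored at the end of the last load

    SELECT CustomerId, CustomerName, ModifiedDate
    FROM   dbo.Customers
    WHERE  ModifiedDate > @LastWatermark;
    -- After a successful load, save MAX(ModifiedDate) as the new watermark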

Would you need to extend the solution?

If there is a chance of adding more data sources, then Merge is the better option. It’s usually difficult to add a new data source if the ETL is designed around the Truncate/Reload strategy.

How long is your ETL operation window?

In most cases this isn’t an issue, but it is still worth considering upfront.

Hope that was helpful. Feel free to let me know any other issues that you might consider before selecting your ETL strategy.

Microsoft Data Explorer for Excel 2010/2013

Chris Webb posted about Data Explorer being available for public preview, and I thought I would give it a quick spin. You can download it here. The installation is pretty straightforward. After the installation, when you open Excel (I am using Excel 2010), you will see a new Data Explorer tab like the one below.

Data Explorer Tab in Excel 2010

Data Explorer can import data from a number of sources, including Facebook (how cool is that!!!). The complete list of sources and how-tos is here. For the purpose of this small exercise, I am using data from Lending Club; it is available to download from their website.
Once the data source is set up and connected, a new query screen shows up, and this is where many of the transformations can be done. I am going to list a few here which may be of use to general users.

1. Change the type of a column
By default, the data type of a column is set to Auto-Detect. You can change it to a specific data type by right-clicking on the column and selecting the type from the Change Type option.

Change Data Type of Column in Data Explorer

2. Transformations
Similar to changing the data type, you can do some simple transformations on the columns. To do this, right-click on the column in the query window and select Transform. The options in the Transform menu depend on the data type of the column; for numeric columns, the option is greyed out.

Data Transform Options in Data Explorer

3. Adding a new column
New columns can be added very easily, and formulas can be used. The complete formula reference is listed on the Office help page.
For example, in the screenshot below I am adding a new column which takes the lower range of the FICO score (a FICO score is like a credit score in the US).

Add a new column in Data Explorer

4. Undo a transform
One of the nice features is the ability to undo a transform (oh!! how I miss an undo button in SSIS). On the left-hand side you will see a Steps section. If you expand it, it lists all the steps you have taken so far to transform the data. Clicking the X against a step will undo the transform you did in that step. If you try to remove a step other than the last one, you get a warning.
Some other handy transforms are Split, Replace Values and Group By, and I think they are self-explanatory.

I did face an error, though, while doing some transformations. The file mentioned above contained 11,428 records. After doing some transformations, I was trying to save the query and get the data into a workbook, at which point I persistently got the following error. Not to mention, I sent a lot of frowns to MS. What worked was using a sample of rows to do the transformations and then importing everything into the workbook.

Error in Data Explorer after saving query

All in all, it’s a nice utility for doing some simple transformations and combining data from different sources in Excel. Soon I am planning to take a look at how it works with Facebook data.