
To Truncate/Reload or Not to Truncate/Reload, that is the question.

I have seen many a data warehouse that is completely emptied every night and reloaded from scratch. This is a perfectly fine ETL strategy. Many would argue that those are not data warehouses in the strict sense, but I will leave that debate out of this post.

The other strategy, obviously, is an incremental or merge load, where data in the data warehouse is modified in place (I will stick to calling it the Merge strategy). So which one is better? Well, like the answer to most questions in IT, “it depends”. Here are a few counter-questions I ask before deciding on a strategy. They are in no particular order and certainly not ordered by importance; each one needs to be considered carefully before making a decision.

How much data are you expecting from the source?

With smaller data sizes, Truncate/Reload generally performs better, and the ETL tends to be simpler. With larger data sizes it’s difficult to say which one will perform better, but I have found that in the long run the Merge strategy pays off in terms of performance.
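
To make the contrast concrete, here is a minimal sketch of the Truncate/Reload pattern in T-SQL. The dw.FactSales and staging.Sales tables and their columns are hypothetical, not a prescription.

```sql
-- Truncate/Reload: empty the warehouse table, then reload it in full
-- from staging. Wrapped in a transaction so a failed reload doesn't
-- leave the table empty (TRUNCATE is transactional in SQL Server).
BEGIN TRANSACTION;

TRUNCATE TABLE dw.FactSales;

INSERT INTO dw.FactSales (SaleId, ProductKey, SaleDate, Amount)
SELECT SaleId, ProductKey, SaleDate, Amount
FROM   staging.Sales;

COMMIT TRANSACTION;
```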

How many data sources are there?

With multiple data sources, I lean towards Merge. I can import raw data from all the sources into one central place (variously called an imports database, staging database, holding database, raw datastore etc.), baseline it, and then continue loading the data warehouse. If for some reason one of the data sources is not available, I can handle that gracefully in the ETL, and when the data does become available I can retrospectively amend it. With Truncate/Reload, it effectively becomes mandatory that all the data sources are available during the ETL run and that the data across them is synchronised.
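
One way to make missing sources manageable, sketched here under assumed names (staging.SourceLoadLog is an invented control table), is to record what arrived in the staging database on each run:

```sql
-- Record each source's arrival in a control table so the ETL can
-- proceed with whatever is available and amend the rest later.
DECLARE @BatchDate date = CAST(GETDATE() AS date);

INSERT INTO staging.SourceLoadLog (SourceName, BatchDate, Status)
VALUES ('CRM', @BatchDate, 'Loaded'),
       ('ERP', @BatchDate, 'Missing');  -- merged in retrospectively
```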

What are the types of the data sources?

If the data source is a flat file, Truncate/Reload may not be such a good idea. What if the file becomes unavailable during the ETL run? Reloading the entire warehouse from a set of files every day also doesn’t sound right.

Is ETL data source a complete data set or delta load?

Obviously, if the data source is a delta load, then Merge is the only option.
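
As a rough illustration (table and column names are assumptions), a delta feed maps naturally onto T-SQL’s MERGE statement:

```sql
-- Apply a delta feed: update rows that already exist in the
-- warehouse, insert the ones that don't.
MERGE dw.FactSales AS tgt
USING staging.SalesDelta AS src
    ON tgt.SaleId = src.SaleId
WHEN MATCHED THEN
    UPDATE SET tgt.ProductKey = src.ProductKey,
               tgt.SaleDate   = src.SaleDate,
               tgt.Amount     = src.Amount
WHEN NOT MATCHED BY TARGET THEN
    INSERT (SaleId, ProductKey, SaleDate, Amount)
    VALUES (src.SaleId, src.ProductKey, src.SaleDate, src.Amount);
```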

Do you need to maintain history?

If there are Type II or Type III attributes in the data warehouse and the data sources don’t store audit information, then Merge is the only option. A different take on the same question is, ‘Will there be a future need to show historical data?’ If so, I use Merge even though the upfront ETL effort is greater.
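
For Type II attributes specifically, one common Merge-style pattern (all names here are hypothetical) is to expire the current row and insert a new version:

```sql
-- Maintain Type II history: close off the current row when a tracked
-- attribute changes, then insert the new version.
DECLARE @LoadDate date = CAST(GETDATE() AS date);

UPDATE d
SET    d.IsCurrent = 0,
       d.ValidTo   = @LoadDate
FROM   dw.DimCustomer AS d
JOIN   staging.Customer AS s ON s.CustomerId = d.CustomerId
WHERE  d.IsCurrent = 1
  AND (s.City <> d.City OR s.Segment <> d.Segment);  -- Type II columns

INSERT INTO dw.DimCustomer (CustomerId, City, Segment, ValidFrom, ValidTo, IsCurrent)
SELECT s.CustomerId, s.City, s.Segment, @LoadDate, NULL, 1
FROM   staging.Customer AS s
LEFT JOIN dw.DimCustomer AS d
       ON d.CustomerId = s.CustomerId AND d.IsCurrent = 1
WHERE  d.CustomerId IS NULL;  -- new customers and just-expired ones
```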

How complex are the ETL rules?

If the ETL only consists of moving data from the source into a star schema structure to ease reporting, Truncate/Reload can be a suitable option. However, for complex ETL, I prefer Merge.

Does the source have data modification timestamps?

Having these can be very helpful in a Merge strategy, since they let you identify and extract only the rows that have changed since the last load.
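
A typical use, sketched with an assumed etl.Watermark control table and a ModifiedDate column on the source:

```sql
-- Extract only the rows changed since the last successful load.
DECLARE @LastLoad datetime =
    (SELECT LastModified FROM etl.Watermark WHERE TableName = 'Sales');

SELECT SaleId, ProductKey, SaleDate, Amount, ModifiedDate
FROM   src.Sales
WHERE  ModifiedDate > @LastLoad;

-- After a successful load, advance the watermark. Using the maximum
-- extracted timestamp (rather than the current time) avoids missing
-- rows modified while the extract was running.
UPDATE etl.Watermark
SET    LastModified = (SELECT MAX(ModifiedDate)
                       FROM   src.Sales
                       WHERE  ModifiedDate > @LastLoad)
WHERE  TableName = 'Sales';
```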

Would you need to extend the solution?

If there is a chance of adding more data sources, Merge is the better option. It’s usually difficult to add a new data source when the ETL is designed around a Truncate/Reload strategy.

How long is your ETL operation window?

In most cases this isn’t an issue, but it is still worth considering upfront.

I hope that was helpful. Feel free to let me know about any other questions you consider before selecting an ETL strategy.

Building data model for SSAS Tabular

I have started preparing for the Microsoft certifications on SQL Server 2012, and I am spending a fair amount of time with SSAS Tabular since it’s the newest addition to the MS BI stack and the one I am least accustomed to. As I work with it more and more, I am starting to think about how you would go about designing the database on which the tabular model is based. In the multidimensional world, this had an easy answer: you would almost always design the DW as a star/snowflake schema, or at least expose it as one through SQL Server views or a Data Source View (DSV). In SSAS Tabular, it appears to me that this is not really necessary.

Do I really need to segregate data into facts and dimensions?
In SSAS Tabular, I can create a measure out of any column in any table. It doesn’t have to come from a fact table, so why separate facts from dimensions? Assume I have a ‘House’ dimension with ‘NumberOfRooms’ as an attribute. In SSAS Multidimensional, if users wanted to slice and dice ‘NumberOfRooms’ by other attributes of the House dimension, or by any other dimension, you would somehow have to surface it in a fact table. SSAS Tabular will happily let you create a measure out of this column even though it sits in a dimension table. In theory, you can point the tabular model at the normalized OLTP database and still give users the ability to slice and dice the data just as in SSAS Multidimensional.
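
To make the House example concrete: in Tabular the measure is just a sum over the dimension column, and the relational equivalent is an ordinary GROUP BY. The dbo.House table and its columns below are made up for illustration.

```sql
-- 'NumberOfRooms' lives on the House dimension, yet it aggregates
-- like a fact: slice total rooms by another attribute of the same
-- dimension.
SELECT h.City,
       SUM(h.NumberOfRooms) AS TotalRooms
FROM   dbo.House AS h
GROUP BY h.City;
```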

Normalized Or Denormalized, that is the question..
There would be obvious issues with building the tabular model on a normalized schema. One, since the database is normalized, the sheer number of tables will be very large. Two, maintaining relationships between the tables would be painful. Three, the model itself can be confusing for end users. On the other hand, building a fully denormalized schema when you don’t really need it adds complexity to the project, especially in the ETL. I don’t think I need to reiterate that the complex part of a DW project is the ETL, and around 60-70% of the effort is spent on this phase. So it appears we need some middle ground.

Enter Entity Based design..
I have always perceived the DW as a collection of entities, like product, sales and employee, which have relationships with each other. The star schema is similar, but the dimensions don’t have relationships with each other directly; they are related only through the fact table. Plus, the dimensions are denormalized, so as much information as possible is crammed into them. I guess what I am trying to say is that we don’t need a fully denormalized structure, and at the same time we don’t want the data warehouse in 3NF. We need a middle ground that is just denormalized enough. In my opinion, the way to get there is to think of the DW in terms of entities and how they interact with each other, rather than classifying the data into dimensions and facts.
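
As a rough sketch of what I mean (all names invented), entities can carry moderate denormalization and relate to each other directly instead of only through a fact table:

```sql
-- Entities are moderately denormalized (Store carries its geography
-- inline) and may reference each other directly.
CREATE TABLE dw.Store (
    StoreId  int PRIMARY KEY,
    City     nvarchar(50),   -- denormalized: no snowflaked geography
    Region   nvarchar(50)
);

CREATE TABLE dw.Employee (
    EmployeeId int PRIMARY KEY,
    Name       nvarchar(100),
    StoreId    int REFERENCES dw.Store (StoreId)  -- direct entity link
);

CREATE TABLE dw.Sales (
    SaleId     int PRIMARY KEY,
    EmployeeId int REFERENCES dw.Employee (EmployeeId),
    Amount     money
);
```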

Of course, these are all preliminary thoughts and very much open to challenge. I guess we will have to wait for widespread adoption of the tabular model and learn from the people who have done it.

Changing workspace database name in SSAS 2012 Tabular

A word of caution: before you try any of the steps in this entry, I would like to point out that this is not a recommended practice. Moreover, I do not take any responsibility if it causes issues in your project.

A tabular model workspace database is created automatically when you create a new SQL Server Tabular Project in SQL Server Data Tools (SSDT). By default it is created on the local SSAS Tabular server, but you can choose a different server as well. The workspace database name consists of the project name, user name and a GUID, and frankly it looks like gibberish. For example, look at the screenshot of SSMS on my test virtual machine. Yikes!! Now imagine this on a server with 10 concurrent users, each with 10 projects, and you get the idea.

So I was wondering if I could somehow rename them, you know, to something prettier. And as it turns out, I can. Here are the steps.

1. Start SSDT (obviously!!) and create a new Analysis Services Tabular Project. At this point I would also recommend renaming the default ‘Model.bim’ to something more meaningful to your project.

2. Right-click the .bim file and click Properties.

3. Among the other properties, you will find the Workspace Database property in the bottom portion. The description of the properties is here. If you open SSMS and connect to the SSAS Tabular instance, you should find a database with the same name as this property’s value. As you will notice, Workspace Database is read-only, and that’s the one we want to change.

4. Now right-click the project, click Open Folder in Windows Explorer, and close SSDT.

5. In Windows Explorer, there will be a *.settings file. The naming format appears to be ModelName.bim_username.settings. It might be hidden, so change the folder settings to show hidden files.

6. Open this file in Notepad. It is XML, although everything will appear on one line. I would also recommend that you don’t try to reformat the file.

7. Now look for the <DatabaseName></DatabaseName> tag. The value inside this tag is the name of the workspace database. Change it to something more meaningful, then save and close Notepad (see the snippet after these steps).

8. Open the project/solution in SSDT and check the properties of the .bim file. The value of the Workspace Database property will now be what you set in step 7.

9. You can also open SSMS and verify this.
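
For reference, the element to edit in step 7 looks something like this; the database name shown is just a made-up example.

```xml
<!-- Fragment of ModelName.bim_username.settings; surrounding
     elements omitted. Change only the value inside the tag. -->
<DatabaseName>SalesTabularModel_Workspace</DatabaseName>
```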

Neat, isn’t it??