Month: September 2013

Select distinct values from multiple columns in same table

Selecting a distinct value from a column is no rocket science, but what if you want distinct values from two, three or more columns from a table. I don’t want a distinct combination of values from multiple columns (which will what DISTINCT col1, col2,col3 etc. would do), I want distinct values from first column, then the second, then the third and so on.
The simplest way would be to do SELECT DISTINCT on each column and then UNION them all. But I was looking for something a bit more concise; my table apparently had ten date columns wherein I wanted to find all the unique dates. Enter PIVOT/UNPIVOT twins. If you think about it all we need to do is UNPIVOT the columns we are interested in and then select DISTINCT on them. Simples!. Of course, you would have to cast them to same data type; but I suppose that’s fairly straightforward. So here is an example.

Consider a table like this

Order Number Order Date Payment Date Shipped Date Feedback Date
A123321 01/01/2013 02/01/2013 03/01/2013 10/01/2013
B890098 01/02/2013 06/02/2013 08/02/2013 11/02/2013
C678876 01/01/2013 08/02/2013 11/02/2013 03/03/2013
D342243 07/09/2013 09/09/1023 10/09/2013 11/09/2013

Lets say we want to find all the unique dates in this table. I repeat, we don’t want to find combination of unique dates, which is what DISTINCT would do, we want to find unique dates in all the columns.  As mentioned before, we can do it using DISTINCT & UNION as follows

SELECT DISTINCT [Order Date] FROM [dbo].[Order]


UNION

SELECT DISTINCT [Payment Date] FROM [dbo].[Order]

UNION

and so on….

This can be done concisely using UNPIVOT as follows.

SELECT DISTINCT Dates from
(SELECT [Order Date]
,[Payment Date]
,[Shipped Date]
,[Feedback Date]
FROM [MyDb].[dbo].[Order] ) p
UNPIVOT (Dates FOR UniqueDates IN
([Order Date]
,[Payment Date]
,[Shipped Date]
,[Feedback Date])
) AS unpvt

If the columns had different data types, you would have to cast them but the approach essentially remains same. Hope that was helpful and comments are always welcome.

Advertisements

Why I don’t buy the Big Data “Red Hat of Hadoop” Story

Great read!!

Emergent Business Networks

It’s official. As of 19 August 2013, Big Data has officially past the “peak of inflated expectations” and is hurtling down the rollercoaster ride to the “trough of disillusionment”. Hold on tight boys and girls, this will be a white knuckle ride. This is the point when the rollercoaster reaches the top and you sense the change and then hear the screams as you hurtle down.

Its official because Gartner, who coined the hype cycle, declared it so. They maybe right or wrong, but it’s certainly official (and official does impact reality in Enterprise-Land).

Its not your first time on a rollercoaster, so you knew the top was coming, right? Plenty of smart folks have been predicting this. Robin Bloor for example was articulate, analytical and ahead of the curve on this in 2012.

The problem with the Big Data hype phase was that most of the energy was used…

View original post 805 more words

To Truncate/Reload or Not to Truncate/Reload, that is the question.

I have seen many a data warehouses which are completely emptied very night and reloaded again. This is a perfectly fine ETL strategy. Many would argue that those are not data warehouses in strict sense but I will leave it  out from this blog.

The other strategy is, obviously, to do incremental load or merge load where data is incrementally modified in the data warehouse (I will stick to calling it Merge strategy). So which one is better? Well, like answers to most questions in IT, “it depends”. Here are few counter questions I ask before deciding on the strategy. They are in no particular order and certainly not ordered by importance. Before making any decision, each one needs to be considered carefully.

How much data are you expecting from the source?

With smaller data sizes, Truncate/Reload generally performs better, plus ETL tends to be simpler. For larger data sizes: it’s difficult to say which one will perform better but I have found that in the longer run Merge starategy pays off in terms of performance.

How many data sources are there?

With multiple data sources, I incline towards Merge. I can import raw data from all the sources in one central place (variedly called imports database, staging database, holding database, raw_datastore etc.), baseline it and then continue with data loading to data warehouse. If for some reason one of the data sources is not available, I can handle that gracefully in ETL and when the data does becomes available I can retrospectively amend it. With Truncate/Reload, it kind of become manadatory that all the data sources are available during ETL and data across them is synchronised.

What are the types of the data sources?

If the data source is in the form of flat file, doing Truncate/Reload may not be such a good idea. What if the file becomes unavailable during ETL? Also reading from set of files everyday, doesn’t sound right.

Is ETL data source a complete data set or delta load?

Obviously, if the data source is delta load, then Merge is only option.

Do you need to maintain history?

If there are Type II and Type III attributes in data warehouse and data sources doesn’t store audit information, then Merge is the only option. A different take on the same question is, ‘Would there be a future need to show historical data?’ ,in which case, I use Merge even though upfront ETL efforts are more.

How complex are the ETL rules?

If ETL only consists of moving data from source into  star schema structure to ease reporting, Truncate/Reload can be a suitable options. However, for complex ETL, I prefer Merge .

Does the source have data modification timestamps?

Having these can be very helful in Merge strategy.

Would you need to extend the solution?

If there are chances of adding more data sources, the Merge is better option. It’s usually difficult to add new data source if ETL is designed around Truncate/Reload strategy.

How much is your ETL operation window?

In most cases, this isn’t an issue but still worthwhile considering upfront.

Hope that was helpful. Feel free to let me know any other issues that you might consider before selecting your ETL strategy.