I would like to share two simple tips on date dimension if the date range in your date dimension is large and uncertain.
A brief background
I was recently working on a project where I was faced with two situations.
1. The range of dates in the date dimension was large. Something from 1970’s to 2030 and beyond.
2. The source data can contain dates outside of this range i.e. before 1970 and after 2030 but we did not know for certain the exact range.
So here are the tips.
1. Add only required dates
I could have just populated the date dimension with 200 years worth of dates i.e. 73000 records and be done with it. On closer inspection of source data, however, I found that data in fact table will be sparse for years outside of this range. There would be a few records which would have a date from 1955, a few from 1921 and so on. So why add those extra rows for year 1920 if the only date that is ever going to be used from this year is 01/02/1920. Even for future dates, why bother adding all the dates from 2070 if ,now, the only date I need is, lets say, 23/09/2070.
To avoid fattening the date dimension, I created a static date range i.e. dates which are most often used. For dates outside of these I created a so-called ‘date sync’ mechanism. In a nutshell, all it does is at the end of dimension load and before the beginning of fact load, it goes through all the date fields in source tables (which are in staging or ODS by now) and makes sure that all dates are present in the date dimension. If they are not, it simply created a row for that particular day. It might seem a slow process but since the data is in relational engine by now, it is quite fast. Plus, it always makes sure that the date will always present in the date dimension so ETL won’t fail due to foreign key constraints.
2. Categories dates
So as mentioned before, our date range was wide so slicing and dicing using date hierarchy was painful because the year started from 1920′ till 2030 and beyond. To make browsing a little less problematic, we introduced a year category field. When we asked the business users, they were most interested in data from last 5 years to next 10 years. So we added a derived column which categorized the dates into various buckets like Pre-2008, 2008… and Post-2024. We created an attribute based on this fields in date dimension and our date hierarchy looked like this.
Now, when the users dragged date hierarchy on rows in excel, they would see years before 2008 under Pre-2008 then all the years between 2008 and 2024 (which they were most interested in ) and then Post-2024. Nice and clean.
Hopefully these will be helpful to you in some way or might give you some better idea of handling large date dimension. If you have any suggestions, please feel free to drop me a line.