
Changing role of IT in the BI world

BI is Dead, Long live BI

Timo Elliott (blog|twitter) recently published a blog post, BI is Dead, which draws on Gartner’s Magic Quadrant report and a more detailed accompanying report.

The main takeaways from the post (which includes references to the Gartner research, so not all of them are Timo’s own points) include:

  •  Self-service analytics tools have matured to the point where IT involvement in BI and analytics projects is considered optional. IT no longer needs to model the data upfront (which required gathering all, or at least some, of the analytics requirements before starting); analysts can prototype and test it themselves.
  •  The balance of power in BI and analytics platform buying decisions has gradually shifted from IT to the business.

  • BI and analytics tools that do require intervention from IT are no longer considered BI; they are enterprise-reporting-based platforms. Admittedly, these still take the largest share of the BI market. In other words, Gartner has updated the definition of BI.

  • The bold headline, “BI is Dead”, is meant to draw attention to the changing landscape of BI tools.

  •  Organizations that do not embrace the new definition of BI run the risk of turning into BI-nosaurs.

  •  Having a single view of data through an EDW is pointless, or at least extremely hard. This and this from Curt Monash also support this line of thinking.

  •  Most organizations believe that IT has a role to play in BI, although the majority want the responsibility for authoring content to shift to end users. This has always been the holy grail of BI.

So is it really dead?

Well, not necessarily. It’s dead in the sense that PCs as we knew them in the 1980s are dead. PCs have clearly evolved: they don’t look like they used to, and many aspects of the PC have been democratized. Trivial things which needed a programmer in those days can now be done by users themselves. Heck, you can even upgrade the hardware yourself if you are into that sort of thing.

I think that’s exactly what is happening in the BI and analytics world. Self-service analytics tools have evolved to the point that many of the trivial data munging tasks can be done by the analysts themselves, and what they can do with these new tools looks nothing like how they used to work. Does that mean BI is dead? No: the fundamental analytics is still the same. It has evolved just like the PC, and the role of IT has changed with it.

Role of IT in the new BI world

The changing role of IT in BI and analytics is best described by the picture below.

Changing role of IT in the BI world

IT is now viewed (and rightly so) as a data facilitator. It makes data available to the analysts in a palatable form. The responsibility for modelling the data, authoring presentation components and doing the analysis lies with the analysts.

Does that mean the job of IT got easier?

Most definitely not; on the contrary, it has got even more complicated. With the influx of new tools, the demand for data has increased dramatically. Analysts want to analyse all sorts of data, from all sorts of unlikely sources, with all sorts of analytical methods. They expect IT to facilitate this process, which is where IT’s job has got a lot harder. We have to deal with ever-increasing data sizes from ever-expanding sources, and support ever-changing analytical tools.

What else is keeping BI alive?

If we assume the following definition of BI, then here are more reasons why BI is definitely not dead in its current incarnation.

BI: an umbrella term to describe “concepts and methods to improve business decision making by using fact-based support systems”.

1. Not everybody is a data scientist

Combining data from multiple sources, creating a data model and authoring reports from it all need a special skill set. Many a time, analysts just want Excel connected to an OLAP cube to do their analysis. Traditional BI has a place in this space.

2. Robust, production-ready solutions

The analysis done in an analyst’s R code or IPython notebook is not usually production ready. BI will very much be needed when it has to be productionised.

3. If ETL is part of BI then it’s here to stay

How else would you provide quality data to analysts? And if the definition of BI includes methods to improve the process of decision making, then we need BI.

4. Operational reporting is a fact of life

As much as we harp on about ad-hoc analysis, data science and self-service BI, plain old operational reporting is a big part of any organization. For that we need classic BI.


Creating a new SSIS package? Have you thought about these things?

Creating a package in SSIS is easy, but creating a “good” SSIS package is a different story. As developers, we tend to jump right into building that wonderfully simple package and often overlook the nitty-gritty. Being an avid developer myself, I must confess that I have fallen prey to this from time to time. When I find myself in a rush to create “that” package, I take a step back and ask myself, “have you thought about these things?”. The list below is not comprehensive, but it is a good starting point, and I will hopefully keep it updated as I come across more issues. Please note that this is not a list of SSIS best practices or SSIS optimization tips; these are more high-level things which I keep at the back of my mind whenever I create a new SSIS package.

1. SSIS Configuration
Configuration has to be the number one thing to think about when you start creating any SSIS package. It governs how the package will run in multiple environments, so it is absolutely necessary to pay particular attention to it. Some of the questions to ask yourself are listed below; a sketch of a SQL Server-based configuration table follows the list.

  • Where is the configuration stored? Is it an XML (.dtsConfig) file or a SQL Server table?
  • How easy is it to change configured values?
  • What values are you going to store in the configuration? Is it just connection managers, or should you be storing some variable values as well?
  • What happens if the package does not find a particular configured item? Will it fail? Will it do something it should not be doing?
  • How are you storing passwords, if there are any?
  • and so on..
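
As a rough illustration, this is what a SQL Server-based configuration store can look like. SSIS generates a similar table when you choose the SQL Server configuration type; the table name, column sizes and the example row below are assumptions for illustration only.

-- Sketch of a SQL Server-based SSIS configuration table; names and sizes are illustrative.
CREATE TABLE dbo.[SSIS Configurations]
(
    ConfigurationFilter NVARCHAR(255) NOT NULL, -- groups settings, e.g. per package or environment
    ConfiguredValue     NVARCHAR(255) NULL,     -- the value to apply, e.g. a connection string
    PackagePath         NVARCHAR(255) NOT NULL, -- property path inside the package being configured
    ConfiguredValueType NVARCHAR(20)  NOT NULL  -- data type of the configured property
);

-- Example row: point a connection manager at the test server.
INSERT INTO dbo.[SSIS Configurations]
    (ConfigurationFilter, ConfiguredValue, PackagePath, ConfiguredValueType)
VALUES
    (N'LoadCustomer_Test',
     N'Data Source=TestServer;Initial Catalog=Staging;Integrated Security=SSPI;',
     N'\Package.Connections[Staging].Properties[ConnectionString]',
     N'String');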

2. SSIS Logging

There is a certain minimum of information every SSIS package should log. It is not only good practice, it also makes life much easier when a package fails in production and you want to know where and why it failed. As a rule of thumb, I think the following should be logged:

  • The start and end of the package
  • The start and end of each task
  • Any errors on tasks. You should log as much information as possible about the error, such as the error message, variable values at the time of the error, server names, the files being processed, and so on.
  • Row counts in the data flow

Which log provider to use is entirely up to you, although I tend to log to a SQL Server database because it is easier to query that way.
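
For example, when the built-in SQL Server log provider is used, events land in the dbo.sysssislog table (dbo.sysdtslog90 on SQL Server 2005), and a query along these lines pulls out the errors from the latest run of a package. The package name is a made-up example.

-- Errors from the most recent run of a package, read from the built-in
-- SQL Server log provider table; 'LoadCustomer' is an illustrative package name.
SELECT source, starttime, message
FROM   dbo.sysssislog
WHERE  event = 'OnError'
  AND  executionid = (SELECT TOP (1) executionid
                      FROM   dbo.sysssislog
                      WHERE  source = 'LoadCustomer'
                      ORDER BY starttime DESC)
ORDER BY starttime;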

3. Package restartability

Can you re-run the package as many times as you want? What if the package fails mid-operation? Will it restart from where it failed? If it starts from the beginning, what will it do?
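
SSIS checkpoints are one way to achieve this. Another common pattern, sketched below with made-up names, is a small load-control table that the package consults at the start of each step so that steps which already succeeded today are skipped.

-- Sketch of a load-control table for restartability; all names are illustrative.
CREATE TABLE dbo.etl_step_log
(
    run_date     DATE        NOT NULL,
    package_name SYSNAME     NOT NULL,
    step_name    SYSNAME     NOT NULL,
    status       VARCHAR(10) NOT NULL, -- 'Started' / 'Succeeded' / 'Failed'
    logged_at    DATETIME2   NOT NULL DEFAULT SYSDATETIME(),
    CONSTRAINT pk_etl_step_log PRIMARY KEY (run_date, package_name, step_name)
);

-- At the start of a step, the package checks whether the step already succeeded today;
-- if this returns a row, the step is skipped and the package carries on from the next one.
SELECT 1
FROM   dbo.etl_step_log
WHERE  run_date     = CAST(GETDATE() AS DATE)
  AND  package_name = 'LoadCustomer'
  AND  step_name    = 'StageCustomer'
  AND  status       = 'Succeeded';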

4. Is your package atomic?

By “atomic” I mean: is the package doing just one operation, like “load date” or “load customer”, or is it doing multiple operations, like “load date and update fact”? It is always a good idea to keep packages atomic. This helps with restartability as well as with debugging the ETL. If your package is doing multiple operations in one go, split it into multiple packages.

5. Are you using the correct SSIS tasks?

There are tasks SSIS is good at and there are tasks databases are good at. For example, databases are good at JOIN operations, whereas SSIS can connect to an FTP site with ease. Are you using the optimal task? Can your current operation be done in pure T-SQL? If so, push it to the database.
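
As a small illustration (table names are made up), a lookup that might otherwise be built as a Merge Join or Lookup transformation in the data flow can often be pushed into the source query so the database does the join:

-- Let the database do the join and hand the data flow an already-joined result set;
-- table and column names are illustrative.
SELECT o.OrderID,
       o.OrderDate,
       o.CustomerID,
       c.CustomerName,
       o.Amount
FROM   dbo.Orders    AS o
JOIN   dbo.Customers AS c
       ON c.CustomerID = o.CustomerID
WHERE  o.OrderDate >= DATEADD(DAY, -1, CAST(GETDATE() AS DATE));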

6. Are you using event handlers?

Event handlers are great if you want to take alternative actions on certain events. For example, if the package fails, the OnError event handler can be used to reset tables or notify somebody.

7. Have you thought about the data source?

How are you getting data from the data source? Is it the best way? Can you add a layer of abstraction between the data source and your SSIS package? If you are reading from a relational database, can you create views on it rather than hard-coding SQL queries in the package? If you are reading from flat files, have you set the data types correctly?
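
For instance, a view like the one below (object names are illustrative) gives the package one stable object to read from; if the source schema changes, only the view needs to be altered rather than every package that reads from it.

-- A view as an abstraction layer between the source schema and the SSIS package;
-- object names are illustrative.
CREATE VIEW dbo.vw_extract_customer
AS
SELECT CustomerID,
       CustomerName,
       City,
       ModifiedDate
FROM   dbo.Customers;

The OLE DB source in the package then simply selects from dbo.vw_extract_customer.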

8. Naming convention

Is your package aptly named? Does it do what it says on the tin? Does it convey meaningful information about what the package is doing?

The same rules also apply to the variables in the package.

9. SSIS Task Names

Have you renamed the SSIS tasks, and are the names descriptive enough to convey the operation each task performs?

10. Documentation/Annotations

Is your package well documented? Does the documentation describe WHY it is doing something rather than just WHAT it is doing? The former is usually considered good documentation, although in the case of SSIS I find that even the latter is very helpful, because a new person doesn’t have to dig through the package to understand what it does. SSIS annotations are great for in-package documentation and can be used effectively.

11. Is your package well structured both operationally and visually?

Can you box tasks into a series of sequence containers? Does your package read as a nice flow, either top-to-bottom or left-to-right? Are any tasks hidden beneath others? Can a person looking at the package for the first time grasp what’s happening without digging into each task?

12. Are you using an ETL framework?

Having a generalized ETL framework saves a significant amount of time because many of the repetitive pieces, such as logging, event handlers and configuration, can be set up in one “template” package on which all other packages are based.

Please leave a comment about what you think and whether there is anything you always keep in mind when developing in SSIS. An older post of mine about things to be aware of while developing SSAS cubes is one of my favorites, and I can see this becoming one too!

Web scraping in R using rvest

I am not very conversant with web scraping, but I understand the importance of the technique given that a lot of very useful data is embedded in HTML pages. Hence I was very excited when I came across this blog post on the RStudio site which introduced a new package called rvest for web scraping. The GitHub repository of the package is here.

As an exercise in scraping web pages, I set out to get all the Exchange Traded Funds (ETF) data from the London Stock Exchange website.

First things first: load up the rvest package and set the base URL and a download location where the HTML will be saved. You can do this without downloading the file, but there were some proxy settings in the environment I was working in which prevented me from doing so. So I opted to download the HTML, process it and then delete it.

library("rvest")
url <- "http://www.londonstockexchange.com/exchange/prices-and-markets/ETFs/ETFs.html"
download_folder <- "C:/R/"
etf_table <- data.frame()

The next thing to determine is how many pages there are in the ETF table. If you visit the URL, you will find that just above the table where the ETFs are displayed there is a string which tells us how many pages there are; it was 38 when I was writing the script. If you look at the source HTML, this string appears in a paragraph tag whose class is floatsx.

Time to call html_node to get the paragraph with class floatsx and then run html_text to get the actual string. Then it’s a matter of taking a substring of the complete string to get the number of pages.

#find how many pages are there
download.file(url,paste(download_folder,"ETFs.html",sep=""))
html <- html(paste(download_folder,"ETFs.html",sep=""))
pages <- html_text(html_node(html,"p.floatsx"))
pages <- as.numeric(substr(pages,nchar(pages)-1,nchar(pages)))

Now that we know how many pages there are, we want to iterate over each page and get the ETF values from the table. Again we load up the HTML and call html_nodes, but this time we are looking for all the tables. On each page there is just one table which displays the ETF rates, so we are only interested in the first one.

#for each page
for (p in 1:pages) {
 cur_url <- paste(url,"?&page=",p,sep="")
 #download the file
 download.file(cur_url,paste(download_folder,p,".html",sep=""))
 #create html object
 html <- html(paste(download_folder,p,".html",sep=""))
 #look for tables on the page and get the first one
 table <- html_table(html_nodes(html,"table")[[1]])
 #only first 6 columns contain information that we need
 table <- table[1:6]
 #stick a timestamp at end
 table["Timestamp"] <- Sys.time()
 #add into the final results table
 etf_table <- rbind(etf_table,table)
 #remove the originally downloaded file
 file.remove(paste(download_folder,p,".html",sep=""))

}

#summary of the final results table
summary(etf_table)

As you can see, rvest makes scraping web data extremely simple, so give it a try. The R Markdown file and knitted HTML are available at the GitHub link below if you want to run it in your own environment.
Github link

Faster SSAS Processing by using Shared Memory Protocol

I came across this very handy tip to increase SSAS processing speed in the SQL Server 2008 R2 Analysis Services Operations Guide and thought it was worth sharing.
The guide recommends using the shared memory protocol when getting data from the relational data source if it is SQL Server and both SSAS and SQL Server are on the same physical box. Exchanging data over shared memory is much faster than over TCP/IP, as you probably already know. You need to perform two steps to force the SSAS data source to use shared memory when querying the underlying SQL Server.
1. Check that the shared memory protocol is enabled in SQL Server Configuration Manager.
2. Prefix the server name in the SSAS data source connection string with lpc: as shown below.

Provider=SQLNCLI10.1;Data Source=lpc:ThisServer\INST1;Integrated Security=SSPI;Initial Catalog=WordMatch

The guide claims to have achieved 112,000 rows per second using TCP/IP versus 180,000 rows per second using shared memory, which is impressive. In my own test, a slightly modified Adventure Works cube took 56 seconds to process over TCP/IP and 47 seconds over shared memory: an improvement of about 16%.

99 Problems in R

In my Introduction to R post, I introduced R and provided some resources for learning it. I am learning R myself and finding the learning curve a bit steep. Anyway, the best way to learn a new programming language is to practice as much as possible. So, inspired by the 99 Problems sets in various languages (links below), I am creating a ’99 Problems in R’ set. The project is on GitHub. I am new to GitHub, but I am finding it easier to share code through GitHub than here on the blog. Hopefully I will make more use of GitHub in future.
The files are in .Rmd format, which can be opened in RStudio. I have also added the knitted HTML files. The git repo is here:
https://github.com/saysmymind/99-Problems-R
Be warned that the solutions are written by me, an amateur R programmer, so there may be better ways of solving some of them. I will try to solve more problems and keep adding them to the repo. In the meantime, feel free to send a pull request and peek at the code.

I wish I could say ‘I got 99 problems but R ain’t one’, but alas I am not there yet. 🙂
99 Haskell Problems 

99 Python Problems

99 Prolog Problems

99 LISP Problems

99 Perl 6 Problems

99 OCaml Problems

Calculated distinct count measures in SSAS

Distinct count measures are a fact of life for an SSAS developer. No matter how much we try to avoid them, they are always asked for and we have to deal with them somehow. Another painful fact, as you might already know, is that we cannot create multiple distinct count measures in the same measure group. So each distinct count measure has to sit in its own measure group, which, IMO, does not look right when browsing the cube. In this post, I want to show a method of creating distinct count calculated measures which I found on Richard Lees’s blog here, with a slight modification.
http://richardlees.blogspot.co.uk/2008/10/alternative-to-physical-distinct-count.html

Using Adventure Works, let’s say the end users want to know the distinct count of customers who have placed orders on the internet. I can add a calculation like this to the cube:

CREATE MEMBER CURRENTCUBE.[Measures].[Unique Customers]
AS COUNT(NONEMPTY([Customer].[Customer].[Customer].members,
[Measures].[Internet Order Count])),
VISIBLE = 1 , ASSOCIATED_MEASURE_GROUP = 'Internet Orders' ;

This is all fine and dandy; however, as soon as I added any attribute from the Customer dimension to the rows or filters, the results showed incorrect values, i.e. the same count of customers was repeated for every record.

The solution is to count the number of unique customers in the context of the current attributes of the Customer dimension. For example’s sake, let’s take the City attribute. I tweaked the calculation as below to count customers only in the context of the current member of the City attribute, and it started working as expected when the City attribute is used in rows, columns or filters.

CREATE MEMBER CURRENTCUBE.[Measures].[Unique Customers]
AS COUNT(NONEMPTY(
CROSSJOIN(
[Customer].[Customer].[Customer].members,
[Customer].[City].CurrentMember
),[Measures].[Internet Order Count])),
VISIBLE = 1 , ASSOCIATED_MEASURE_GROUP = 'Internet Orders' ;

Of course, you will have to add all the dimension attributes to the CROSSJOIN, but ultimately a calculated, though complex, distinct count measure is better than a number of physical distinct count measures, IMHO.

 

To Truncate/Reload or Not to Truncate/Reload, that is the question.

I have seen many a data warehouse which is completely emptied every night and reloaded again. This is a perfectly fine ETL strategy. Many would argue that those are not data warehouses in the strict sense, but I will leave that debate out of this blog.

The other strategy is, obviously, to do an incremental or merge load, where data is incrementally modified in the data warehouse (I will stick to calling it the Merge strategy). So which one is better? Well, like the answer to most questions in IT: “it depends”. Here are a few counter-questions I ask before deciding on the strategy. They are in no particular order and certainly not ordered by importance; before making any decision, each one needs to be considered carefully.
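
For context, here is a minimal sketch of what the Merge strategy boils down to in T-SQL; the staging and dimension table names are made up. With Truncate/Reload, by contrast, the same load is simply a TRUNCATE TABLE followed by an INSERT ... SELECT from the source.

-- Minimal sketch of the Merge strategy; table and column names are illustrative.
MERGE dbo.DimCustomer AS tgt
USING staging.Customer AS src
    ON tgt.CustomerID = src.CustomerID
WHEN MATCHED AND (tgt.CustomerName <> src.CustomerName OR tgt.City <> src.City) THEN
    UPDATE SET tgt.CustomerName = src.CustomerName,
               tgt.City         = src.City,
               tgt.UpdatedDate  = GETDATE()
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerID, CustomerName, City, UpdatedDate)
    VALUES (src.CustomerID, src.CustomerName, src.City, GETDATE());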

How much data are you expecting from the source?

With smaller data sizes, Truncate/Reload generally performs better, and the ETL tends to be simpler. For larger data sizes it is difficult to say which one will perform better, but I have found that in the longer run the Merge strategy pays off in terms of performance.

How many data sources are there?

With multiple data sources, I lean towards Merge. I can import raw data from all the sources into one central place (variously called the imports database, staging database, holding database, raw datastore, etc.), baseline it and then continue loading the data warehouse. If for some reason one of the data sources is not available, I can handle that gracefully in the ETL, and when the data does become available I can retrospectively amend it. With Truncate/Reload, it more or less becomes mandatory that all the data sources are available during the ETL and that the data across them is synchronised.

What are the types of the data sources?

If the data source is in the form of flat files, doing Truncate/Reload may not be such a good idea. What if a file becomes unavailable during the ETL? Also, re-reading the full set of files every day doesn’t feel right.

Is the ETL data source a complete data set or a delta load?

Obviously, if the data source is a delta load, then Merge is the only option.

Do you need to maintain history?

If there are Type II or Type III attributes in the data warehouse and the data sources don’t store audit information, then Merge is the only option. A different take on the same question is, ‘Will there be a future need to show historical data?’, in which case I use Merge even though the upfront ETL effort is greater.

How complex are the ETL rules?

If the ETL only consists of moving data from the source into a star schema structure to ease reporting, Truncate/Reload can be a suitable option. However, for complex ETL, I prefer Merge.

Does the source have data modification timestamps?

Having these can be very helpful in the Merge strategy.
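
For example, with a ModifiedDate column on the source table and a stored watermark of the last successful load (both names below are hypothetical), the extract reduces to pulling only the changed rows:

-- Extract only the rows changed since the last successful load;
-- table and column names are hypothetical.
DECLARE @LastLoadDate DATETIME2 =
    (SELECT MAX(loaded_up_to) FROM dbo.etl_watermark WHERE source_table = 'Customers');

SELECT CustomerID, CustomerName, City, ModifiedDate
FROM   dbo.Customers
WHERE  ModifiedDate > @LastLoadDate;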

Would you need to extend the solution?

If there is a chance of adding more data sources, Merge is the better option. It is usually difficult to add a new data source if the ETL is designed around the Truncate/Reload strategy.

How long is your ETL operation window?

In most cases this isn’t an issue, but it is still worth considering upfront.

I hope that was helpful. Feel free to let me know about any other issues you consider before selecting your ETL strategy.