Month: February 2013

Microsoft Data Explorer for Excel 2010/2013

Chris Webb posted about Data Explorer being available for public preview and I thought I would give it a quick spin. You can download it here. The installation is pretty straightforward. After installation, when you open Excel (I am using Excel 2010), you will see a new Data Explorer tab like the one below.

Data Explorer Tab in Excel 2010


Data Explorer can import data from a number of sources, including Facebook (how cool is that!!!). The complete list of sources and how-tos is here. For the purpose of this small exercise, I am using data from Lending Club, which is available to download from their website.
Once the data source is set up and connected, a new query screen shows up, and this is where many of the transformations can be done. I am going to list a few here which can be of use to general users.

1. Change Type of column
By default, the data type of a column is set to Auto-Detect. You can change it to a specific data type by right-clicking on the column and selecting the type from the Change Type option.

Change Data Type of Column in Data Explorer


2. Transformations
Similar to changing the data type, you can do some simple transformations on the columns. To do this, right-click on the column in the query window and select Transform. The options in the Transform menu depend on the data type of the column; for numeric columns, this option is greyed out.

Data Transform Options in Data Explorer


3. Adding a new column
New columns can be added very easily, and formulas can be used. The complete formula reference is listed on the Office help page.
For example, in the screenshot below I am adding a new column which takes the lower end of the FICO score range (a FICO score is like a credit score in the US).
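For the curious, the formula behind such a column would be something along these lines (just a sketch: it assumes the file has a text column named FICO.Range holding values like “735-739”, and the function names are taken from the formula reference, so the preview may generate something slightly different):

= Number.FromText(Text.Start([FICO.Range], 3))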

Add a new column in Data Explorer


4. Undo transform
One of the nice features is the ability to undo a transform (oh!! how I miss the undo button in SSIS). On the left-hand side, you will see a Steps section. If you expand it, it lists all the steps you have taken so far to transform the data. Clicking the X against a step will undo the transform you did in that step. If you try to remove a step other than the last one, you get a warning.
Some other handy transforms are Split, Replace Values and Group By, and I think they are self-explanatory.

I did face an error, though, while doing some transformations. The file mentioned above contained 11428 records. After doing some transformations, I was trying to save the query and get the data into a workbook, at which point I was persistently getting the following error. Not to mention, I sent a lot of frowns to MS. What worked was using a sample of rows to do the transformations and then importing all the rows into the workbook.

Error in Data Explorer After saving query


All in all, it’s a nice utility for doing some simple transformations and combining data from different sources in Excel. Soon I am planning to take a look at how it works with Facebook data.

Run SQL scripts from a folder using sqlcmd

This is a very short post about a very useful tip. I came across it here on Stack Overflow and it certainly deserves a post. Many thanks to the answerer, and I think this is his blog. I already follow it and would highly encourage you to do the same if you are interested in SQL Server.

Very often I need to execute all the SQL scripts from a folder. There are tools out there to do this, but I wanted something very simple, like a batch command. So let’s say you want to execute all the SQL scripts from the folder ‘D:\Scripts’ on the server ‘localhost’ against the database ‘playbox’. Here are the steps:

1. Open command prompt.

2. Navigate to the scripts folder.

3. Type the following command.

for %f in (*.sql) do sqlcmd /S localhost /d playbox /E /i "%f"

What this will do is iterate through all the .sql files in the folder and execute each one using sqlcmd. /S (server), /d (database) and /E (use a trusted connection) are sqlcmd parameters. The beauty of this solution is that you don’t have to install anything. Wrap a few of these statements in a batch file and you can execute scripts from different folders in one go. Neat!!!
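For example, a small batch file along these lines would run the scripts from a couple of folders back to back (a sketch only: the folder names are made up, and note that inside a .bat file the loop variable needs a double percent sign):

@echo off
rem run every .sql file from each (hypothetical) scripts folder against the playbox database
cd /d D:\Scripts\Tables
for %%f in (*.sql) do sqlcmd /S localhost /d playbox /E /i "%%f"
cd /d D:\Scripts\Procedures
for %%f in (*.sql) do sqlcmd /S localhost /d playbox /E /i "%%f"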

Alessandro Alpi's Blog

I often find questions on the forums about this topic. For instance:

  1. “WHERE or HAVING?”
  2. “Is the ON clause more efficient than the WHERE clause?”
  3. “Why can’t field aliases be used with the GROUP BY clause?”
These three questions are syntax oriented. Actually, there are a lot of requests about subqueries, temporary objects, sort operations and so on. These kinds of questions can be answered by reading this logical process document (pdf). I love it and I share it everywhere, including when training in classes or on the job.
It is the LOGICAL workflow used by the Query Processor in order to generate the related PHYSICAL process for retrieving data.
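To see the third question in action (using a hypothetical Sales.Orders table): the logical order is FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY, so an alias defined in the SELECT list does not exist yet when GROUP BY is evaluated.

SELECT YEAR(OrderDate) AS OrderYear, COUNT(*) AS OrderCount
FROM Sales.Orders /* hypothetical table */
GROUP BY YEAR(OrderDate) /* the expression must be repeated; GROUP BY OrderYear would fail */
ORDER BY OrderYear /* ORDER BY runs after SELECT, so the alias is fine here */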


business intelligist

Microsoft once again is ranked the highest in its ability to execute by Gartner in its Magic Quadrant for Business Intelligence. The report is pretty long and boring, but if I were to distill it down to a few salient points, I would choose the following:

  1. Companies see a dramatic improvement in BI capabilities going from SQL Server 2008 R2 to SQL Server 2012
  2. Microsoft got a nod for some of the upcoming stuff including Explorer, GeoFlow and PolyBase
  3. Gartner mentioned “shortening product update cycles” for Microsoft SQL, Office and SharePoint
  4. “it is widely deployed in large enterprises as a standard with among the highest data volumes and user counts”
  5. Microsoft is dinged on basically two fronts: mobility and CPM.


Sieve of Eratosthenes in R and SQL

I just completed Computing for Data Analysis on Coursera. The course is a brief introduction to the R programming language. R has been around for years but is gaining much attention recently due to the Big Data and Data Science trends. I had some idea about R, and the course offered a wonderful opportunity to learn it in a systematic manner. So for some more practice and a bit of fun (ok, I admit, more for fun than practice), I decided to implement the ‘Sieve of Eratosthenes’ in R and SQL and see which one is faster (because that’s what you do on a lazy Saturday!!). The sieve is a method for finding all the primes up to a given number. The R code looks like this.


getPrimeNumTilln <- function(n) {

# a holds a sequence of numbers from 2 to n
a <- c(2:n)
# we start from 2 since it is the first prime number;
# it is also the loop variable
l <- 2
# r holds the result vector (the primes found so far)
r <- c()
# while the square of the loop variable is less than or equal to n
# (<= so that n is handled correctly when n is the square of a prime, e.g. 49)
while (l*l <= n) {
# the prime number is the first element in vector a
# for e.g. in first iteration it will be 2
r <- c(r,a[1])

# remove elements from a which are multiples of l
# for e.g. in first iteration it will remove 2,4,6,8…
a <- a[-(which(a %% l ==0))]

# the loop variable is now the first element remaining in a
# e.g. after the first iteration it will be 3, then 5 (since 4 has been removed)…
l <- a[1]
}
# the result is r and all the remaining elements in a
c(r,a)
}
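A quick way to count and time it from the R console looks like this:

# time the call and count the primes below one million
system.time(p <- getPrimeNumTilln(1000000))
length(p) # 78498 primes below 1,000,000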

And the SQL code looks like this.

DECLARE @maxnum INT = 1000 /* The number under which you want to find primes*/

DECLARE @l INT = 2 /* Beginning of prime numbers */

DECLARE @table TABLE (a INT NOT NULL PRIMARY KEY) /* Holding table*/

;WITH ct /* Generate the sequence of numbers*/
AS (
SELECT 2 AS id

UNION ALL

SELECT id + 1
FROM ct
WHERE id < @maxnum
)

INSERT INTO @table
SELECT id
FROM ct
OPTION (MAXRECURSION 0)

WHILE (@l * @l <= @maxnum) /* <= so that squares of primes are also sieved out */
BEGIN
/*remove records which are divisible by l*/
DELETE
FROM @table
WHERE a != @l
AND (a % @l) = 0

/* the first remaining number is the prime number */
SELECT @l = MIN(a)
FROM @table
WHERE a > @l
END

SELECT COUNT(*)
FROM @table

Now, I am no expert in either maths or algorithms, but that looks neat. To validate that the results are right, I ran both to check how many prime numbers there are under 1000000, and both returned 78498. Wolfram Alpha seems to agree.

For smaller n, up to 10k or so, the results are comparable; both run in milliseconds. Above that, however, R seems to have the upper hand.

n           R        SQL
100000      0.02     1.72
1000000     0.56     19.00
10000000    11.25    246.21

Please note that the times are in seconds. The difference is quite stark, especially for large n; R is killing SQL.

A few observations on the SQL side:

  1. A significant amount of time is spent on generating the sequence. With SQL Server 2012 sequences, or a set-based approach like the sketch after this list, we might be able to improve the time.
  2. The delete operation is quite slow, as we all know. I tried replacing it with an update but that made it worse.
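On the first point, here is a minimal sketch of a set-based alternative to the recursive CTE. It uses a cross-joined tally rather than the SEQUENCE object itself, but it generates the numbers without any recursion.

DECLARE @maxnum INT = 1000000

;WITH ten AS ( /* ten rows */
SELECT v FROM (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) AS t(v)
),
nums AS ( /* up to ten million rows without recursion */
SELECT TOP (@maxnum - 1) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) + 1 AS id
FROM ten a CROSS JOIN ten b CROSS JOIN ten c CROSS JOIN ten d
CROSS JOIN ten e CROSS JOIN ten f CROSS JOIN ten g
)
SELECT id /* 2 .. @maxnum, ready to insert into the holding table */
FROM nums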

I know that there are many improvements that can be made to this, but I am happy with my little testing. As usual, comments are always welcome.