I just completed Computing for Data Analysis on Coursera. The course is a brief introduction to the R programming language. R has been around for years but has been gaining a lot of attention recently thanks to the *Big Data* and *Data Science* trends. I had a rough idea of R, and the course offered a wonderful opportunity to learn it in a systematic manner. So, for some more practice and a bit of fun (OK, I admit, more for fun than practice), I decided to implement the 'Sieve of Eratosthenes' in both R and SQL and see which one is faster (because that's what you do on a lazy Saturday!). The sieve is a method for finding all primes up to a given number. The R code looks like this:

getPrimeNumTilln <- function(n) {
    # a holds the sequence of numbers from 2 to n
    a <- c(2:n)
    # we start from 2 since it is the first prime number;
    # l is also the loop variable
    l <- 2
    # r holds the results
    r <- c()
    # loop while the square of the loop variable is at most n
    # (<= rather than <, so that squares of primes such as 9 or 25 are sieved out)
    while (l * l <= n) {
        # the next prime is the first element of vector a
        # (e.g. in the first iteration it is 2)
        r <- c(r, a[1])
        # remove the elements of a that are multiples of l
        # (e.g. in the first iteration it removes 2, 4, 6, 8, ...)
        a <- a[-(which(a %% l == 0))]
        # the new loop variable is the first element of the remaining a
        # (e.g. after the first iteration it is 3, then 5, since 4 has been removed)
        l <- a[1]
    }
    # the result is r plus all the remaining elements of a
    c(r, a)
}
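As a quick sanity check, here is the same sieve repeated as a self-contained snippet (with the loop condition written as `l * l <= n`, which also handles values of n that are squares of primes, such as 9 or 25), together with a small usage example:

```r
# Sieve of Eratosthenes, as defined above
getPrimeNumTilln <- function(n) {
    a <- c(2:n)   # candidate numbers 2..n
    l <- 2        # current prime / loop variable
    r <- c()      # confirmed primes
    while (l * l <= n) {
        r <- c(r, a[1])                 # first remaining element is prime
        a <- a[-(which(a %% l == 0))]   # drop all multiples of l
        l <- a[1]                       # next candidate prime
    }
    c(r, a)
}

print(getPrimeNumTilln(30))
# returns 2 3 5 7 11 13 17 19 23 29
length(getPrimeNumTilln(100))
# returns 25, the number of primes below 100
```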

And the SQL code looks like this.

DECLARE @maxnum INT = 1000 /* the number up to which to find primes */
DECLARE @l INT = 2         /* beginning of the prime numbers; also the loop variable */
DECLARE @table TABLE (a INT NOT NULL PRIMARY KEY) /* holding table */

;WITH ct /* generate the sequence of numbers 2..@maxnum */
AS (
    SELECT 2 AS id
    UNION ALL
    SELECT id + 1
    FROM ct
    WHERE id < @maxnum
)
INSERT INTO @table
SELECT id
FROM ct
OPTION (MAXRECURSION 0)

/* loop while the square of the loop variable is at most @maxnum
   (<= rather than <, so that squares of primes such as 9 or 25 are sieved out) */
WHILE (@l * @l <= @maxnum)
BEGIN
    /* remove the records that are divisible by @l (but keep @l itself) */
    DELETE
    FROM @table
    WHERE a != @l
        AND (a % @l) = 0

    /* the next remaining number is the next prime */
    SELECT @l = MIN(a)
    FROM @table
    WHERE a > @l
END

SELECT COUNT(*)
FROM @table

Now, I am no expert in either maths or algorithms, but that looks neat. To validate the results, I ran both to count the primes under 1,000,000, and each returned 78,498. Wolfram Alpha agrees.

For smaller *n*, up to 10k, the results are comparable; both finish in milliseconds. Above that, however, R has the upper hand.

| n | R (s) | SQL (s) |
| --- | --- | --- |
| 100,000 | 0.02 | 1.72 |
| 1,000,000 | 0.56 | 19.00 |
| 10,000,000 | 11.25 | 246.21 |

Please note that the times are in seconds. The difference is quite stark, especially for large n: R is killing SQL.

A few observations on the SQL side:

- A significant amount of time is spent generating the sequence. With SQL Server 2012 sequences, we might be able to improve on this.
- The delete operation is quite slow, as we all know. I tried replacing it with an update, but that made things worse.

I know there are many improvements that could be made, but I am happy with my little experiment. As usual, comments are always welcome.

Hi Swanand,

As you say, the heaviest cost in your SQL version is the sequence generation. Try this cross-join number generator instead, and then re-compare your SQL and R solutions.

WITH
L0 AS (SELECT 1 AS c UNION ALL SELECT 1),
L1 AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
L2 AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
L3 AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
L4 AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
L5 AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
Nums AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n FROM L5)
SELECT TOP (@maxnum) n FROM Nums ORDER BY n;
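To slot this into the original script, the recursive CTE could be replaced along these lines. This is only a sketch, not benchmarked; the `TOP (@maxnum - 1)` and the `n + 1` shift are my additions, so that the sequence starts at 2 as the rest of the script expects:

```sql
DECLARE @maxnum INT = 1000;
DECLARE @table TABLE (a INT NOT NULL PRIMARY KEY);

WITH
L0 AS (SELECT 1 AS c UNION ALL SELECT 1),                 /* 2 rows */
L1 AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),    /* 4 rows */
L2 AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),    /* 16 rows */
L3 AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),    /* 256 rows */
L4 AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),    /* 65,536 rows */
L5 AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),    /* ~4.3 billion rows, consumed lazily */
Nums AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n FROM L5)
INSERT INTO @table
SELECT TOP (@maxnum - 1) n + 1   /* shift so the sequence runs 2..@maxnum */
FROM Nums
ORDER BY n;
```

The sieve loop and final `SELECT COUNT(*)` from the original script then follow unchanged.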