I just completed Computing for Data Analysis on Coursera. The course is a brief introduction to the R programming language. R has been around for years but is getting a lot of attention recently thanks to the *Big Data* and *Data Science* trends. I had a rough idea of R, and the course offered a wonderful opportunity to learn it in a systematic manner. So for some more practice and a bit of fun (ok, I admit, more for fun than practice), I decided to implement the ‘Sieve of Eratosthenes’ in both R and SQL and see which one is faster (because that’s what you do on a lazy Saturday!). The sieve is a method for finding all primes up to a given number. The R code looks like this.

```r
getPrimeNumTilln <- function(n) {
  # a holds the candidate numbers from 2 to n
  a <- 2:n
  # we start from 2 since it is the first prime;
  # l is also the loop variable
  l <- 2
  # r holds the primes found so far
  r <- c()
  # while the square of the loop variable is at most n
  # (<= rather than <, so perfect squares such as 25 get crossed off too)
  while (l * l <= n) {
    # the next prime is the first element of a
    # e.g. in the first iteration it is 2
    r <- c(r, a[1])
    # drop the elements of a that are multiples of l
    # e.g. in the first iteration this removes 2, 4, 6, 8, ...
    a <- a[a %% l != 0]
    # the new loop variable is the first element of the remaining a
    # e.g. after the first iteration it is 3, then 5 (since 4 has been removed), ...
    l <- a[1]
  }
  # the result is r plus all the remaining elements of a
  c(r, a)
}
```
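As a quick sanity check, here is a condensed, self-contained copy of the function (note the `<=` in the loop test, so that perfect squares such as 25 are crossed off) together with two known values:

```r
# Condensed copy of getPrimeNumTilln so this snippet runs on its own
getPrimeNumTilln <- function(n) {
  a <- 2:n
  l <- 2
  r <- c()
  while (l * l <= n) {
    r <- c(r, a[1])      # first remaining element is prime
    a <- a[a %% l != 0]  # drop multiples of l (including l itself)
    l <- a[1]            # next surviving candidate
  }
  c(r, a)
}

getPrimeNumTilln(30)           # 2 3 5 7 11 13 17 19 23 29
length(getPrimeNumTilln(100))  # 25
```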

And the SQL code looks like this.

```sql
DECLARE @maxnum INT = 1000 /* find primes up to this number */
DECLARE @l INT = 2         /* beginning of prime numbers */
DECLARE @table TABLE (a INT NOT NULL PRIMARY KEY) /* holding table */

;WITH ct /* generate the sequence of numbers 2..@maxnum */
AS (
    SELECT 2 AS id
    UNION ALL
    SELECT id + 1
    FROM ct
    WHERE id < @maxnum
)
INSERT INTO @table
SELECT id
FROM ct
OPTION (MAXRECURSION 0)

/* <= rather than <, so perfect squares such as 25 get removed too */
WHILE (@l * @l <= @maxnum)
BEGIN
    /* remove records which are divisible by l (but keep l itself) */
    DELETE
    FROM @table
    WHERE a != @l
      AND (a % @l) = 0

    /* the first remaining number above l is the next prime */
    SELECT @l = MIN(a)
    FROM @table
    WHERE a > @l
END

SELECT COUNT(*)
FROM @table
```

Now, I am no expert in either maths or algorithms, but that looks neat. To validate the results, I counted how many primes there are below 1000000; both versions returned 78498. Wolfram Alpha seems to agree.

For smaller *n*, up to 10k, the results are comparable; both run in milliseconds. Above that, however, R seems to have the upper hand.

| n | R (seconds) | SQL (seconds) |
| --- | --- | --- |
| 100000 | 0.02 | 1.72 |
| 1000000 | 0.56 | 19.00 |
| 10000000 | 11.25 | 246.21 |

Please note that the times are in seconds. The difference is quite stark, especially for large n; R is killing SQL.

A few observations on the SQL side:

- A significant amount of time is spent generating the sequence. With SQL Server 2012 sequences, we might be able to improve the time.
- The delete operation is quite slow, as we all know. I tried replacing it with an update, but that made it worse.

I know that there are many improvements that can be made to this, but I am happy with my little experiment. As usual, comments are always welcome.
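On the R side, one common improvement (not what my code above does) is to sieve over a logical flag vector instead of repeatedly shrinking the candidate vector; a minimal sketch, with `sievePrimes` being my own name for it:

```r
# Vectorized sieve over a logical flag vector
# (sievePrimes is an illustrative name, not part of the code above)
sievePrimes <- function(n) {
  stopifnot(n >= 2)
  flags <- rep(TRUE, n)  # flags[i] means "i is still a prime candidate"
  flags[1] <- FALSE      # 1 is not prime
  if (n >= 4) {
    for (l in 2:floor(sqrt(n))) {
      if (flags[l]) {
        # cross off multiples of l, starting at l^2
        flags[seq(l * l, n, by = l)] <- FALSE
      }
    }
  }
  which(flags)           # the indices still flagged are the primes
}

length(sievePrimes(1000000))  # 78498
```

Wrapping either version in `system.time()` gives elapsed seconds like those in the table above.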