SQL

Sieve of Eratosthenes in R and SQL

I just completed Computing for Data Analysis on Coursera. The course is brief introduction to R programming language. R has been around for years but is gaining much attention recently due to Big Data and Data Science trends. I had an idea about R and the course offered a wonderful opportunity to learn in a systematic manner. So for some more practice and a bit of fun (ok, I admit, more for fun than practice), I decided to implement ‘Sieve of Eratosthenes’ in R and SQL and see which one is faster (because that’s what you do on a lazy Saturday!!) This is a method to find primes up to a given number. The R code looks like this.


getPrimeNumTilln <- function(n) {

# a holds a sequence of numbers from 2 to n
a <- c(2:n)
# we start from 2 since it is the beginning of prime numbers,
# it is also the loop varibale
l <- 2
# r this vector holds the results
r <- c()
#while the square of loop variable is less than n
while (l*l < n) {
# the prime number is the first element in vector a
# for e.g. in first iteration it will be 2
r <- c(r,a[1])

# remove elements from a which are multiples of l
# for e.g. in first iteration it will remove 2,4,6,8…
a <- a[-(which(a %% l ==0))]

# the loop varibale if the first variable in remaining a
# for e.g after first iteration, it will be 3, then 5 (since 4 has been removed)…
l <- a[1]
}
# the result is r and all the remaining elements in a
c(r,a)
}

And the SQL code looks like this.

DECLARE @maxnum INT = 1000 /* The number under which you want to find primes*/

DECLARE @l INT = 2 /* Beginning of prime numbers */

DECLARE @table TABLE (a INT NOT NULL PRIMARY KEY) /* Holding table*/

;WITH ct /* Generate the sequence of numbers*/
AS (
SELECT 2 AS id

UNION ALL

SELECT id + 1
FROM ct
WHERE id < @maxnum
)

INSERT INTO @table
SELECT id
FROM ct
OPTION (MAXRECURSION 0)

WHILE (@l * @l < @maxnum)
BEGIN
/*remove records which are divisible by l*/
DELETE
FROM @table
WHERE a != @l
AND (a % @l) = 0

/* the first remaining number is the prime number */
SELECT @l = MIN(a)
FROM @table
WHERE a > @l
END

SELECT COUNT(*)
FROM @table
Now, I am no expert in either maths or algorithms but that looks neat. To validate that the results are right, I ran it to check how many prime numbers are under 1000000 and both returned 78948. Wolfram Alpha seems to agree.

For smaller up to  10k, the results are comparable; they are in milliseconds. Above that,however,R seems to have an upper hand.

n R SQL
100000 0.02 1.72
1000000 0.56 19.00
10000000 11.25 246.21

Please note that the time is in seconds. The difference is quite stark especially for large n. R is killing SQL.

A few observations on SQL side are

  1. A significant time is being spent on generating sequence. With SQL Server 2012 Sequences, we might be able improve time.
  2. The delete operation is quite slow as we all know. I tried replacing it with update but that made it worse.

I know that there are many improvements that can be made to this but I am happy with my little testing. As usual comments are always welcome.