Table variables are fast but #Temp tables are faster!!

This blog post is sort of a note to myself. Use-temp-tables-instead-of-table-variables-when-the-number-of-records-is-large. There it is, and I promise to remember it every time I write a DECLARE @something TABLE (… statement.
And now a bit of explanation. I recently worked on a project where we added a new fact table to our data warehouse. An SSIS package was created to get the data from the source, do the dimension lookups and populate the fact table. Simple stuff, except that I had to do some data manipulation while selecting data from the source. Now, I could have just written a stored procedure and used that as the data source, but we are all aware of the pitfalls of that, so I decided to use a table-valued function (TVF). TVFs have predictable metadata output and SSIS plays nicely with them. However, you cannot use temp tables in a TVF, so invariably I ended up using a table variable. The package worked perfectly for smaller workloads, but as we increased the number of records in the table variable, performance degraded more and more. A casual Google search turned up a raft of blog posts suggesting that the table variable might be the culprit. So I changed the function to a stored procedure with a temp table and used the stored procedure in SSIS. Sure enough, our performance issue was resolved like it never existed. There are many blog posts by very intelligent people on why temp tables perform better than table variables, but I wanted to check first-hand, and here is my little investigation.
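For context, the TVF-with-table-variable pattern looked roughly like this. This is a minimal sketch with hypothetical object and column names, not the actual production code:

```sql
-- Sketch only: hypothetical function, table and column names.
-- A multi-statement TVF cannot use temp tables, so a table variable
-- is the only option for intermediate storage.
CREATE FUNCTION dbo.fn_GetFactRows ()
RETURNS @result TABLE (
	id INT
	,amount MONEY
	)
AS
BEGIN
	INSERT INTO @result (id, amount)
	SELECT id
		,amount -- plus whatever manipulation the load requires
	FROM dbo.SourceTable

	RETURN
END
```

SSIS can read the metadata of @result at design time, which is why this pattern plays nicely with data flow sources.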

1. Create two temp tables and populate them with random data. These have nothing to do with the actual results; they are only used for data generation.

/* Create base tables which will contain random generated data */
IF OBJECT_ID('tempdb..#basetable') IS NOT NULL
	DROP TABLE #basetable

IF OBJECT_ID('tempdb..#basetable2') IS NOT NULL
	DROP TABLE #basetable2

CREATE TABLE #basetable (
	id INT
	,pdate DATETIME
	,sdate DATETIME
	)

CREATE TABLE #basetable2 (
	id INT
	,pdate DATETIME
	);

WITH id
AS (
	SELECT id = 1
	
	UNION ALL
	
	SELECT id + 1
	FROM id
	WHERE id <= 1000
	)
INSERT INTO #basetable
SELECT id
	,cast(abs(checksum(NEWID())) % 100000 AS DATETIME) -- NEWID() varies per row; RAND() is evaluated once per query, giving every row the same value
	,cast(abs(checksum(NEWID())) % 100000 AS DATETIME)
FROM id
OPTION (MAXRECURSION 0)

;WITH id
AS (
	SELECT id = 1
	
	UNION ALL
	
	SELECT id + 1
	FROM id
	WHERE id <= 1000
	)
INSERT INTO #basetable2
SELECT id
	,cast(abs(checksum(NEWID())) % 100000 AS DATETIME)
FROM id
OPTION (MAXRECURSION 0)

2. Get the records generated in the temp tables above into two table variables, join them, and update the first one with values from the second.

/* Get the records generated above in two table variables. 
Join the table variables on id column and update pdate column */
DECLARE @table1 TABLE (
	id INT
	,pdate DATETIME
	,sdate DATETIME
	)
DECLARE @table2 TABLE (
	id INT
	,pdate DATETIME
	)

INSERT INTO @table1
SELECT id
	,pdate
	,sdate
FROM #basetable

INSERT INTO @table2
SELECT id
	,pdate
FROM #basetable2

UPDATE t1
SET pdate = t2.pdate
FROM @table1 t1
LEFT JOIN @table2 t2
	ON t1.id = t2.id

If you look at the estimated and actual execution plans below, they are exactly the same. The optimizer chooses a nested loop join because there are no statistics on table variables, so it assumes they contain very few rows.
Not to mention this is very slow, especially when I increased the number of records to over a million.

Estimated query execution plan when using table variables

Actual query execution plan when using table variables
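One mitigation worth mentioning, as a sketch rather than something from the original test: adding OPTION (RECOMPILE) to the statement lets the optimizer see the table variable's actual row count at execution time, which can produce a better join choice even though column statistics are still unavailable:

```sql
-- Sketch: with OPTION (RECOMPILE), the plan is compiled after the table
-- variables are populated, so their true cardinality is visible and the
-- optimizer may pick a hash match instead of a nested loop.
UPDATE t1
SET pdate = t2.pdate
FROM @table1 t1
LEFT JOIN @table2 t2
	ON t1.id = t2.id
OPTION (RECOMPILE)
```

The trade-off is a fresh compilation on every execution, which may matter for frequently called statements.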

3. Get the records generated in step 1 into another pair of temp tables, join them, and update the first one with values from the second.

/* Get the records generated above in two temp tables. 
Join the two temp tables on id column and update pdate column */

IF OBJECT_ID('tempdb..#table1') IS NOT NULL
	DROP TABLE #table1

IF OBJECT_ID('tempdb..#table2') IS NOT NULL
	DROP TABLE #table2

CREATE TABLE #table1 (
	id INT
	,pdate DATETIME
	,sdate DATETIME
	)

CREATE TABLE #table2 (
	id INT
	,pdate DATETIME
	)

INSERT INTO #table1
SELECT id
	,pdate
	,sdate
FROM #basetable

INSERT INTO #table2
SELECT id
	,pdate
FROM #basetable2

UPDATE t1
SET pdate = t2.pdate
FROM #table1 t1
LEFT JOIN #table2 t2
	ON t1.id = t2.id

Now look at the estimated and actual execution plans. SQL Server estimated a nested loop join but actually used a hash match, which is faster in this situation. SQL Server was able to do this because statistics are available on temporary tables.

Estimated query execution plan when using temp tables

Actual query execution plan when using temp tables
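Temp tables also accept indexes, which table variables do not (at least before the inline index syntax introduced in SQL Server 2014). A sketch, with a hypothetical index name, that would further help the join above:

```sql
-- Sketch: a clustered index on the join column gives the optimizer both
-- an access path and index statistics; column statistics on joined
-- columns are also created automatically for temp tables.
CREATE CLUSTERED INDEX ix_table1_id ON #table1 (id)
```

This is another lever that simply is not available when the data sits in a table variable.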

This does not mean that table variables are never useful. They are brilliant when the number of rows in them is small (fewer than about 1,000 as a rule of thumb); creating a temp table in that situation would give little, if any, benefit. They are also light on the transaction log, since their scope is limited to the current execution context, and they lead to fewer recompilations when used in stored procedures.
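As a quick illustration of the transaction-log point, a minimal sketch: table variables are not affected by a ROLLBACK, unlike temp tables:

```sql
-- Sketch: modifications to a table variable are not undone by ROLLBACK,
-- which is part of why they are lighter on the transaction log.
DECLARE @t TABLE (id INT)

BEGIN TRAN
INSERT INTO @t VALUES (1)
ROLLBACK

SELECT COUNT(*) FROM @t -- still returns 1; the insert survived the rollback
```

The same sequence against a temp table would return 0, because the insert would be rolled back with the transaction.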
So next time you decide to use a table variable, think twice. Consider the effect on the query execution plan, and definitely avoid them if you are moving a large amount of data.
