Wednesday, October 8, 2008

Exists Vs. Count(*) - The battle never ends..

Many of the database applications written today still disregard some basic rules of thumb when it comes to accessing the data. One in particular is the use of COUNT(*) to check to see if there are any rows that match some criteria. The technique of using EXISTS over COUNT(*) has been widely publicized and is in pretty much every best practices document I have come across. So why are database developers still using COUNT(*) instead of EXISTS all over the place? If it is because people just don't believe or recognize the benefits of it? Or is the concept of EXISTS difficult for new programmers to grasp? Lets see demo of why EXISTS is almost always a better way to code when you simply need to see if there is at least 1 row that matches some condition in the WHERE clause. Of course if you really need to know exactly how many match that condition then COUNT(*) is appropriate so hopefully this won't confuse anyone in that regard.

Lets use the Adventureworks database and turn statistics IO on so we can see the number of reads associated with each query. We will then compare COUNT(*) with EXISTS so there is no mistake on how much work is being done in relation to each other.

USE Adventureworks
GO

SET STATISTICS IO ON

GO

First lets look at a situation in which there is an index to satisfy the WHERE clause and there are only 2 matching rows:

IF (SELECT COUNT(*) FROM sales.salesorderdetail
WHERE ProductID = 870 ) > 0

Print 'Yes'

IF EXISTS (SELECT * FROM sales.salesorderdetail WHERE ProductID = 870)
Print 'Yes'

Table 'SalesOrderDetail'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'SalesOrderDetail'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

As we can see from the Logical reads there was only a difference of 1 between the two queries since the amount of matching rows was so small. So in this case there was not a significant difference although it was still 1/3 more expensive to use COUNT(*) instead of EXISTS.

Now lets do the same but with 4688 matching rows:

IF (SELECT COUNT(*) FROM sales.salesorderdetail
WHERE ProductID = 897 ) > 0

Print 'Yes'

IF EXISTS (SELECT * FROM sales.salesorderdetail WHERE ProductID = 897)
Print 'Yes'

Table 'SalesOrderDetail'. Scan count 1, logical reads 11, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'SalesOrderDetail'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Now the cost for the COUNT(*) is over 5 times the EXISTS. This still may not seem like a lot to you. But if you were calling this queries thousands of times a second this would be a big deal. And remember this index is pretty small overall still.

OK now lets try this on a column with no index in that same table. There are 357 rows that match but there are a total of 121,317 in the table.

IF (SELECT COUNT(*) FROM sales.salesorderdetail
WHERE ModifiedDate = '20010701 00:00:00.000' ) > 0

Print 'Yes'

IF EXISTS (SELECT * FROM sales.salesorderdetail WHERE ModifiedDate = '20010701 00:00:00.000')
Print 'Yes'

Table 'SalesOrderDetail'. Scan count 1, logical reads 1241, physical reads 0, read-ahead reads 331, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'SalesOrderDetail'. Scan count 1, logical reads 5, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

This is now almost 250 times more expensive to do a COUNT(*) vs. an EXISTS. Both queries scanned the table but the EXISTS was able to at least do a partial scan do to the fact it can stop after it finds the very first matching row. Where as the COUNT(*) must read each and every row in the entire table to determine if they match the criteria and how many there are. That is the key folks. The ability to stop working after the first row that meets the criteria of the WHERE clause is what makes EXISTS so efficient. The optimizer knows of this behavior and can factor that in as well. Now keep in mind that these tables are relatively small compared to most databases in the real world. So the figures of the COUNT(*) queries would be multiplied many times on larger tables. You could easily get hundred's of thousands of reads or more on tables with millions of rows but the EXISTS will still only have just a few reads on any queries that can use an index to satisfy the WHERE clause.

Hopefully this will help to persuade those last holdouts who insist on using COUNT(*) everywhere even when EXISTS is the clear choice. One last note, make sure to turn off the statistics IO when done.

SET STATISTICS IO OFF

No comments: