Sunday, December 28, 2008

Indexes for aggregates

It’s well known that indexes on columns used in where clause and for joins is a good thing in SQL, but what about other places. How about on aggregates?

Consider a simple table with an amount and a customerID. It’s a common requirement to calculate the total amount that each customer has paid. No conditions are enforced, so this would seem like a place where an index won’t help. Well, let’s see. (sample code at end)

The clustered index (and hence the physical order of the rows) is on the identity column.Take the following query.
SELECT CustomerID, SUM(Amount) FROM Payments group by customerID

Without any additional indexes, SQL will execute that as a hash match (aggregate) which comprises 63% of the query’s cost. That is because the data is not ordered by the grouped column, and hence cannot be simply be summed.

Now add an index on CustomerID, Amount and run the same query. Now, the aggregate can be satisfied just with the non clustred index and, more importantly, the data is in order of customerID, so just a stream aggregate is required and the data does not have to be resorted or run through a hashing algorithm. Moreover, the stream aggregate is only 15% of the query’s cost

CREATE TABLE Payments (
PaymentKey INT IDENTITY PRIMARY KEY,
CustomerID int,
InvoiceID int,
PaymentDate datetime,
Amount Numeric(17,5)
)

;WITH DataPopulate (RowNo, CustomerID ,InvoiceID ,ADate, Amount) AS (
SELECT 1 AS RowNo, FLOOR(RAND()*40) as CustomerID, FLOOR(RAND()*200) as InvoiceID,
DATEADD(dd, FLOOR(RAND()*75454),'1753/01/01'), rand()*856542 AS Amount
UNION ALL
SELECT rowNo+1, FLOOR(RAND(RowNo*85455)*40) AS CustomerID,
FLOOR(RAND(RowNo*85455)*500) AS InvoiceID,
DATEADD(dd, FLOOR(RAND(RowNo*96322)*85454),'1753/01/01'),
rand(RowNo*25411)*86542 AS Amount
FROM DataPopulate WHERE RowNo<10000
)
INSERT INTO Payments (CustomerId, InvoiceID, PaymentDate, Amount)
SELECT CustomerID ,InvoiceID ,ADate, Amount FROM DataPopulate
OPTION (MAXRECURSION 10000)

No comments: