Sunday, February 8, 2009

String comparison: binary vs. dictionary

It is well known that you get better performance if you compare two strings in a binary collation than in a dictionary collation. But is the performance difference significant enough to warrant the use of an explicit COLLATE clause in the string comparison expression? That was a question came up in a conversation I had with a colleague recently.

To get a feel for the extent of the performance difference, I ran several tests comparing two commonly-used collations:

· Latin1_General_BIN, and
· SQL_Latin1_General_CP1_CI_AS

More specifically, I ran the following T-SQL script in a loop for 100 times:

DECLARE @s1 varchar(max), @s2 varchar(max), @i int
DECLARE @dummy int, @dt datetime
DECLARE @length int

SET @length = 1000000 -- or 100, or 1000000

SELECT @s1 = REPLICATE(CAST('a' as varchar(max)), @length)
SELECT @s2 = REPLICATE(CAST('a' as varchar(max)), @length)

SELECT @i = 1, @dt = GETDATE()

WHILE @i < 1000
BEGIN
IF @s1 = @s2 COLLATE SQL_Latin1_General_CP1_CI_AS
--IF @s1 = @s2 COLLATE Latin1_General_BIN
SET @dummy =0
SET @i = @i + 1
END

SELECT DATEDIFF(ms, @dt, GETDATE())


The script was run in a number of scenarios:


· The @length variable was set to 100, 10,000, and 1,000,000 to see the impact of the string length on the comparison method,

· The string comparison was done in the IF clause with either the Latin1_General_BIN collation or the SQL_Latin1_General_CP1_CI_AS collation

The most salient point to note is that as the string length increases, the performance difference between the two comparison methods increases. While there was no performance difference when the strings were 100 bytes long, the binary comparison was about twice as fast as the dictionary comparison when the string length was 10,000 bytes. When the string length increased to 1,000,000, the binary comparison was about seven times faster.

So, if you are comparing two short strings, you probably shouldn’t bother to explicitly cast the comparison into a binary collation. But when the strings can be very long, it can result in a huge performance improvement to explicitly specify a binary collation for string comparison.

Note that string comparison in COLLATE Latin1_General_BIN and string comparison in COLLATE SQL_Latin1_General_CP1_CI_AS have different semantics. But there are many cases where you have control over your data, and the difference in comparison semantics does not matter (e.g. when you know your data is all in lower case).

Nothing is particularly new here. Just want to contribute some data points to the community.

No comments: