Thursday, October 23, 2008

Super Sizing Columns in SQL Server

SQL Server 2005, columns can also be Super Sized due to the introduction of the MAX Specifier. In previous versions of SQL Server, if an application allowed for the storage of string data that would exceed 8000 bytes, the only option available was to use the TEXT or NTEXT data type. By using either one of these data types, common operators were unable to be used, meaning that tasks such as searching and updating data was a complex process. With the introduction of the MAX Specifier it is now possible to work with large objects in SQL Server in ways that were previously not possible.

With the introduction of the MAX Specifier there is no longer a need to perform the complex manipulation of large objects that requires the use of the TEXTPTR operator to determine a the pointer to the value before using a set of specialized commands. The following examples show the complexity involved with working with the TEXT and NTEXT data types as a result of having to use a different set of operators. The example below illustrates how to find the first 10 characters for the pr_info column in the pub_info table:

DECLARE @ptrval varbinary(16);

SELECT @ptrval = TEXTPTR(pr_info)
FROM pubs.dbo.pub_info
WHERE pub_id = '0736'

READTEXT pub_info.pr_info @ptrval 0 10;

Whereas the following example illustrates how to update the first 10 characters for the pr_info column which is a TEXT data type:

DECLARE @ptrval binary(16)

SELECT @ptrval = TEXTPTR(pr_info)
FROM pubs.dbo.pub_info
WHERE pub_id = '0736'

UPDATETEXT pub_info.pr_info @ptrval 0 4 'This'

The introduction of the MAX Specifier in SQL Server 2005 provides the ability for variable length columns that previously were limited to 8000 bytes to store large amounts of data. The introduction also means that the NTEXT, TEXT and IMAGE data types are candidates for not being supported in future versions of SQL Server. Hence, it is recommended that the MAX Specifier is used for the storage of large amounts of data.

The MAX Specifier increases the maximum storage capabilities of the VARCHAR and VARINARY data types up to 2^31-1 bytes and up to 2^30-1 bytes for NVARCHAR. VARCHAR(MAX), NVARCHAR(MAX) and VARBINARY(MAX) are collectively called large-value data types. The MAX Specifier is in effect, a very large variable length column. Although the maximum size of the MAX specifier is approximately 2GB, the size is actually the maximum size that SQL Server supports. This means that in future versions of the product if the maximum size supported increases, the MAX specifier will automatically be able to support the size increase without any modifications. To use the MAX Specifier the word MAX is used in place of a size when you define a column or variable. In the following example the LargeColumn in the LargeDataType table is created using the VARCHAR data type and the MAX Specifier as the size.

CREATE TABLE dbo.LargeDataType
(
LargeColumn VARCHAR(MAX)
)

There are two ways in SQL Server that columns that are defined with the MAX Specifier may be stored. They can be stored either in a page with the other columns in a row or alternatively off-page. When the MAX Specifier is used for a column, SQL Server uses its own algorithms to determine whether to keep the value in line within the physical record or store the value externally to the base record and to keep track of the value by using a pointer. SQL Server will store the data as either a VARCHAR, NVARCHAR, VARBINARY or as a Large Object (LOB). If the length of the column is less than or equal to 8000 bytes, SQL Server will store the data in-page and where it is greater than 8000 bytes SQL Server will check the row size to determine the appropriate storage. If the row size is less than the size of a page (8060 bytes), the data values will be stored as in-row data whereas if the row size is greater than 8060 bytes, the data values are stored as LOB data with only a 16 byte pointer stored in the row. It is possible to override this default behaviour by using the new table option called large value type out of row so that columns defined with the MAX Specifier are always stored as a LOB. The following example illustrates how this option can be enabled by using the system Stored Procedure sp_tabeloption.

EXEC sp_tableoption
'dbo.LargeDataType', 'large value types out of row', 1

When the option is set to 1 (enabled) the data in columns that have been defined with the MAX specifier will always be stored out of row as a LOB with only a 16-byte text pointer stored in the row. Text pointers point to the root node of a tree built of internal pointers that map to the pages in which string fragments are actually stored.

The advantage of storing large-value data types in-row with the other columns in a table is that SQL Server can return a single row with only one I/O operation. If the bulk of SQL Server statements do not return large-value data type columns then the data should be stored out of row. This allows for a greater number of rows to be stored on a data page allowing a greater number of rows to be returned for each I/O operation.

Unlike TEXT and NTEXT data types that store the data off page, it does not matter where the data is stored for columns defined with the MAX Specifier. As no matter where the data is stored the column can be treated as a standard variable length data type. Hence, different operators do not need to be used and all of the standard operators can be used. This means that in SQL Server, there is now a unified programming model for working with regular types and large objects. The restrictions that previously existed for the use of TEXT and NTEXT as variables in Stored Procedures and Functions also no longer exists with large-value data types.

The following example illustrates just a few of the ways that standard operators can now be used with large-value data types.

String concatenation can now be used with large data types. The thing to note with this example is the use of the REPLICATE function. The REPLCIATE function returns a character expression of the same type as the supplied character expression. So if the supplied expressions is not CAST as a large data type the MAXIMUM length of the expression returned would be 8000 characters.

INSERT INTO dbo.LargeDataType(LargeColumn)
SELECT 'There is lots of data in this row ' +
REPLICATE(CAST('x' AS VARCHAR(MAX)), 100000)

Updates can be made directly to large data types without the need to use the UPDATETEXT operator:

UPDATE dbo.LargeDataType
SET LargeColumn = REPLACE(LargeColumn, 'lots', 'lots and lots')

Standard string operators such as SUBSTRING can now be used with large data types

SELECT SUBSTRING(LargeColumn, 10, 4)
FROM dbo.LargeDataType

The support for large data types is a valuable new addition to SQL Server. So if you are designing a new application that needs to store vales that are greater than 8000 bytes you should use the new large-value data types. As not only will the use of large-value data types assist by providing a unified programming model, it will also ensure that your application can take advantage of additional storage in future versions of SQL Server, allowing you to really Super Size your columns.

No comments: