Tuesday, March 31, 2009

Gmail in 5 Indian Languages Hindi, Tamil, Telugu, Kannada and Malayalam

It's hard for me to imagine going without email for a day. It's such an easy and convenient way to communicate with my friends and family. However, there was one limitation that bothered me: my family members and friends who prefer to communicate in Hindi did not have an easy way to type and send email in their language of choice. I am extremely happy to announce the launch of a new feature in Gmail that makes it easy to type email in Indian languages.

When you compose a new mail in Gmail, you should now see an icon with an Indian character, as the screenshot below shows. This feature is enabled by default for Gmail users in India. If you do not see it enabled by default, you will need to go to the "Settings" page and enable this option in the "Language" section.

When you click the Indian languages icon, you can type words the way they sound in English and Gmail will automatically convert the word to its Indian local language equivalent. For example, if a Hindi speaker types "namaste" we will transliterate this to "नमस्ते." Similarly, "vanakkam" in Tamil will become "வணக்கம்." We currently support five Indian languages -- Hindi, Tamil, Telugu, Kannada and Malayalam -- and you can select the language of your choice from the drop-down list next to the icon.

We built this new feature using Google's transliteration technology, which is also available on Google India Labs, Orkut, Blogger and iGoogle. I hope you find this feature useful for communicating with those of your friends and family who prefer to write in their native language, and it will be available soon to businesses and schools using Google Apps. Now back to replying to all those Hindi emails I got from my family and friends today!

Sunday, March 29, 2009

How to Minimize SQL Server Blocking

By default, blocking locks do not time out. The waiting connection waits until the lock is released, and the block is over. If you like, you can set a lock time-out so that a connection does not wait indefinitely for the blocking lock to be released. This is accomplished using the LOCK_TIMEOUT setting.

When the LOCK_TIMEOUT setting is used to set a maximum amount of time that a connection will wait for a blocking lock to go away, the connection holding the lock that is causing the blocking problem is not affected, but the waiting connection is halted and receives an error message. When this happens, error message 1222, "Lock request time-out period exceeded," is sent to the application.

This means that the application needs to include the appropriate error-handling code to deal with this situation and take the appropriate action, which includes rolling back the transaction. If the application does not know how to deal with this error message, the transaction is not rolled back automatically, and the application may continue as if nothing had happened. Because of this, you should not use the LOCK_TIMEOUT setting unless the application(s) that will be affected by it know what to do when they receive this message from SQL Server.

The syntax for SET LOCK_TIMEOUT is:

SET LOCK_TIMEOUT timeout_period

Where timeout_period is the number of milliseconds that a connection waits for a blocking lock to go away before an error is returned from SQL Server to the application. A value of -1 is the default, which means to wait indefinitely. A value of 0 tells SQL Server not to wait at all, and to return the error immediately.

This setting applies on a per-connection basis, and stays with the connection until the connection is broken or a new SET LOCK_TIMEOUT command is issued. [7.0, 2000]
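
A minimal sketch of how a connection might use this setting and handle the resulting error (the table, column, and five-second time-out are placeholders, not part of the original tip):

SET LOCK_TIMEOUT 5000  -- wait at most 5,000 ms for a blocking lock

BEGIN TRANSACTION

UPDATE SomeTable
SET SomeColumn = SomeColumn + 1
WHERE SomeKey = 42

IF @@ERROR = 1222  -- lock request time-out period exceeded
    ROLLBACK TRANSACTION
ELSE
    COMMIT TRANSACTION
GO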

A view runs on our development server in 30 seconds, but it takes over 10 minutes on the production server. Why?

Problem
There are two SQL Servers, both with virtually the same hardware. One is a development server, where we create and test Transact-SQL code, and the other is the production SQL Server. Here's my problem: when a particular view runs on the development server, it takes about 30 seconds, but when the identical view runs on the production server, it takes over 10 minutes. The databases on both servers are almost identical. We periodically copy the production database to the development server so that we are working with the same data and the same indexes. What could be causing the difference in time for this particular view to run?

Solution
Assuming that the servers and the databases are virtually the same, and the statistics have been updated on both of them, here's what I suggest you check. First, do an Execution Plan of the view that is causing the problem on the development server. Find out if this particular view is using parallelism as part of the execution plan. I am guessing that this is the case.

Now, do an Execution Plan of the view on the production server. Find out if this particular view is using parallelism as part of the execution plan. My guess is that it is not.

If I am right so far, then check the SQL Server parallelism setting for each server. I bet that it is set to "Use All Available CPUs" on the development server, and set to use only one CPU on the production server. If I am right, then set the production server to "Use All Available CPUs," and the performance difference of the same view on the two servers should be resolved; performance should be virtually identical on both servers.

By default, SQL Server is set to "Use All Available CPUs" on a server. If this is not your current setting, then someone has changed it.

Essentially, this setting tells SQL Server whether to attempt to execute queries in parallel using multiple processors. If the default "Use All Available CPUs" option is selected, then SQL Server will attempt to run queries in parallel, using multiple CPUs. This can sometimes result in dramatic differences in a query's performance. But if this option is turned off, then SQL Server will only use one CPU to execute a query, which can be much slower.
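
As a hedged sketch of how to check this setting from Transact-SQL, the "max degree of parallelism" option in sp_configure corresponds to this choice (0 means use all available CPUs; the value 1 in the last statement is only an example):

EXEC sp_configure 'show advanced options', 1
RECONFIGURE
GO

-- Show the current setting (a run value of 0 = use all available CPUs)
EXEC sp_configure 'max degree of parallelism'
GO

-- Example only: limit queries to a single CPU
EXEC sp_configure 'max degree of parallelism', 1
RECONFIGURE
GO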

Keep in mind that even if SQL Server is set to use parallelism, it may not always use it. If the server is very busy, parallelism may not be used, but when the server is less busy, it may be. This means that sometimes the view will be fast, and other times slow, depending on whether parallelism is being used.

On the other hand, sometimes parallelism can cause queries to take longer to execute than if parallelism is not used. In these cases, you can use a Query Hint to turn off parallelism for the odd-acting query.
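
A minimal sketch of such a hint, using the Northwind sample database purely as an illustration:

USE Northwind
SELECT ProductID, ProductName, UnitPrice
FROM Products
WHERE UnitPrice > 5
OPTION (MAXDOP 1)  -- run this query on a single CPU, regardless of the server setting
GO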

SQL Server Transact-SQL DML

Generally, it is better to perform multiple UPDATEs on records in one fell swoop (using one query), instead of running the UPDATE statement multiple times (using multiple queries).

For example, you could accomplish this two different ways:

USE Northwind
UPDATE Products
SET UnitPrice = UnitPrice * 1.06
WHERE UnitPrice > 5

GO

USE Northwind
UPDATE Products
SET UnitPrice = ROUND(UnitPrice, 2)
WHERE UnitPrice > 5

GO

Or

USE Northwind
UPDATE Products
SET UnitPrice = ROUND(UnitPrice * 1.06, 2)
WHERE UnitPrice > 5

GO

As is obvious from this example, the first option requires two queries to accomplish the same task as the second query. Running one query instead of two or more usually produces the best performance. [6.5, 7.0, 2000, 2005]

Wednesday, March 25, 2009

How many logical and physical processors do you have?

SELECT cpu_count AS [Logical CPUs],
cpu_count / hyperthread_ratio AS [Physical CPUs]
FROM sys.dm_os_sys_info

Sorting different dateformats correctly

DECLARE @Stats TABLE
(
SomeDate DATETIME
)

INSERT @Stats
SELECT 20000 + ABS(CHECKSUM(NEWID())) % 30000
FROM master..spt_values

DECLARE @Style INT

SET @Style = 100

WHILE @Style <= 113
BEGIN
-- Orders by ISO format but displays according to @Style parameter
SELECT TOP 10 @Style AS Style,
CONVERT(VARCHAR(40), SomeDate, @Style) AS SomeDate
FROM @Stats
GROUP BY CONVERT(VARCHAR(40), SomeDate, @Style),
CONVERT(VARCHAR(8), SomeDate, 112)
ORDER BY CONVERT(VARCHAR(8), SomeDate, 112) DESC

SET @Style = @Style + 1
END

The general idea is to group by both the ISO format and the style you want, then sort by the ISO format and display the other format.
The reason this works is that the two expressions in the GROUP BY clause are deterministic.

Performance Tuning for SQL Server Backup and Restore

Depending on the size of your database, select the backup method that is the best tradeoff between backup and restore time. For example, full backups take the longest to perform, but are the fastest to restore. Differential backups are overall faster than full backups, but take longer to restore. Incremental (transaction log) backups are the fastest, but are generally the slowest to restore. [7.0, 2000, 2005]
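
For reference, a minimal sketch of the three backup types described above (the database name and file paths are placeholders):

-- Full backup: longest to perform, fastest to restore
BACKUP DATABASE AdventureWorks
TO DISK = 'D:\Backups\AdventureWorks_Full.bak'
GO

-- Differential backup: only the changes since the last full backup
BACKUP DATABASE AdventureWorks
TO DISK = 'D:\Backups\AdventureWorks_Diff.bak'
WITH DIFFERENTIAL
GO

-- Transaction log backup: fastest to perform, slowest to restore in sequence
BACKUP LOG AdventureWorks
TO DISK = 'D:\Backups\AdventureWorks_Log.trn'
GO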

*****

If you are backing up your databases directly over a network (not really recommended as it hurts performance), one way to boost throughput is to perform the backup over a dedicated network which is devoted to backups and restores only. All devices should be on the same high-speed switch. Avoid going over a router when backing up over a network, as they can greatly slow down backup speed. [6.5, 7.0, 2000, 2005]

Friday, March 13, 2009

SQL Server Transact-SQL DML

Performing UPDATEs takes extra resources in SQL Server. When performing an UPDATE, try to follow as many of the recommendations below as you can in order to reduce the amount of resources required. The more of the following suggestions you can apply, the faster the UPDATE will perform.

If you are updating a column of a row that has a unique index, try to update only one row at a time.

Try not to change the value of a column that is also the primary key.

When updating VARCHAR columns, try to replace the contents with contents of the same length.

Try to minimize updates to tables that have UPDATE triggers.

Try to avoid updating columns that will be replicated to other databases.

Try to avoid updating heavily indexed columns.

Try to avoid UPDATE statements whose WHERE clause references the column being updated.

Of course, you may have very little choice when updating your data, but at least give the above suggestions a thought.

Thursday, March 12, 2009

How to Perform Multiple lookups to the same table


In a SQL Server Integration Services (SSIS) package, when a lookup is performed against a table that was populated by a previous task in the same package, the lookup can fail with a "could not match" error.

See the failed package layout. The insert task adds two records, and the lookup then uses those values to get the id. However, the package fails, indicating that the records are not found even though they are in the table.

This happens because of the cache setting on the lookup. There are three caching types: full, partial, and no caching.

Full pre-caching: the complete reference data set is read before the input is processed. This is the default caching type.

Partial caching: the Lookup transformation specifies the size of the cache that is loaded with reference data. This option is available only with connections that support keyed access.

No caching: the reference data set is queried for each row in the rowset.

Because the default caching type is full, the lookup data is cached before the data flow starts in order to improve performance. Since the rows are inserted after the cache has been built, the lookup fails. Therefore, for this kind of implementation you need to set the caching type to No Caching.

Wednesday, March 11, 2009

Action plan: how to get listed in Google's search results for many keywords

Targeting several keywords on your website is crucial if you want to get high rankings on Google. The more keywords related to a specific topic that can be found on your website, the more likely it is that Google will find your website relevant to that topic.

There are good ways and bad ways to target multiple keywords. If you choose the wrong way, search engines will think that you are a spammer. If you choose the right way, your website will get higher rankings. The following three steps will help you to get high rankings for as many keywords as possible.

Step 1: Remove stuffed keywords from your web pages
Some webmasters think that the best way to optimize a website for multiple keywords is to add them on every web page and separate them with a comma.

These websites usually have extremely long meta tags with dozens if not hundreds of keywords. You won't get high rankings on Google with that method. Adding dozens of keywords to a web page is called keyword stuffing, and all search engines consider this spam. You actually risk your search engine rankings if you use this method on your web pages.

To do: Check if your web pages contain long keyword lists. If yes, remove them from your pages.

Step 2: Optimize different pages of your website for different keywords
If you add many keywords to the same web page, then the web page will be somewhat relevant to many keywords. If you optimize a web page for one keyword/keyphrase, then your web page will be very relevant to this keyword.

It is much easier to get high rankings if the web page is very relevant to the searched keyword. For that reason, you should optimize as many pages of your website for as many keywords as possible.

The keywords for which you optimize your web pages should be related. If you have a shoe shop, optimize different pages of your website for the keywords sports shoes, sneakers, sandals and boots. By having optimized web pages for these keywords, you show Google that your website is relevant to the topic shoes.

To do: Find the best keywords for your website and then optimize different pages of your website for different keywords.

Step 3: Add keyword modifiers and combine these variations
In step two, you have optimized your web pages for keywords such as sneakers and sandals. The next step is to add modifiers to these keywords on your optimized pages. For example, the sneakers page could include these keyword variations: sneakers shop, nike sneakers, cool sneakers, etc.

It's usually a good idea to combine all of these keyword variations in one sentence (for example for your web page title): Buy cool Nike sneakers in our sneakers shop in London. With this single sentence, your web page can be found for many keyword combinations that are included in this sentence.

To do: Find keyword modifiers and add different keyword variations to your optimized pages.

If you optimize many pages of your website for different but related keywords, and if you add variations of each keyword to each page, then the content of your website will be perfectly optimized for high rankings on Google.

If you combine optimized content with good inbound links, your website will get the best possible rankings on Google and other search engines.

Nine factors that affect your website usability

Optimizing web pages for search engines does not mean creating special pages for search engines. Optimizing web pages for search engines is often the same as optimizing web pages for web surfers.

If you do it correctly, your website will be attractive to both web surfers and search engine spiders. The following list shows nine factors that can improve the usability of your website as well as your search engine rankings.

1. You should have fast loading web pages
Usability: Web surfers don't want to wait for web pages.
Search engine optimization: Search engines can index your web pages more easily.

2. Your web pages should be easy to read
Usability: It's easier for web surfers to read your web pages.
Search engine optimization: Near-white text on a white background and tiny text are considered spam by most search engines.

3. The contents of your web pages should be clearly arranged
Usability: Clear headings, paragraphs and bullet lists make your web pages easier to read.
Search engine optimization: Clear headings, paragraphs and bullet lists make it easier for search engines to find the topic of your web pages.

4. Your web page images should use the IMG ALT attribute
Usability: Web surfers with images turned off and visually impaired visitors will be able to see the content of your images.
Search engine optimization: Search engines cannot index the content of your images but they can index the content of the IMG ALT attribute.

5. You should use custom 404 not found error pages
Usability: If your 404 not found page contains links to other pages of your website or a search form then people might remain on your website.
Search engine optimization: Proper 404 error pages make sure that search engines index the right pages of your website.

6. Your website should be easy to navigate
Usability: Clear and concise navigation links that are easy to find help your website visitors to find content on your site.
Search engine optimization: Clear and concise navigation links that contain your keywords make it easy for search engines to index all of your web pages.

7. Important content is above the fold
Usability: Web surfers with small computer screens can quickly see what your web page is about.
Search engine optimization: The sooner your important content appears in the HTML code of your web pages, the more likely it is that it will be indexed by search engines.

8. Your web page titles are explanatory
Usability: If web surfers bookmark your web pages, a clear web page title will help them to find it again.
Search engine optimization: The web page title is one of the most important SEO elements. It should contain your keywords and it should look attractive so that web surfers click on it when they see your web page title in the search results.

9. The URLs of your web pages are meaningful and self-explanatory
Usability: It's much easier to remember a web page like www.example.com/support than a web page like www.example.com/123123-werwc.php?2342234.
Search engine optimization: If your URLs contain your keywords, this can have a positive effect on your search engine rankings. Dynamic URLs with many variables can lead to problems with search engine spiders.

There are many more factors that influence the position of your website in Google's search results.

Tuesday, March 10, 2009

What's new in SQL Server 2008 Reporting Services?

With the release of SQL Server 2008, Reporting Services (SSRS) reached its third version. While the previous version contained many new features, this latest release of SSRS is easier to use and an even more powerful enterprise-level reporting tool. This article will discuss the most significant and most useful new features that make SQL Server 2008 Reporting Services a worthwhile upgrade.

IIS is no longer required
The previous two releases relied on IIS to host the RS Web service and the Report Manager Web application. Many organizations have made a rule of not installing IIS on the same machine as SQL Server for security reasons. This dependency on IIS therefore became a showstopper for using SSRS for many potential users. Based on user feedback, Microsoft re-architected SSRS 2008 and removed this dependency.

SSRS now handles HTTP via HTTP.SYS, a kernel-mode driver in the operating system that intercepts and handles HTTP requests just like IIS does. If this sounds familiar, that's because it's the same mechanism used by the Native XML Web Services in SQL Server 2005 – a feature that, interestingly, was deprecated in SQL Server 2008. Microsoft also claims that this implementation provides better performance and scalability. The Reporting Services Configuration tool has been updated to provide management capabilities for the Report Manager and Report Server services.

Better memory management
Users who have deployed SSRS in high-traffic environments may have noticed performance degradation when many concurrent users access reports with many pages. The reason for the slowness was that SSRS was holding rendered reports in memory, causing problems if memory demands became too high. SSRS now saves rendered pages to the file system to remove the load on server memory. For large reports, it also renders reports as the user is viewing them. This on-demand processing decreases the load on the server by not rendering pages the user might potentially not view at all. In addition to these memory enhancements, you can now also set a threshold on how much memory SQL Server Reporting Services should use.

Export into Microsoft Word
This feature has been requested pretty much since SSRS 2000 came out in 2004, mainly because there are more Word users than Excel users. For whatever reason, this feature did not make it to SQL Server 2005, but it's finally here and certainly a much welcome addition to all the other export options.

Improved charting components

Last year Microsoft acquired Dundas Data Visualization technology for SSRS. Dundas is a company specializing in developing powerful visual components for developers, such as chart, gauge, map and calendar controls. Purchasing the code base for the chart, gauge and calendar controls allowed Microsoft to provide better integration of these components with SSRS and to control the components' future direction. Dundas components have always been visually stunning, so it's no surprise that the new chart component is like night and day compared with the chart control in previous versions of SSRS.

In my opinion, if you do a lot of charting in your reports, this feature alone justifies the effort to upgrade to SQL Server 2008 Reporting Services. The new chart region includes many additional chart types, such as cylinder, pyramid, radar and funnel. Other useful enhancements are interval labeling for avoiding label collisions, the ability to combine small slices into a single slice in a pie chart, and more 2-D and 3-D visual effects.

Tablix data region
This new data region combines the functionality of both table and matrix data regions. This region allows you to combine multiple column groups and row groups, essentially allowing you to create reports that have both dynamic row and dynamic column output. You can now more easily create nested and recursive groups and make them adjacent to each other. There is a lot more to the Tablix data region than I can describe in this space. For more details, read the Understanding the Tablix Data Region section of Microsoft Books Online.

Enhanced CSV export

Comma-separated value (CSV) export can be very useful for automating report processing if you are interested in extracting data elements from the report. SSRS CSV export in previous versions included both the data and the layout in the export. But the inclusion of report layout elements in the CSV output created additional work and complications. The new output format is much cleaner and easier to use for automating programmatic processing.

There are plenty of significant reasons for you to give SSRS 2008 serious consideration. And remember, just like Integration Services, SSRS is a standalone tool you can install on a dedicated server. This means, if your organization is not ready to migrate to SQL Server 2008, you can upgrade just your reporting server and use these new features, using your 2000 or 2005 databases as data sources.

History of SQL Server Integration Services

Integration Services is the successor to Data Transformation Services (DTS). DTS had humble beginnings. It was started on a shoestring budget with very few resources. Its first incarnation was a proof-of-concept transformation, which later became known as the data pump. The proof of concept caught the attention of some folks around Microsoft, and it was given some funding.

The first release of DTS shipped with SQL Server 7.0 to receptive users. The alternatives at the time were either difficult to work with, expensive, or both. Many DBAs were forced to write custom transformation software, which was inflexible and difficult to maintain. Some tools had limitations, such as the need for source and destination schemas to match exactly, direct dependence on a particular database product, and/or no transformation capabilities. Many wrote custom parsing and transformation applications. For example, many companies are only now converting from hand-coded flat file parsers, SQL scripts, and transformation code to a standard platform such as Integration Services.

The first release of DTS addressed several of these issues and simplified life for a lot of people. By using OLEDB for its data access layer, DTS could access various data sources with little or no custom coding. DTS was also affordable because it shipped "in the box" with SQL Server. Users had access to all the power of more expensive products, yet incurred no additional cost for their ETL tools. This was obviously a benefit to IT shops trying to stretch their budgets. DTS was a flexible product that was easy to use. There were also a number of standard tasks in the box, including the Transform Data, Execute Process, Active X Script, Execute SQL, and Bulk Insert Tasks.

SQL Server 2000 added even more functionality by adding more tasks. The Execute Package, FTP, and MSMQ Tasks added incremental improvements across the product. However, users experienced some frustration with DTS when attempting to work with large datasets and some of the other limitations inherent in a script-based tool. The time was ripe to create a truly enterprise-ready integration tool.

In 2000, the SQL Server team decided to make a substantial investment in the ETL and integration space and brought together some talented folks who formulated the ideas behind the Data Flow Task and the next version of Integration Services. Over a period of five years, the development time frame for SQL Server 2005, the DTS team completely redesigned and rewrote DTS to become Integration Services 2005.

Integration Services 2008 is the next incarnation of that release. While not as revolutionary a release as in 2005, Integration Services 2008 brings some new capabilities and incremental improvements across the product.

What Is SQL Server Integration Services?

Depending on whom you ask, you might get different answers to that question ranging from descriptions such as a data import/export wizard, to an ETL tool, to a control flow engine, to an application platform, or to a high-performance data transformation pipeline. All are correct because Integration Services is a set of utilities, applications, designers, components, and services all wrapped up into one powerful software application suite. SQL Server Integration Services (SSIS) is many things to many people.

Data Import/Export Wizard
One of the most popular features of Integration Services is the Import/Export Wizard, which makes it easy to move data from a source location such as a flat file or database table to a flat file, table, or other destination. The Import/Export Wizard was the first utility developed back in the SQL Server 7.0 time frame and continues today as an important utility in the database administrator's (DBA) toolbox.

ETL tool
ETL is an acronym for Extract, Transform, and Load and describes the processes that take place in data warehousing environments for extracting data from source transaction systems; transforming, cleaning, deduplicating, and conforming the data; and finally loading it into cubes or other analysis destinations. Although Data Transformation Services (DTS), Integration Services' predecessor application, was considered a valuable tool for doing ETL, Integration Services is where true Enterprise ETL became available in SQL Server.

Control flow engine
The processes involved in moving data from location to location and transforming it along the way are not restricted to only processing data. Integration Services provides a control flow for performing work that is tangentially related to the actual processing that happens in data flow, including downloading and renaming files, dropping and creating tables, rebuilding indexes, performing backups, and any other number of tasks. Integration Services provides a full-featured control flow to support such activities.

Application platform
Developers can create applications that use Integration Services as a platform. As a developer, you can embed the Integration Services engines and components within your applications using the provided object models.

High performance data transformation data pipeline
That's a mouthful and really incorporates two ideas: high performance and data pipelining. The Data Flow Task is a high-performance tool because you can use it to perform complex data transformations on very large datasets for incredibly performant processing. The pipeline concept means that you can process data from multiple heterogeneous data sources, through multiple parallel sequential transformations, into multiple heterogeneous data destinations, making it possible to process data found in differing formats and on differing media in one common "sandbox" location.

Performance Tuning SQL Server Backup and Restore

SQL Server backup and restore performance is often a greater issue than many DBAs realize. This is because the time it takes to back up and restore a database plays a very large part in how large a single SQL Server database can be. For example, if it takes more than 24 hours to back up or restore a database, obviously this presents some problems.
What can sometimes happen is that a database may be of reasonable size now, given the current server hardware, and backups and restores may take a reasonable amount of time. But as time goes by, the database grows, and at some point the amount of time needed to back up and restore the database becomes too long for the backup and restore window available. The moral of this story: keep a close eye on backup and restore times, and factor them in when projecting future hardware needs for your SQL Server databases. [6.5, 7.0, 2000, 2005]
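
One way to keep that close eye on backup times is to trend the history SQL Server already records in msdb. A minimal sketch (the 30-day window is arbitrary):

-- How long have recent backups been taking?
SELECT database_name,
type,  -- D = full, I = differential, L = log
backup_start_date,
DATEDIFF(SECOND, backup_start_date, backup_finish_date) AS duration_seconds,
backup_size
FROM msdb.dbo.backupset
WHERE backup_start_date > DATEADD(DAY, -30, GETDATE())
ORDER BY backup_start_date DESC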

Saturday, March 7, 2009

Reasons Why You May Not Want to Use a Heap

Here are some reasons why heaps aren't usually a good choice when designing a database. Now, I didn't say that heaps are always bad, but in most cases I think they should be avoided (read the list to see why).

As I compiled the list, in some cases I had the opportunity to verify that the items are true, but in other cases I have not had the time to verify them. I would like your input on the list. Is there anything on the list that is not true, or only partially true (under some circumstances, but not others), and what other reasons are there to avoid using heaps that I have left out?

I look forward to your feedback.

If non-clustered indexes are not added to a heap, then all queries against a heap will require table scans to retrieve the requested data. If the heap is large, then these queries will be very resource intensive and hurt SQL Server's overall performance.
Since the data in a heap is unordered, performing a table scan on a heap can cause a lot of extra I/O activity because inefficient random reads, not efficient sequential reads, are more the norm.
While a non-clustered index can be added to a heap to speed up some queries, in almost all cases, to make use of a non-clustered index, the use of a bookmark lookup is required. A bookmark lookup means that once the record(s) to be returned by the query are identified in the non-clustered index, additional reads (the bookmark lookup) must be made of the related rows in the heap, so that all of the data requested by the query is returned. This is not very I/O efficient, especially if many rows are returned. At some point, it may be faster for SQL Server to do a table scan than it is to use a non-clustered index when returning many rows.
Heaps cannot be replicated using SQL Server replication.
If you want to create an XML index on an XML data column, a clustered index must exist on the table.
If you want to create a spatial index on a spatial data column (GEOMETRY or GEOGRAPHY), a clustered index must exist on that table.
If a heap has a non-clustered index on it (as the primary key), and data is inserted into the table, two writes have to occur: one write for inserting the row, and one write for updating the non-clustered index. On the other hand, if a table has a clustered index as the primary key, inserts take only one write, not two, because a clustered index and its data are one and the same. Because of this, it is faster to insert rows into a table with a clustered index as the primary key than it is to insert the same data into a heap that has a non-clustered index as its primary key. This is true whether or not the primary key is monotonically increasing.
When data is updated in a heap, and the updated row is larger than the old row and can't fit into the old space, a forwarding record is inserted into the original location that points to the new location of the row. If this happens a lot, then a lot of space is wasted in the database maintaining the forwarding records. It also contributes to additional I/O activity, as both the pointer and the row have to be read.
Even if the data updated in a heap is not larger than the old row (the updated data is smaller than or the same size as the original data), updating a heap with a non-clustered primary key is slower than updating the same table that has a clustered index as the primary key. This is because updating a table with a clustered index is less write intensive than updating a heap with a non-clustered index as its primary key.
If a row is deleted from a heap with a non-clustered index as its primary key, it is slower than deleting the same row from the same table with a clustered index as its primary key. This is because it takes more I/O to perform this task on a heap than on a clustered index.
When data is deleted from a heap, the data on the page is not compressed (reclaimed). And if all of the rows on a heap page are deleted, often the entire page cannot be reclaimed. This not only wastes space, it contributes to fragmentation of the data pages within a database.
If you take two identical tables, one that is a heap with a non-clustered index as its primary key, and a table that has a clustered index as its primary key, the heap with the non-clustered index will be substantially larger, wasting valuable space and increasing disk I/O.
The ALTER INDEX rebuild and reorganize options cannot be used to defragment and reclaim space in a heap (though they can be used to defragment non-clustered indexes on a heap). If you want to defragment a heap, you have three options: 1) create a clustered index on the heap, then drop the clustered index; 2) use SELECT INTO to copy the old table to a new table; or 3) use BCP or SSIS to move the data from the old table to a new table. A sketch of the first option follows below.
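
A minimal sketch of option 1 (the table, column, and index names are placeholders; SQL Server 2005 syntax):

-- Build a clustered index to rewrite the heap in order, then drop it
CREATE CLUSTERED INDEX IX_Defrag_Temp ON dbo.MyHeapTable (SomeColumn)
GO
DROP INDEX IX_Defrag_Temp ON dbo.MyHeapTable
GO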

Thursday, March 5, 2009

Importing and exporting bulk data with SQL Server's bcp utility

The Bulk Copy Program (bcp) utility in Microsoft SQL Server enables database administrators to import bulk data into a table or export it from a table into a file. It also supports several options that define how data will be exported, where it will be imported and which data will be loaded.
This tip discusses several examples of bcp commands to bulk-copy data in and out of SQL Server tables. The examples have been tested on SQL Server 2005 and SQL Server 2008, and for all examples I used the AdventureWorks sample database.

Exporting data with the bcp utility

One of the simplest operations you can perform with the bcp utility is to bulk-copy data out of a SQL Server table into a text file. You run bcp commands at a Windows command prompt. The following command, for example, copies the data from the Sales.vSalesPerson view in the AdventureWorks database to the C:\Data\SalesPerson.txt file:

bcp AdventureWorks.Sales.vSalesPerson out C:\Data\SalesPerson.txt -c -T

As you can see, the bcp command begins with the utility name, followed by the fully qualified name of the table (database.schema.table). The out keyword is next, which tells the bcp utility that the data will be exported from the table. The path and file name of the target text file follows the out keyword. Note that the command examples in this tip might wrap to multiple lines. All the examples, however, should be run as a single command.

In addition to these basic arguments, the bcp utility supports several switches that control the utility's behavior. In the example above, the -c switch indicates that all data is treated as character data, regardless of how it is stored in the source table. If you do not specify the -c switch or another type-related switch, you will be prompted to specify the type for each column after you enter the bcp command. You may notice that bcp command switches are case sensitive.

The next switch in the above example is -T, which tells the bcp utility to use a trusted connection to connect to the SQL Server instance. If you do not specify -T, you must provide your user name (the -U switch) and your password (the -P switch), or you will be prompted for this information.

Because no instance is specified in the preceding example, the utility uses the default instance on the local machine. To specify a SQL Server instance, use the -S switch, followed by the server name, as in the following example:

bcp AdventureWorks.Sales.vSalesPerson out C:\Data\SalesPerson.txt -c -T -S Server01

The bcp utility will now connect to the default instance on Server01. If you want to connect to an instance other than the default, you must specify the instance name along with the server name, as in Server01\SqlSrv.

By default, the bcp utility uses tabs to separate fields in the target file. You can, however, override that behavior by using the -t switch, as in the following example:

bcp AdventureWorks.Sales.vSalesPerson out C:\Data\SalesPerson.csv -c -T -t,

In this case, a comma is specified after the -t switch, which means that the data fields will now be separated by commas. This allows you to save the data to a .csv file, as I've done here, which can be handy if you want to view the data in a Microsoft Excel file.

These examples have been limited to exporting data from a table. But you can also use the bcp command to run a Transact-SQL query and export the query results. The following bcp command, for example, includes a SELECT statement that retrieves only the SalesPersonID, FirstName, and LastName columns from the vSalesPerson view:

bcp "SELECT SalesPersonID, FirstName, LastName FROM AdventureWorks.Sales.vSalesPerson"
queryout C:\Data\SalesPerson.csv -c -T -t,

In this scenario, the query, enclosed in quotes, is passed into the bcp command instead of the table name. In addition, the out keyword has been replaced with queryout. But the rest of the command is the same as the preceding example. As a result, the SalesPerson.csv file will now contain only the three specified columns. You can also refine your query further: You can include, for example, a WHERE clause that qualifies which rows are returned from the source table.
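
As a hedged illustration of that refinement (the SalesPersonID cutoff value is arbitrary), such a command might look like this:

bcp "SELECT SalesPersonID, FirstName, LastName FROM AdventureWorks.Sales.vSalesPerson WHERE SalesPersonID > 280"
queryout C:\Data\SalesPerson.csv -c -T -t,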

Importing data with the bcp utility

The bcp utility makes importing data as simple as exporting data. To run the examples in this section, first run the following T-SQL script to create the SalesPeople table in the AdventureWorks database:

USE AdventureWorks
GO
IF OBJECT_ID (N'SalesPeople', N'U') IS NOT NULL
DROP TABLE dbo.SalesPeople
GO
CREATE TABLE dbo.SalesPeople (
SalesPersonID INT IDENTITY PRIMARY KEY,
FirstName NVARCHAR(50) NOT NULL,
LastName NVARCHAR(50) NOT NULL
)

To import data, you need a source file from which to copy the data. For instance, the following command uses the recently created SalesPerson.csv file to load data into the SalesPeople table:

bcp AdventureWorks.dbo.SalesPeople in C:\Data\SalesPerson.csv -c -T -t,

First, you must specify the target table, followed by the in keyword that replaces out or queryout in this scenario. Next, you must specify the path and file name of the source file, followed by any applicable switches.

When you run this command and view the results, note that while the source file includes the salesperson IDs, these values are not inserted into the SalesPersonID column. The column is defined as an IDENTITY column, so the source data is ignored. To retain the original values, you must add the -E switch to the command, as in the following example:

bcp AdventureWorks.dbo.SalesPeople in C:\Data\SalesPerson.csv -c -T -t, -E

Your table will now include the desired data.

Using a format file

When importing or exporting data, you may find that the structure of your source data does not match the structure of the target data. The columns in a text file, for example, might not be in the same order as those in your target table, or the number of columns may be different. You can address this issue by creating a format file that maps the structures of the source and destination. Let's look at an example to demonstrate how this works.

Suppose you use the following command to export data from the vSalesPerson view to the SalesPeople.txt file:

bcp "SELECT LastName, FirstName, SalesPersonID FROM AdventureWorks.Sales.vSalesPerson"
queryout C:\Data\SalesPeople.txt -c -T -t,

This command uses the same arguments as those in previous examples. Note, however, the order in which the columns are retrieved from the view: LastName, then FirstName, and finally SalesPersonID.

Now suppose you plan to use this file to import data into the SalesPeople table. The columns in the table, however, are defined in a different order from that in the text file. To address this issue, you can create a format file to map the columns from the source to the destination. The following command demonstrates how to create a format file:

bcp AdventureWorks.dbo.SalesPeople format nul -f C:\Data\SalesPeople.fmt -c -T -t,

As previous examples show, the command first specifies the target table. This time, however, the table name is followed by the keywords format nul, which indicate that the bcp utility should create a format file. The -f argument is then used to specify a path and file name for the format file, followed by several switches. As a result, when you run this command, a format file is generated that contains the structure of the SalesPeople table.

The following data shows the content of the SalesPeople.fmt format file that is generated by the above command:

10.0
3
1 SQLCHAR 0 12 "," 1 SalesPersonID ""
2 SQLCHAR 0 100 "," 2 FirstName SQL_Latin1_General_CP1_CI_AS
3 SQLCHAR 0 100 "\r\n" 3 LastName SQL_Latin1_General_CP1_CI_AS


The first row of the file (10.0) identifies the version of bcp that is being used. The second row (3) identifies the number of columns in the table, and the next three rows provide information about the columns themselves:


The first field indicates the order in which the columns should appear in the source document.


The second field shows the source file data type for each column. Because I specified the -c switch when I generated the file, a character data type is used for all fields as they are extracted from the data file. SQL Server will then convert this data into the correct types as the data is inserted.


The third field shows the prefix length for the field, which is used by SQL Server to provide the most compact file storage. A value of 0 is used automatically if you specify the -c switch when creating the format file.


The fourth field indicates the number of bytes for the data type of a particular field.


The fifth field shows how columns and rows are terminated. Because the -t switch was used when creating the format file, the field values of the source file must be terminated by commas.


The sixth field maps to the order in which the columns appear in the SQL Server table.


The seventh and final field provides the collation information for character columns in the SQL server table.

To use the format file to import data into the SalesPeople table, we must modify the file as follows:

10.0
3
1 SQLCHAR 0 100 "," 3 LastName SQL_Latin1_General_CP1_CI_AS
2 SQLCHAR 0 100 "," 2 FirstName SQL_Latin1_General_CP1_CI_AS
3 SQLCHAR 0 12 "\r\n" 1 SalesPersonID ""


As you can see, the order of the rows has been modified to reflect the order in which the columns appear in the source file (SalesPeople.txt). The SalesPersonID column is now listed last and is terminated with \r\n. The LastName column is now listed first and is terminated with a comma.

After you modify and save the format file, it is ready to use in your bcp commands. The following example demonstrates how to call the format file:

bcp AdventureWorks.dbo.SalesPeople in C:\Data\SalesPeople.txt -f C:\Data\SalesPeople.fmt -T

Note that, when you import data from the SalesPeople.txt file, you must also use the -f switch to call the format file. Also notice that you no longer need to include the -t and -c switches because the information is now included in the format file.