Monday, November 3, 2008

SQL basics: Query multiple tables

It’s sometimes difficult to know which SQL syntax to use when combining data that spans multiple tables. I’ll discuss some of the more frequently used methods for consolidating queries on multiple tables into a single statement.

The sample queries in this article adhere to the SQL92 ISO standard. Not all database manufacturers follow this standard, and many have made enhancements that can yield unexpected results. If you’re uncertain about support for these concepts in your database, please refer to your manufacturer’s documentation.

SELECT
A simple SELECT statement is the most basic way to query multiple tables. You can call more than one table in the FROM clause to combine results from multiple tables. Here’s an example of how this works:
SELECT table1.column1, table2.column2 FROM table1, table2 WHERE table1.column1 = table2.column1;

In this example, I used dot notation (table1.column1) to specify which table the column came from. If the column in question only appears in one of the referenced tables, you don’t need to include the fully qualified name, but it may be useful to do so for readability.

Tables are separated in the FROM clause by commas. You can include as many tables as needed, although some databases have a limit to what they can efficiently handle before introducing a formal JOIN statement, which is described below.

This syntax is, in effect, a simple INNER JOIN. Some databases treat it exactly the same as an explicit JOIN. The WHERE clause tells the database which fields to correlate, and it returns results as if the tables listed were combined into a single table based on the provided conditions. It’s worth noting that your conditions for comparison don’t have to be the same columns you return as your result set. In the example above, table1.column1 and table2.column1 are used to combine the tables, but table2.column2 is returned.

You can extend this functionality to more than two tables using AND keywords in the WHERE clause. You can also use such a combination of tables to restrict your results without actually returning columns from every table. In the example below, table3 is matched up with table1, but I haven’t returned anything from table3 for display. I’ve merely checked to make sure the relevant column from table1 exists in table3. Note that table3 needs to be referenced in the FROM clause for this example.
SELECT table1.column1, table2.column2 FROM table1, table2, table3 WHERE table1.column1 = table2.column1 AND table1.column1 = table3.column1;

Be warned, however, that this method of querying multiple tables is effectively an implied JOIN. Your database may handle things differently, depending on the optimisation engine it uses. Also, neglecting to define the nature of the correlation with a WHERE clause can give you undesirable results, such as returning the rogue field in a column associated with every possible result from the rest of the query, as in a CROSS JOIN.

If you’re comfortable with how your database handles this type of statement, and you’re combining two or just a few tables, a simple SELECT statement will meet your needs.

JOIN
JOIN works in the same way as the SELECT statement above—it returns a result set with columns from different tables. The advantage of using an explicit JOIN over an implied one is greater control over your result set, and possibly improved performance when many tables are involved.

There are several types of JOIN—LEFT, RIGHT, and FULL OUTER; INNER; and CROSS. The type you use is determined by the results you want to see. For example, using a LEFT OUTER JOIN will return all relevant rows from the first table listed, while potentially dropping rows from the second table listed if they don’t have information that correlates in the first table.

This differs from an INNER JOIN or an implied JOIN. An INNER JOIN will only return rows for which there is data in both tables.

Use the following JOIN statement for the first SELECT query above:
SELECT table1.column1, table2.column2 FROM table1 INNER JOIN table2
ON table1.column1 = table2.column1;

Subqueries
Subqueries, or subselect statements, are a way to use a result set as a resource in a query. These are often used to limit or refine results rather than run multiple queries or manipulate the data in your application. With a subquery, you can reference tables to determine inclusion of data or, in some cases, return a column that is the result of a subselect.

The following example uses two tables. One table actually contains the data I’m interested in returning, while the other gives a comparison point to determine what data is actually interesting.
SELECT column1 FROM table1 WHERE EXISTS ( SELECT column1 FROM table2 WHERE table1.column1 = table2.column1 );

One important factor about subqueries is performance. Convenience comes at a price and, depending on the size, number, and complexity of tables and the statements you use, you may want to allow your application to handle processing. Each query is processed separately in full before being used as a resource for your primary query. If possible, creative use of JOIN statements may provide the same information with less lag time.

UNION
The UNION statement is another way to return information from multiple tables with a single query. The UNION statement allows you to perform queries against several tables and return the results in a consolidated set, as in the following example.
SELECT column1, column2, column3 FROM table1 UNION SELECT column1, column2, column3 FROM table2;

This will return a result set with three columns containing data from both queries. By default, the UNION statement will omit duplicates between the tables unless the UNION ALL keyword is used. UNION is helpful when the returned columns from the different tables don’t have columns or data that can be compared and joined, or when it prevents running multiple queries and appending the results in your application code.

If your column names don’t match when you use the UNION statement, use aliases to give your results meaningful headers:
SELECT column1, column2 as Two, column3 as Three FROM table1 UNION SELECT column1, column4 as Two, column5 as Three FROM table2;

As with subqueries, UNION statements can create a heavy load on your database server, but for occasional use they can save a lot of time.

Multiple options
When it comes to database queries, there are usually many ways to approach the same problem. These are some of the more frequently used methods for consolidating queries on multiple tables into a single statement. While some of these options may affect performance, practice will help you know when it’s appropriate to use each type of query.

No comments: