Thursday, September 27, 2007

(ZT)Five Ways to Rev up Your SQL Performance

 


Johnny Papa
Data Points Archive

S

ometimes all it takes is a little tweak here or there to make your application run much faster. Ah, but the key is figuring out how to tweak it! Sooner or later you'll face a situation where a SQL query in your application isn't responding the way you intended. Either it doesn't return the data you want or it takes entirely too long to be reasonable. If it slows down a report or your enterprise application, users won't be pleased if they have to wait inordinate amounts of time. And just like your parents didn't want to hear why you were coming in past curfew, users don't want to hear why your query is taking so long. ("Sorry, Mom, I used too many LEFT JOINs.") Users want applications to respond quickly and their reports to return analytical data in a flash. I myself get impatient when I surf the Web and a page takes more than ten seconds to load (OK, more like five seconds).
To resolve these issues, it is important to get to the root of the problem. So where do you start? The root cause is usually in the database design and the queries that access it. In this month's column I'll demonstrate four techniques that can be used to either improve your SQL Serverâ„¢-based application's performance or improve its scalability. I'll examine the use of LEFT JOINs, CROSS JOINs, and retrieving an IDENTITY value. Keep in mind that there is no magic solution. Tuning your database and its queries takes time, analysis, and a lot of testing. While the techniques here are proven, some may work better than others in your application.
Returning an IDENTITY From an INSERT
I figured I would start with something I get a lot of questions about: how to retrieve an IDENTITY value after performing a SQL INSERT. Often, the problem is not how to write the query to retrieve the value, but rather where and when to do it. In SQL Server, the statement to retrieve the IDENTITY value created by the most recent SQL statement run on the active database connection is as follows:

SELECT @@IDENTITY
While this SQL is far from daunting, it is important to keep in mind that if the most recent SQL statement was not an INSERT or you run this SQL against a different connection than the INSERT SQL, you will not get back the value you expect. You must run this code to retrieve the IDENTITY immediately following the INSERT SQL and on the same connection, like this:
INSERT INTO Products (ProductName) VALUES ('Chalk')

SELECT @@IDENTITY
Running these queries on a single connection against the Northwind database will return to you the IDENTITY value for the new product called Chalk. So in your Visual Basic® application using ADO, you could run the following statement:
Set oRs = oCn.Execute("SET NOCOUNT ON;INSERT INTO Products _
(ProductName) VALUES ('Chalk');SELECT @@IDENTITY")

lProductID = oRs(0)
This code tells SQL Server not to return a row count for the query, then executes the INSERT statement and returns the IDENTITY value just created for the new row. The SET NOCOUNT ON statement means the Recordset that is returned has one row and one column that contains the new IDENTITY value. Without this statement, an empty Recordset is returned (because the INSERT statement returns no data) and then a second Recordset is returned, which contains the IDENTITY value. This can be confusing, especially since you never intended the INSERT to return a Recordset. This situation occurs because SQL Server sees the row count (that is, one row affected) and interprets it as representing a Recordset. So the true data is pushed back into a second Recordset. While you can get to this second Recordset using the NextRecordset method in ADO, it is much easier (and more efficient) if you can always count on the Recordset being the first and only one returned.
While this technique gets the job done, it does require extra code in the SQL statement. Another way of getting the same result is to use the SET NOCOUNT ON statement preceding the INSERT and to put the SELECT @@IDENTITY statement in a FOR INSERT trigger on the table, as shown in the following code snippet. This way, any INSERT statement into that table will automatically return the IDENTITY value.
CREATE TRIGGER trProducts_Insert ON Products FOR INSERT AS 
SELECT @@IDENTITY
GO
The trigger only fires when an INSERT occurs on the Products table, so it always will return an IDENTITY after a successful INSERT. Using this technique, you can consistently retrieve IDENTITY values in the same manner across your application.
Inline Views Versus Temp Tables
Queries sometimes need to join data to other data that may only be gathered by performing a GROUP BY and then a standard query. For example, if you want to return the information about the five most recently placed orders, you would first need to know which orders they are. This can be retrieved by using a SQL query that returns the orders' IDs. This data could be stored in a temporary table, a common technique, and then joined to the Product table to return the quantity of products sold on those orders:
CREATE TABLE #Temp1 (OrderID INT NOT NULL, _
OrderDate DATETIME NOT NULL)

INSERT INTO #Temp1 (OrderID, OrderDate)
SELECT TOP 5 o.OrderID, o.OrderDate
FROM Orders o ORDER BY o.OrderDate DESC

SELECT p.ProductName, SUM(od.Quantity) AS ProductQuantity
FROM #Temp1 t
INNER JOIN [Order Details] od ON t.OrderID = od.OrderID
INNER JOIN Products p ON od.ProductID = p.ProductID
GROUP BY p.ProductName
ORDER BY p.ProductName

DROP TABLE #Temp1
This batch of SQL creates a temporary table, inserts the data into it, joins other data to it, and drops the temporary table. This is a lot of I/O for this query, which could be rewritten to use an inline view instead of a temporary table. An inline view is simply a query that can be joined to in the FROM clause. So instead of spending a lot of I/O and disk access in tempdb on a temporary table, you could instead use an inline view to get the same result:
SELECT p.ProductName, 
SUM(od.Quantity) AS ProductQuantity
FROM (
SELECT TOP 5 o.OrderID, o.OrderDate
FROM Orders o
ORDER BY o.OrderDate DESC
) t
INNER JOIN [Order Details] od ON t.OrderID = od.OrderID
INNER JOIN Products p ON od.ProductID = p.ProductID
GROUP BY
p.ProductName
ORDER BY
p.ProductName
This query is not only more efficient than the previous one, it's shorter. Temporary tables consume a lot of resources. If you only need the data to join to other queries, you might want to try using an inline view to conserve resources.
Avoid LEFT JOINs and NULLs
There are, of course, times when you need to perform a LEFT JOIN and use NULL values. But they are not a solution for all occasions. Changing the way you structure your SQL queries can mean the difference between a report that takes minutes to run and one that takes only seconds. Sometimes you have to morph the data in a query to look the way your application wants it to look. While the TABLE datatype reduces resource gluttony, there are still plenty of areas in a query that can be optimized. One valuable, commonly used feature of SQL is the LEFT JOIN. It can be used to retrieve all of the rows from a first table and all matching rows from a second table, plus all rows from the second table that do not match the first one. For example, if you wanted to return every Customer and their orders, a LEFT JOIN would show the Customers who did and did not have orders.
This tool can be overused. LEFT JOINs are costly since they involve matching data against NULL (nonexistent) data. In some cases this is unavoidable, but the cost can be high. A LEFT JOIN is more costly than an INNER JOIN, so if you could rewrite a query so it doesn't use a LEFT JOIN, it could pay huge dividends (see the diagram in Figure 1).
Figure 1 Query
Figure 1 Query
One technique to speed up a query that uses a LEFT JOIN involves creating a TABLE datatype and inserting all of the rows from the first table (the one on the left-hand side of the LEFT JOIN), then updating the TABLE datatype with the values from the second table. This technique is a two-step process, but could save a lot of time compared to a standard LEFT JOIN. A good rule is to try out different techniques and time each of them until you get the best performing query for your application.
When you are testing your query's speed, it's important to run it several times and take an average. Your query (or stored procedure) could be stored in the procedure cache in SQL Server's memory and thus would appear to take longer the first time and shorter on all subsequent tries. In addition, other queries could be running against the same tables while your query runs. This could cause your query to stand in line while other queries lock and unlock tables. For example, if you are querying while someone is updating data in that table, your query may take longer to execute while the update commits.
One of the easiest ways to avoid slowdowns with LEFT JOINs is to design the database around them as much as possible. For example, let's assume that a product may or may not have a category. If the product table stores the ID of its category and there was no category for a particular product, you could store a NULL value in the field. Then you would have to perform a LEFT JOIN to get all of the products and their categories. You could create a category with the value of "No Category" and thus specify the foreign key relationship to disallow NULL values. By doing this, you can now use an INNER JOIN to retrieve all products and their categories. While this may seem like a workaround with extra data, this can be a valuable technique as it can eliminate costly LEFT JOINs in SQL batches. Using this concept across the board in a database can save you lots of processing time. Remember, even a few seconds means a lot to your users, and those seconds really add up when you have many users accessing an online database application.
Use Cartesian Products Wisely
For this tip, I will go against the grain and advocate the use of Cartesian products in certain situations. For some reason, Cartesian products (CROSS JOINS) got a bad rap and developers are often cautioned not to use them at all. In many cases, they are too costly to use effectively. But like any tool in SQL, they can be valuable if used properly. For example, if you want to run a query that will return data for every month, even on customers that had no orders that particular month, you could use a Cartesian product quite handily. The SQL in Figure 2 does just that.
While this may not seem like magic, consider that if you did a standard INNER JOIN from Customers to Orders, grouped by the month and summed the sales, you would only get the months where the customer had an order. Thus, you would not get back a 0 value for the months in which the customer didn't order any products. If you wanted to plot a graph per customer showing every month and its sales, you would want the graph to include 0 month sales to identify those months visually. If you use the SQL in Figure 2, the data skips over the months that had $0 in sales because there are no rows in the Orders table for nonsales (it is assumed that you do not store what did not occur).
The code in Figure 3 is longer, but can achieve the same goal of getting all the sales data, even for months without sales. First, it grabs a list of all of the months in the past year and puts them in the first TABLE datatype table (@tblMonths). Next, the code gets a list of all customers' company names who had sales during that time period and puts them in another TABLE datatype table (@tblCus-tomers). These two tables store all of the basic data required to create the resultset except the actual sales numbers.
All of the months are listed in the first table (12 rows) and all of the customers who had sales in that time frame are listed in the second table (81 for me). Not every customer purchased a product in each of the past 12 months, so performing an INNER or LEFT JOIN won't return every customer for every month. These operations will only return the customers and the months when they did purchase something.
A Cartesian product can return all customers for all months. A Cartesian product basically multiplies the first table by the second table and results in a rowset that contains the number of rows in the first table times the number of rows in the second table. Thus, the Cartesian product returns 972 rows into the table @tblFinal. The last steps are to update the table @tblFinal with the monthly sales totals for each customer during the date range and to select the final rowset.
Use CROSS JOINs with caution if you do not need a true Cartesian product because they can be very resource intensive. For example, if you do a CROSS JOIN on products and categories and then use a WHERE clause, DISTINCT or GROUP BY to filter out most of the rows, you could have gotten to the same result in a much more efficient manner by using an INNER JOIN. Cartesian products can be very useful when you need the data returned for all possibilities, as in the case when you want to load a graph with monthly sales dates. But you should not use them for other purposes as INNER JOINs are much more efficient in most scenarios.
Odds and Ends
Here are a few other common techniques that can help improve the efficiency of your SQL querying. Let's assume you are going to group all of your salespeople by region and sum their sales, but you only want salespeople who were marked active in your database. You could group the salespeople by region and use a HAVING clause to eliminate the salespersons who are not active, or you could do this in the WHERE clause. Doing this in the WHERE clause reduces the number of rows that need to be grouped, so it is more efficient than doing it in the HAVING clause. Filtering row-based criteria in the HAVING clause forces the query to group data that could have been eliminated in the WHERE clause.
Another efficiency trick is to use the DISTINCT keyword to find a distinct list of data rows instead of using the GROUP BY clause. In this case, the SQL using the DISTINCT keyword will be more efficient. Reserve use of the GROUP BY for occasions when you need to calculate an aggregate function (SUM, COUNT, MAX, and so on). Also, avoid using the DISTINCT keyword if your query will always return a unique row on its own. In that case, the DISTINCT keyword will only add overhead.
You've seen that numerous techniques can be employed to optimize queries and implement specific business rules; the trick is to try a few and compare their performance. Most important is to test, test, and test again. In future installments of this column, I'll continue to explore SQL Server concepts including database design, good indexing practices, and SQL Server security paradigms.

0 Comments:

Post a Comment

<< Home