The goal is to optimize a set of queries so that they run faster and overall performance improves. By knowing the meaning of your data you can help the query optimizer develop a better query plan, and in some cases you can override the optimizer and force it to follow a particular plan.
Suppose you have a batch of queries that you want to optimize. First identify which query is creating the problem, and then identify where the problem lies in that query. The query that takes the most time is the first one to consider for optimization. To check which query takes the longest to execute, capture getdate() both before and after the query. Also identify the type of the query: is it a data look-up query or a data modification query?
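For example, a minimal sketch of this timing technique, assuming a hypothetical orders table; the statement in the middle is whichever query you are measuring:

declare @start datetime, @end datetime
select @start = getdate()
-- the query under test
select count(*) from orders where order_date >= '20240101'
select @end = getdate()
-- elapsed time in milliseconds
select datediff(ms, @start, @end) as elapsed_ms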
Use ‘set showplan on’ to see what query plan SQL Server develops for the query. If a query takes a long time to execute, it is not practical to run it again and again. Use ‘set noexec on’ to build the query plan without executing the query. When these two options are set, SQL Server will parse the query, optimize it and develop a query plan, but will not execute it.
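A minimal sketch of using these options together, assuming a hypothetical titles table; note that showplan should be turned on before noexec, otherwise no plan output appears:

set showplan on
go
set noexec on
go
-- parsed, optimized and the plan displayed, but never executed
select title, price from titles where price > 20
go
set noexec off
go
set showplan off
go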
As SQL Server uses a cost-based query optimizer, it's always better to turn on the following options to check the statistics of the query:
set statistics io on: gives actual logical and physical page reads incurred by the query
set statistics time on: gives total CPU and elapsed time to execute a query and the time to parse and compile the query
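For example, a minimal sketch, again assuming a hypothetical titles table:

set statistics io on
set statistics time on
go
-- the server reports logical/physical reads plus parse, compile and execution times
select title, price from titles where price > 20
go
set statistics io off
set statistics time off
go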
---------INDEX------------
Look at the query plan generated by the optimizer. Check whether the optimizer is picking proper indexes for the SARGs, OR clauses and join operations. See whether it is using indexes to execute the query at all, and if so, which indexes are used. If there is an index on the table and the optimizer is not using it to resolve the query, try to figure out why.
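One common cause is a predicate that is not a valid SARG. A minimal sketch, assuming a hypothetical orders table with an index on order_date:

-- SARGable: the indexed column stands alone on one side of the comparison,
-- so the optimizer can consider the index on order_date
select * from orders where order_date >= '20240101'

-- not SARGable: the column is buried inside an expression, which typically
-- forces a table scan even though the same rows qualify
select * from orders where datediff(day, order_date, getdate()) <= 30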
---------UPDATE STATISTICS-------------------
As general advice, once roughly 20% of a table has been modified, you should run update statistics for the table. Update statistics updates the distribution page (the server keeps distribution information for each index on a separate page in the database, i.e. the distribution statistics) for the indexes on the table.
Use the ‘update statistics’ command to update the statistics for the indexes on a table:
To update the statistics of all the indexes on the table:
update statistics table_name
To update the statistics of a particular index on the table:
update statistics table_name index_name
Run dbcc traceon(302) to get diagnostic output showing how the query optimizer is handling index selection and OR clauses. Run dbcc traceon(310) to see how the optimizer is picking the join order (the permutation details).
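A minimal sketch; trace flag 3604 is turned on first so the diagnostic output is sent to the client session (the table names here are assumed):

dbcc traceon(3604)   -- route trace output to the client
dbcc traceon(302)    -- index selection and OR clause costing
dbcc traceon(310)    -- join order (permutation) details
go
select t.title, s.qty
from titles t, sales s
where t.title_id = s.title_id and t.price > 20
go
dbcc traceoff(310)
dbcc traceoff(302)
go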
Run the query again and see whether it is using the indexes or not. If the query is still not using any index, and you think an index could execute the query faster, you can force the optimizer to use that index by specifying the index ID number after the table name in the query. SQL Server will then use that index to satisfy the query. You can always use the ‘set statistics io’ option to verify whether there was an I/O saving over the optimizer's choice.
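A minimal sketch of forcing an index this way, assuming a hypothetical titles table whose index on price has index ID 2 (index IDs can be checked with sp_helpindex); the exact hint syntax varies between server versions:

-- let the optimizer choose the access path
select title, price from titles where price > 20

-- force index ID 2 (the price index) by placing the ID after the table name,
-- then compare the 'set statistics io' output of the two versions
select title, price from titles (2) where price > 20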
Identify which kind of index is best to resolve the query: clustered or non-clustered. A clustered index is better for range queries, when data is retrieved in sorted order, when the column has many duplicate values, and when the column is frequently referenced in the order by clause. A non-clustered index is better for single-row lookups, queries containing joins, and queries with small range retrievals.
Keep the following things in mind for indexes:
·Unless the OR strategy is applied, SQL Server uses only one index per table to satisfy the query.
·Sometimes a table scan may be cheaper than a candidate index in terms of total I/O. e.g. large range retrievals on a non-clustered-index column.
·A non-clustered index that covers a query will be faster than a similarly defined clustered index.
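A minimal sketch of index covering for the last point, assuming a hypothetical sales table; because every column the query touches is in the non-clustered index, the query can be answered from the index leaf pages without touching the data pages:

create nonclustered index sales_cover_idx on sales (title_id, qty)

-- both referenced columns are carried in the index, so the query is covered
select title_id, sum(qty) from sales group by title_id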
----------JOIN AND SUBQUERY---------------
If the query contains any joins, look at the join order selected by the optimizer. Joins are performed as a set of nested loops. Generally, join queries consume more memory than subqueries, while a subquery may involve creating a worktable and therefore costs extra I/O during processing. A subquery tends to be the better choice when you have a hardware bottleneck (memory constraints), so check whether you can replace the join with a subquery. But if you can scale up your hardware (nowadays that is not a big deal), go with the join.
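A minimal sketch of such a rewrite, assuming hypothetical titles and sales tables joined on title_id; both forms return the titles that have at least one sale:

-- join version
select distinct t.title
from titles t, sales s
where t.title_id = s.title_id

-- equivalent subquery version
select t.title
from titles t
where t.title_id in (select s.title_id from sales s)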
The order in which the join is performed affects the total number of I/Os performed. By knowing the meaning of the data you can select a particular join order and check whether it improves performance. You can force the join order with the ‘set forceplan’ option:
set forceplan {on / off}
Keep in mind that the optimizer may have been using different indexes before you forced the join order, as certain indexes are not useful for certain join orders. As the number of tables increases, the query optimizer takes more time to determine the join order. If the optimizer is picking the proper join order but spending a lot of time determining it, you can force that particular join order and save the time. Another reason the query may not pick up the proper indexes is that the ‘set table count’ value is set to a lower number than the number of tables involved in the join operations.
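A minimal sketch using assumed tables; with forceplan on the tables are joined in the order they appear in the from clause, and set table count raises how many tables the optimizer considers at a time when costing join orders:

set table count 8   -- consider up to 8 tables at a time during join-order optimization
go
set forceplan on
go
-- with forceplan on, titles is joined as the outer table and sales as the inner table
select t.title, s.qty
from titles t, sales s
where t.title_id = s.title_id
go
set forceplan off
go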
-------------DML-------------
Another factor in optimizing queries is avoiding transaction logging in data modification queries. In the following cases we can avoid (or minimize) the logging:
·Use fast BCP instead of slow BCP so that the inserted data rows are not logged.
·Use truncate table instead of delete.
·Use select into instead of creating a temporary table and inserting into it, to avoid logging the inserted rows.
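A minimal sketch of the last two points, assuming hypothetical orders and staging_orders tables (select into requires the ‘select into/bulkcopy’ database option to be enabled):

-- minimally logged: only page allocations are logged, not every inserted row
select * into #open_orders from orders where status = 'open'

-- fully logged alternative: create table #open_orders (...) followed by
-- insert #open_orders select ... , which logs every row

-- truncate table logs only the page deallocations, whereas delete logs every row
truncate table staging_orders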
A table having indexes and triggers takes a long time to perform data modification queries. Depending on how important the query is, you can think of dropping the indexes during the daytime and rebuilding them at the end of the day. For these types of queries, index usage is often discouraged, because the update statements must also maintain every index on the table.
If you need to perform a large data modification, it's better to drop the indexes first and rebuild them after the data modification operation is complete.
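A minimal sketch of the drop-and-rebuild pattern, assuming a hypothetical sales table with a non-clustered index named sales_title_idx and a staging table:

-- drop the index so the bulk modification does not have to maintain it
drop index sales.sales_title_idx

-- perform the large data modification
insert sales select * from sales_staging

-- rebuild the index once the modification is finished
create nonclustered index sales_title_idx on sales (title_id)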
Another factor to consider is whether the query uses a worktable (temporary table) to resolve the query. Creating a worktable to solve a query adds significant processing time. Queries containing ‘order by’, ‘group by’ or ‘distinct’ often use a worktable. Check whether you can satisfy the query while avoiding these clauses. Possible ways to avoid them are:
·If the column has a unique or primary key constraint, then there is no need to specify the ‘distinct’ clause (see the example after this list).
·If you join more than four tables, the server will typically use a worktable to resolve the query. Try to keep the number of joined tables to a minimum; consider a subquery to avoid joining more than four tables.
·If there is an index on the column that already keeps the data sorted, you don't need to specify an ‘order by’ clause in the query.
·If you know the data is unique, then there is no need to specify a ‘group by’ clause in the query.
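For instance, a minimal sketch assuming a hypothetical authors table whose au_id column is the primary key:

-- au_id is already unique, so distinct only forces an unnecessary worktable and sort
select distinct au_id from authors

-- equivalent result without the worktable
select au_id from authors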
Check for the following scenarios in the where clause of the query:
·Negative logic (!=, <>): negative logic generally results in a table scan unless index covering is applied.
·Calculated comparison: if the right-hand side of an expression has to be calculated first, its value is not known until the execution stage, so the optimizer cannot use it when choosing indexes at compile time.
·If a query frequently uses a computed expression, add a column holding the computed value to the table, and write a trigger to maintain that column. E.g. if a query always calculates a discounted price from a table, it's better to store a discounted-price column in the table (see the sketch after this list).
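A minimal sketch of that last point, assuming a hypothetical titles table with price and discount columns; the trigger keeps the stored discounted_price in step with the base columns:

-- add and populate the precomputed column
alter table titles add discounted_price money null
update titles set discounted_price = price * (1 - discount)

-- keep it current whenever a row is inserted or updated
create trigger titles_disc_trg on titles
for insert, update
as
update titles
set discounted_price = t.price * (1 - t.discount)
from titles t, inserted i
where t.title_id = i.title_id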
A trigger fires for every data modification statement, which is another overhead on the query. Depending on how important the query is, you can drop the trigger and write a stored procedure instead, then run that stored procedure at the end of the day to perform the work the trigger would have done. E.g. if you handle a heavy OLTP load during the day, you can consider dropping the trigger and running the stored procedure at the end of the day.
Depending upon the type of queries, you can think of de-normalizing the tables. The following are options for de-normalizing the data, depending upon the importance of the queries:
·Inserting calculated columns: this incurs extra overhead to maintain the calculated column value, because whenever a data row is inserted you need to calculate the value of that column. Depending upon the importance of the query, it can still be worthwhile.
·Vertical merge of tables: if a query always performs a join of two or more tables, it's better to merge those tables into a new table so that query runs faster. Again, it depends on how important the query is and how you are going to manage the duplicated data: duplicating data makes the query run faster, but it incurs the extra overhead of maintaining that duplicate data.
·Vertical split of tables: if a query always selects certain columns from the table, it's better to split those columns out into a separate table and run the query against it to improve performance.
·Horizontal merge of tables: merge the data horizontally. E.g. if a query normally shows only the customer's transactions for the current month, but the user wants to see records for older months as well, you can horizontally combine data from the old-transactions table and the recent-transactions table to produce the result (see the sketch after this list).
·Horizontal split of tables: split the data horizontally. E.g. in an online banking scenario, split the data into most-recent transactions and old transactions, so that you can show just the most recent transactions instead of scanning the whole transaction history.
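A minimal sketch of the horizontal split/merge idea, assuming hypothetical recent_transactions and old_transactions tables with identical structure:

-- everyday queries touch only the small recent table
select * from recent_transactions where customer_id = 42

-- when older months are requested, combine the two tables horizontally
select * from recent_transactions where customer_id = 42
union all
select * from old_transactions where customer_id = 42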