Oracle Tuning Using Frequent SQL Statistics Snapshots (FSSS) MethodBy Sergei Romanenko
This article describes how to locate problem SQL more accurately and efficiently. The new FSSS method proved to outperform other methods in troubleshooting of real life production databases. The examples demonstrate FSSS method in action.
Introduction. Evolution of v$sql and v$sqlstats views
Oracle database provides v$sql view which contains performance statistics of cached SQL statements. The number of statistics has been increasing over several versions and in 11g there is over 30 statistics providing deep insight into performance of individual SQL statements. Most of statistics in v$sql are cumulative - as soon as SQL statement is executed, statistic values are incremented by the number tallied by the execution. Starting in Oracle 10g, v$sql became even more advanced, as statistics being updated during the execution before it is over. This can be used to track SQL execution dynamically in near-real time.
The v$sql view returns values accumulated since the time when SQL statement was first parsed. Even though it has some limitation, it is very easy to find top SQL in different categories, such as Elapsed Time, CPU Time, Buffer Gets, Disk Reads - to name just few of them. This is valuable, but it has severe limitation due to the fact, that SQL area has a finite size and acts as a cache - it contains the most recent and frequently executed SQL statements. If the database usage pattern changes, the composition of SQL area changes too. Therefore, there is another technique used to find top SQL in particular time interval: taking snapshot of v$sql at time t1, another one at t2, and then calculating delta between them. This technique can be extremely useful (although daunting without automation) for narrow time intervals. Narrow - means that the composure of SQL area didn't change dramatically during this time interval. This in turn implies that any significant SQL (in terms of performance) during that interval still remains in SQL area at time t2. One variation of this technique is flushing SQL area by running 'alter system flush shared_pool' command at time t1, which effectively "zeroes" everything, eliminating needs of the first snapshot and calculating delta. But this variation is hardly acceptable in the production system.
In Oracle 10g Release 2, a new view v$sqlstats was introduced, it aggregates v$sql data per unique combination of SQL text and execution plan, and it is supposedly much faster and more scalable. This view is updated on a regular base, close to 5 seconds. It has column with time of last update of the row, which helps querying incrementally, only returning rows updated since previous run. Combination of all these properties begs for a tool able to store series of snapshots taken several times per minute, enabling very sophisticated platform for performance troubleshooting.
Challenges to overcome
First, it can be very expensive to store that amount of data. A large Oracle instance can easily have 5-10 thousand SQL statements in the SQL area. So if we decide to store 64-bit numeric values for 30 statistics, this will translate into 1.2 - 2.4 MB. To keep a one-hour history of snapshots taken every 15 seconds will require almost a half of gigabyte. This approach doesn't look practical. Fortunately, we can focus on limited number of most expensive queries, say 100. We will have to keep only last full snapshot of all statements; when getting a new one, delta can be calculated and the most expensive 100 identified. Then statistics for these top queries can be stored. Different criteria can be used to measure cost of the SQL, such as Elapsed Time, Buffer Gets, Disk Reads, or combination of them. There are other subtle challenges, such as avoiding replacement of SQL statement too soon from the set of tracked statements, so the algorithm based on cost and recency should be used when replacing SQL.
Benefits of the new FSSS method
The new method provides very accurate performance data on SQL statements. It can be used in any type of SQL troubleshooting, particularly for troubleshooting of short spikes of resources usage where other methods lack precision. This method is particularly indispensable in the environment with high frequency short queries, such as Online Transaction Processing (OLTP) Systems. It provides precise and reliable data on the cause of performance problem. There is a number of unique features of this method, such as visualization of patterns of execution and row processing; physical/logical read ratios of individual SQL statements as a function of time, etc.
Examples of troubleshooting using the new method
Example 1. Troubleshooting CPU spikes
The following examples have been taken from Lab128 performance troubleshooting tool, where this technique was first implemented in 2007.
The first picture shows troubleshooting of random spikes in the CPU loads. We will select time interval just around one spike.
The tabular view shows delta calculated from collected snapshots. Each row represents a query. Rows are sorted by CPU usage in descending order. We can assume that the top statement is responsible for the CPU spikes. To confirm that, we added a graph with CPU usage for this individual statement (shown in blue color):
It's easy to see correlation between spikes. As we found problematic SQL statement, now we can either optimize it or research other options, such as avoiding or rescheduling execution, to name few. However, these actions are beyond the scope of this article.
Another example shows even worse situation where CPU gets saturated for about a minute, causing performance degradation and noticeable increase in response time during these intervals.
We need to repeat troubleshooting process. First, we select time interval (already shown in the picture). The tabular view shows SQL statements sorted by CPU usage:
Again, adding a graph of the first SQL statement reveals correlation:
High temporal resolution allowed finding the cause of intermittent CPU clogging quickly and with high certainty and precision.
Example 2. Troubleshooting Input/Output intermittent problems
Same technique is employed for tracking down usage of system resources. In this example, we are concerned about I/O problems caused by relatively short intervals of time when physical reads are excessive.
We are going to find SQL statements responsible for this I/O activity. The time interval is selected and now we sort data in tabular view by "Disk Reads":
Also we added graphs for top four SQL statements (in blue color) in the same order as they appeared in the top list. It is obvious that first three entries were a culprit. In fact, the first one is an anonymous PL/SQL block that launched the other two SELECT statements. As we see, the numbers reported for anonymous block were aggregates of individual SQL statements comprising the block. Some other findings on these statements: there were many executions (4649) of the anonymous block or 15 /second. The SELECT statements had similar frequency. The number of Gets per Execution was small - less than 5/exec, which indicates usage of selective indexes and excellent efficiency of the queries. Yet, the Miss Ratio was quite high: up to 40%. Based on that, we can guess there was a random access of database blocks over a large portion of corresponding tables and indexes. Database cache was not big enough to cache all these blocks, resulting in disk reads. This may hint at increase of DB cache, but this may not be possible or practical, depending on size of involved tables. This may prompt to redesign how data is accessed. There can be a number of possible solutions to this problem but this discussion is out of the scope of this article. The purpose of this example was to show how to find the cause of abnormal I/O activity quickly and reliably.
On the picture above, we included the 4th top statement, just in case, to see that it had very different I/O pattern. As far as intermittent I/O spikes were a concern, this 4th statement (which happened to be another anonymous PL/SQL block) was unrelated at all.
How this method can be used by Developers or Performance Analysts
Because this method can provide precise statistics about individual SQL statements, it can be used by developers to obtain a score of execution data from the database side. For example, this can visualize execution pattern of individual statement, as shown on this picture:
The developer was interested in a particular query "SELECT IT_ITEM..." which was found in the tabular view. Then she wanted to see the execution pattern and added a graph with Executions of this statement (shown in blue color).
This particular SQL statement was executed during some time intervals. So, the interval was selected and it was easy to find that there were 6716 executions, or about 18 per second. Also it could be seen that this statement was running in series of quick executions, having on average 0.04 second of elapsed time per execution. There were only 4 gets per execution, so this SQL was very efficient. Also we can see that large number of accessed blocks were not found in buffer cache and read from disks - the Miss Ratio was 39%. And probably developer would be glad to know that on average there were 3.8 rows/exec returned by this SQL.
If necessary, other statistics for this SQL statement can be seen both in numbers and graphically:
Total elapsed time for all executions was 289 microseconds; almost all of that time was spent on reads as reflected by "USER I/O Wait Time" - 288 microseconds. This should be expected, because this SQL statement was accessing data from very large table using indexes. It was not possible to cache the entire table and index into the buffer cache.
Oracle provides detailed performance data on individual SQL statements though v$sql and v$sqlstats views. Recording this data at small time interval allows troubleshooting of intermittent performance problems not possible by other techniques. Troubleshooting more persistent problems can be done as well, with higher precision and reliability than any other techniques. This method is actively used by Lab128 performance troubleshooting tool.