What Is Wrong With This SQL?
By Sergei Romanenko
This article describes a quick and easy method of finding which step in an execution plan is causing a problem. A basic understanding of Oracle Explain Plan output is required.
Here is a common situation: you are writing a SQL statement that executes slowly. Or you have found a poor-performing SQL query in the database. In the real world you rarely deal with simple statements like "select * from emp, dept where emp.dept_id=dept.dept_id". Quite often you have to deal with queries having dozens of joins. This gets even worse when views hide the details of the underlying tables. So your first reaction is to run Explain Plan, which throws line after line (there can be several hundred!) of cryptic details at you. Crazy?
Have you tried the latest Oracle SQL Monitor? Yes, you should be able to find the problem, although you will need to identify the running session/statement and invoke SQL Monitor. It often breaks when a SQL statement has a large execution plan. And you need an additional license to use it. All of this makes the option less appealing.
Meanwhile, there is a simpler way to do the same thing, which works in 99% of cases and requires no extra licenses. Starting with Oracle 10.2 there is a feature called "SQL Plan Statistics". While executing a SQL statement, Oracle collects the most important statistics for each step of the execution plan. For example, one of the most interesting statistics is "Elapsed Time". Plan statistics are exposed through the v$sql_plan_statistics view. By default, Oracle does not collect these statistics. To turn the feature on, execute 'alter session set STATISTICS_LEVEL=ALL;'. Alternatively, we can force plan statistics collection by adding the GATHER_PLAN_STATISTICS hint to the query.
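Both ways of enabling collection can be sketched like this, reusing the emp/dept example from above:

```sql
-- Enable plan statistics for every statement in this session:
ALTER SESSION SET statistics_level = ALL;

-- ...or enable it for a single statement with a hint:
SELECT /*+ GATHER_PLAN_STATISTICS */ *
FROM   emp, dept
WHERE  emp.dept_id = dept.dept_id;
```

The session-level setting adds some overhead to every statement, so the hint is preferable when you only want to diagnose one query.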
It would be absolutely terrific if Oracle updated the statistics in v$sql_plan_statistics while the query is executing. By recording snapshots of this data, someone could create a tool far surpassing SQL Monitor, showing not only the real-time picture but also the entire dynamics of the execution. Unfortunately, this is not the case. Oracle updates v$sql_plan_statistics when SQL execution finishes or is interrupted. If execution is interrupted, the statistics are updated with the counts collected up to the moment of interruption. This behavior is very interesting for us, because we don't want to wait until a slow query finishes, which may never happen. Instead, we assume that execution quickly passes the well-performing steps and soon gets stuck on the problem steps. The statistics will then show these problem steps. The time needed before interruption depends on the query and is typically between 10 seconds and a few minutes. You can find it empirically by executing and interrupting a few times. In our experience, this was not an issue. Of course, if the query finishes reasonably quickly, that is even better, although it is not critical.
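After the run is interrupted, the per-step statistics can be queried directly. A sketch against the standard v$sql_plan_statistics_all view, which combines plan lines with their statistics (:sql_id and :child_no are placeholders for the statement you are examining):

```sql
-- Per-step starts, row counts, and elapsed time for the last execution.
SELECT id, operation, options,
       last_starts, last_output_rows, last_elapsed_time
FROM   v$sql_plan_statistics_all
WHERE  sql_id       = :sql_id
  AND  child_number = :child_no
ORDER  BY id;
```

LAST_ELAPSED_TIME is reported in microseconds; the step with the largest value is usually where to start looking.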
See How It Works In Practice
Now you know the principles behind this method. As you can guess, all the supporting work can be automated and put into a nice tool. We are going to show how the Lab128 Freeware tool makes it all work.
We will use the SQL Central built-in editor to run the query. In Lab128 Freeware, open SQL Central:
Connect to the Oracle database. Make sure the "Enable Plan Stats" checkbox is checked. Behind the scenes, this causes execution of 'alter session set STATISTICS_LEVEL=ALL;' in the current session:
Paste the query into the editor (or open a file), and execute it. If the query is going to run for an uncertain amount of time, stop it after a minute or so. Then you should see the Last SQL ID link:
Click this link to open the "Explain Plan" window. Make sure the "Exec Stats" and "Last Ex Only" boxes are checked. It is also easier to analyze the results if the "Pipeline" option is checked.
The execution plan will be shown with statistics for each step (in blue). This window can contain a lot of information depending on the complexity of the SQL statement, but it is not necessary to sort through all of this data. Lab128 analyzes the elapsed time of each step and then highlights a path to the most "expensive" step. Just follow the icons marked in yellow. In the picture below, the yellow arrows point to these icons. If the "Pipeline" option is checked, the most expensive step will be the farthest to the right and to the bottom of all the yellow icons. The plan statistics values that should be checked first are in bold red (the orange arrow in the picture points to them):
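Outside Lab128, the same per-step statistics can be printed from SQL*Plus or any client with the standard DBMS_XPLAN package; the 'ALLSTATS LAST' format restricts the output to the last execution:

```sql
-- Show the plan of the last statement executed in this session,
-- with actual starts, rows, and elapsed time per step (A-Rows, A-Time).
SELECT *
FROM   TABLE(DBMS_XPLAN.DISPLAY_CURSOR(format => 'ALLSTATS LAST'));
```

Comparing the E-Rows (estimated) and A-Rows (actual) columns in this output is the manual equivalent of the comparison discussed below.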
This is where the problem lies. Note that we didn't spend any time on insignificant steps; we jumped straight to the root problem!
You Know What Is Wrong, What Is Next?
It depends on the particular SQL statement and the type of database. For example, quite often you may see an unselective index being range-scanned, causing too many rows to be fetched only to be thrown away at the next step by a filter condition. The join order may be far from optimal; sometimes a join is even missing. It is also important to compare how many rows were actually accessed with how many the SQL optimizer estimated. For example, in the picture shown above, the query was supposed to run in the sub-second range in an OLTP environment, but instead was taking about a minute. In the test run it was interrupted after 14 seconds. As it happened, the query was spending most of its time range-scanning an index. By itself this operation was not a big deal. The real problem was that this range scan operation was invoked 1.5 million times, which is inappropriate for a supposedly fast query. Obviously, these 1.5M rows were generated by the MERGE JOIN (cartesian) step, line 8, which hinted at a missing join in the query (see blue pointers). As it turned out, a missing join was indeed the root cause of the problem.
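A minimal sketch of how a missing join produces a cartesian product; the tables and columns here are hypothetical, not the original query:

```sql
-- Three tables, but only one join condition: nothing relates a to the
-- other tables, so the optimizer has no choice but a MERGE JOIN
-- (cartesian), multiplying the row counts.
SELECT o.order_id, c.name, a.city
FROM   orders o, customers c, addresses a
WHERE  o.cust_id = c.cust_id;
-- missing condition: AND a.cust_id = c.cust_id
```

A cartesian step fed by more than one row on each side is almost always worth checking for exactly this kind of omission.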
Here is another example of a real-life query. It was taking one hour to execute. In our test run, the query was stopped after running for 1 minute. The cause of the problem is really simple: a full scan, the "TABLE ACCESS (full)" step on line 71. This would have been difficult to locate given that there were legitimate full scans in this 86-line execution plan!
And another example, from an OLTP database. The query showed up as the top query by CPU usage in an AWR report. Because overall CPU usage was very high and the query was frequently executed, it was logical to take a close look at its execution. It took 8 seconds to execute in our test environment. Lab128 showed us the steps where we should start investigating: lines 27 and 28. Our first reaction might be to blame an unselective index causing wide range scans. Indeed, 2.2 million index rows were accessed in 2,094 range scans (starts), or about 1,056 rows per scan (see line 28). This is not very efficient, because after accessing the table by row IDs in the next step and applying filters, only 65 rows were left for further processing (see line 27).
But wait a minute! Why were there 2,094 starts/scans? After all, every single one of them was doing exactly the same work. This was utterly inefficient; it was badly burning CPU cycles! At this point we have to scroll up and see how this subquery was invoked.
See the next picture; the node on line 22 was folded/collapsed so all the needed details are visible at once. What happened was the NESTED LOOPS on line 4 joining the result set returned on line 5, which produced 2,094 rows (and very quickly, in 0.169 seconds), with the subquery starting on line 23. So, for each row in (5), there was an execution of subquery (23)! Obviously the nested loop was completely suboptimal; the optimizer made this bad decision based on a wrong cardinality estimate on line 5: 1 row estimated versus 2,094 rows actually returned. To fix this query, someone will have to investigate this discrepancy, or force a HASH join to replace the NESTED LOOPS join.
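Forcing the hash join can be sketched with the standard USE_HASH hint; the table and alias names here are assumptions for illustration, not the original query:

```sql
-- Ask the optimizer to hash-join t2 instead of probing it in a
-- nested loop once per driving row. The hash table is built once,
-- so the inner side is scanned once rather than 2,094 times.
SELECT /*+ USE_HASH(t2) */ t1.id, t2.val
FROM   t1
JOIN   t2 ON t2.id = t1.id;
```

A hint like this is a quick confirmation of the diagnosis; fixing the underlying cardinality misestimate (for example, with fresh statistics) is the more durable repair.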
Note that we didn't look at the query text, and yet we were able to get a clear picture of the problem. No guesswork; we just followed the facts. This type of troubleshooting can't be done as quickly and efficiently using any other technique.
Covering all possible problems is beyond the scope of this article. But hopefully you now understand the basics: start with the trouble step, see all the facts, then investigate the linked steps. Fortunately, this forensic work is logical, and with some experience it becomes a straightforward exercise.
How to Troubleshoot Already Running Bad Query
The method described above should be very useful for SQL developers: write the query; execute it; see what's wrong; fix it; re-execute and check again. But can it be used if the bad query is already running?
Yes. The query can be staged to run in Lab128, and the bind variables can be captured and used in the execution. This feature is integrated into the "SQL Bind Variables Capture" report. The report can be invoked in many ways while using Lab128 for monitoring and troubleshooting. For example, in the "Sessions" window sessions are sorted by the LC_ET (Last Call Elapsed Time) column; see the picture below. This column tells us how long a session has been executing the currently running SQL statement. There is a report query that has been running for 1,383 seconds (23 minutes):
This can be a problem if business specifications allow this report query to run no longer than 1 minute. So we would like to stage the execution of this query the same way as in the previous chapters. Of course, we could simply copy the text of the query and run it; this would be perfectly fine if there were no bind variables. In general this is not the case, and our runaway query is using binds. Therefore, let's first capture the bind variables by calling the "SQL Bind Variables Capture" report: right-click on the SQL_ID, then select the corresponding menu item as shown in the picture above.
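Outside Lab128, bind values sampled by Oracle can be queried from the standard v$sql_bind_capture view (a sketch; :sql_id stands for the SQL_ID of interest):

```sql
-- Bind names, positions, and values captured for a given SQL_ID.
-- Oracle samples binds only periodically, so values may be stale
-- or missing for some positions.
SELECT name, position, datatype_string, value_string
FROM   v$sql_bind_capture
WHERE  sql_id = :sql_id
ORDER  BY position;
```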
The report presents the captured bind variables. It also contains a link, "Stage in SQL Central", which conveniently copies the SQL text and binds into a new SQL Central window. Then everything is ready for a test run as described above.
SQL Plan Statistics, combined with a proper tool, provide a powerful way to troubleshoot and diagnose poor-performing queries. We have demonstrated how this method is implemented in Lab128. The advantage of this method is that it provides a starting point for analysis, eliminating guesswork and tremendously speeding up the troubleshooting process.
That's all! You are armed with an efficient technique and free software, so now it's your turn. Please send your feedback to: or leave comments below.
Note: All examples in this article were done using Lab128 Freeware version 1.6.4.