OpenGeo

Introduction to PostGIS

Table Of Contents

Previous topic

Section 12: Tuning PostgreSQL for Spatial

Next topic

Section 14: Visualisation

Section 13: Query Plan Analysis

A query plan is an unambiguous set of steps that PostgreSQL will step through to complete a query. This may include searching tables or indices for records, joining sets of records together, updating or deleting records, or formatting results as requested. When any query is executed, PostgreSQL develops a number of potential query plans describing the steps required to produce the desired result. Since any single query may have several different but equally correct query plans the query planner will construct several for evaluation.

An estimated execution cost is then calculated using table, index, and column statistics collected by PostgreSQL. From these statistics, it is able to estimate the number of records and consequently the number of blocks of data that must be read from disk. Details such as potential page cache hits and ordering of data are used to determine the speed at which data can be read. These speeds are represented by unit-less parameters that describe the relative “cost” of such accesses as random and sequential disk reads and memory cache accesses. Since the actual values of the parameters used to create the cost estimate have meaning only relative to each other, and the query planner does not attempt to account for current load on the database, this cost is not an absolute estimate of time and cannot be relied upon as such.

Once the cost of the query plans have been estimated, the lowest cost plan is selected for execution. While the process to this stage is hidden from us, the final plan selection is not. The Explain command will generate a description of the query plan that is ultimately chosen.

To investigate a simple query plan, we must first configure pgAdmin to provide us with a full analysis of the query. Under the Query menu, select the Explain options sub-menu and make sure that Analyze is checked. This will cause the query to be executed and both the planned costings and the actual execution time will be displayed. Using Analyze in conjunction with Explain is far more powerful than using Explain alone, as it shows where the query planner has made mistakes. But since the query is executed, this process can take a long time to complete.

_images/explain01.png

Now enter the following query into the query window as shown below, but do not execute the query. Instead, click the Explain query button.

SELECT namelow FROM jacksonco_streets, medford_citylimits
  WHERE ST_Intersects(jacksonco_streets.the_geom,
      medford_citylimits.the_geom)
  GROUP BY namelow;
_images/index03.png

In the resulting diagram we can see the four different stages involved in executing the query. Additional details can be viewed by clicking on one of the icons. These stages are:

Sequence Scan

_images/sequencescan.png

A sequence scan is a linear scan of each and every row in the table. In this case medford_citylimits is scanned. The query planner has estimated a cost of 1.01 and an expected result of a single record. The actual timing corroborates the query planner’s estimates, showing a run time between 0.144 and 0.147 milliseconds. While the cost estimate cannot be directly translated into a time with any reliability, a cost of 1 is very low, on the order of a single sequential disk access, so the elapsed time of 0.003ms is appropriately tiny. Click on the graphic in the box to see the details.

_images/plan05.png

Index Scan

_images/indexscan.png

An index scan is just what it sounds like; PostgreSQL will scan the index of the indicated table, in this case jacksonco_streets_gix, to identify the required subset of records. This subset is then pulled from the table and passed to the next step of the query plan. You can see below that the filter used against the index is (jacksonco_street.the_geom && medford_citylimits.the_geom), resulting in a subset of streets that still need to be checked for proper intersection against the city limit polygon. The query planner has estimated the cost to be 8.27 with an expected result of 1 record.

The actual results tell a different story. The query planner expected only one row to result from the index scan, but in reality 4582 rows were retrieved. Comparing the cost with the actual time shows a considerable discrepancy as well, with 5235ms from an estimated 8.27 effort.

_images/plan06.png

Nested Loop

_images/nestedloop.png

The nested loop performs a join between the two results sets using the filter criteria _st_intersects(jacksonco_streets.the_geom, medford_citylimits.the_geom). The order of this step relative to the two scans is somewhat misleading; the nested loop uses the results of the sequence scan to provide the filter criteria for the index scan. The loop performs the index scan for each record returned by the sequence scan. The query planner has estimated the cost to be 9.30, which includes the execution of both the sequence and index scans, with an expected result of 1568 records.

Much like the sequence scan, the nested loop has underestimated both the effort and number of results. These errors can quickly inflate the execution times of queries, in particular when they effect loops. It is also worth noting that the execution time of the nested loop overlaps both the sequence scan and index scan.

_images/plan07.png

Note

The functions listed in the Explain Results are not the same as the functions from our query. This is because the ST_Intersects function is actually just a wrapper to the && bounding box comparison and the _ST_Intersects function that performs the intersection.

Hash Aggregate

_images/hashaggregate.png

A hash aggregate is used for grouping results when a hashing algorithm is available for the data type being grouped, in this case the varchar, namelow. It is significantly faster than the alternative of sorting and grouping it two stages. The query planner expects only a single row to be input into the hash aggregate, and as such expects only one row to be returned. We can see that the estimated cost starts at 13.22, after the nested loop completes, and finishes at 28.90, processing the 1568 records produced by the nested loop.

_images/plan08.png

The estimates of effort reported by the Explain feature are only relatively meaningful, being comprised of unit-less parameters such as random_page_cost described in the previous section. But it does give a ballpark idea of how long the query might take. In this case, the overall effort is estimated at 28.90 of which is 8.27 is allocated to the repeated index scan and 15.78 to grouping by namelow using a hash aggregate.

In this case, the actual timing details are far more favorable than the estimated effort. Instead of taking almost half the cost, the aggregate has run in fewer than 2ms.

This example illustrates the need for an Explain Analyze, instead of the pure Explain that provides only the cost estimates. Without the specific details, you don’t get the whole picture of where the inefficiencies in your query lie. Running Explain without Analyze is a good first step if you expect your query to take a long time, but it will rarely be your last step in tuning a query.