Spatial indices are one of the greatest assets of PostGIS. As we saw in the previous example with ST_DWithin, determining the relationship between two geometries can often be sped up by first determining the relationship between the bounding boxes of the geometries. Indices take this concept one step further by indexing the bounding box of the geometry rather than the geometry itself.
Spatial indices are simple to create in PostGIS. The following commands would create an index on the geometry column of the jacksonco_streets table.
CREATE INDEX jacksonco_streets_gix ON jacksonco_streets USING GIST (the_geom);
The USING GIST clause tells PostgreSQL to use the generic index structure (GIST) when building the index. This is important because the default index type is the B-Ttree. B-Tree indices are not lossy (inexact) in the way a GIST index can be. This means that while the GIST index only indexes the bounding box of the geometry, the B-Tree must index the entire geometry, which can often be larger than the index can cope with. If you receive an error that looks like ERROR: index row requires 11340 bytes, maximum size is 8191 when creating your index, you have likely neglected to add the USING GIST clause.
While the index builds lets see what is actually happening when we use a spatial index.
The picture on the left shows a polygon and three lines. You can easily see that only the red line crosses the polygon, the other two do not touch or are disjoint. The picture in the middle shows these same objects with their minimum bounding rectangle, also know as their bounding box. Finally, on the right are just the bounding boxes.
We can see in this case that a bounding box filter that asks for the intersection of the polygon and the linestrings will match the red and blue lines and not the green line. The bounding box intersection operator is &&, and is added to the WHERE clause of a query as shown.
SELECT namelow FROM jacksonco_streets, medford_parks WHERE jacksonco_streets.the_geom && medford_parks.the_geom AND medford_parks.name = 'Hawthorne Park / Pool';
The && operator itself doesn’t guarantee that the geometries themselves intersect, just their bounding boxes. As such, it is rarely used on its own. For this reason most of the relational functions will include the && as part of the query. ST_DWithin for example includes two && operations that match the bounding box of each geometry with the expansion of the bounding box of the other. In this way, every ST_DWithin function call can take advantage of bounding boxes available on either or both geometries.
By storing the bounding boxes themselves in an R-Tree index, the boxes can be quickly located and compared and the number of full geometry comparisons reduced before the geometry is ever read from disk.
Our index should be created by now. In order to update the statistics that the database uses to select a query plan, we must run the following command.
VACUUM ANALYZE jacksonco_streets;
The ANALYZE command asks PostgreSQL to traverse the table and update its internal statistics used for query plan estimation (query plan analysis will be discussed later). The VACUUM command asks PostgreSQL to reclaim any unused space in the table pages left by updates or deletes to records. The VACUUM ANALYZE command performs both these actions, and will be discussed in more detail later.
I will continue to use the command version of VACUUM ANALYZE, but there is a way to do this through the GUI. Right click on the a table and select the Maintenance... option.
The resulting screen provides the options to execute VACUUM, ANALYZE or REINDEX commands. Selecting VACUUM and checking the ANALYZE option will perform the same actions as our query command. The Maintenance screen is available from the database, table or index level.
Now, let’s run our query again.
SELECT namelow FROM jacksonco_streets, medford_parks WHERE ST_DWithin(medford_parks.the_geom, jacksonco_streets.the_geom, 5000 / 0.3048) AND medford_parks.name = 'Hawthorne Park / Pool';
Now we can see some serious improvement in speed. Many of the relational operators PostGIS offers have built-in bounding box operations. Take a moment now and create indices on your the other spatial columns.
CREATE INDEX jacksonco_schools_gix ON jacksonco_schools USING GIST (the_geom); CREATE INDEX jacksonco_taxlots_gix ON jacksonco_taxlots USING GIST (the_geom); CREATE INDEX medford_buildings_gix ON medford_buildings USING GIST (the_geom); CREATE INDEX medford_citylimits_gix ON medford_citylimits USING GIST (the_geom); CREATE INDEX medford_hydro_gix ON medford_hydro USING GIST (the_geom); CREATE INDEX medford_parks_gix ON medford_parks USING GIST (the_geom); CREATE INDEX medford_planzone_gix ON medford_planzone USING GIST (the_geom); CREATE INDEX medford_stormdrain_gix ON medford_stormdrain USING GIST (the_geom); CREATE INDEX medford_wards_gix ON medford_wards USING GIST (the_geom); CREATE INDEX medford_wetlands_gix ON medford_wetlands USING GIST (the_geom); CREATE INDEX medford_zoning_gix ON medford_zoning USING GIST (the_geom); CREATE INDEX tracts_gix ON tracts USING GIST (the_geom);
It’s worth stressing that just creating an index is not enough to allow PostgreSQL to use it effectively. VACUUMing must be performed when ever a new index is created or after a large number of UPDATEs, INSERTs or DELETEs are issued against a table. This is important enough to the efficient running of the database that PostgreSQL provides an “autovacuum” option to perform this function automatically as tables are updated.
Autovacuum is enabled by default and will both vacuum (recover space) and analyze (update statistics) on your tables at sensible intervals determined by the level of activity. While this is essential for highly transactional databases, it is not advisable to wait for an autovacuum run after adding indices or bulk-loading data. If a large batch update is performed, you should manually run VACUUM.
Vacuuming and analyzing the database can be performed separately as needed. Issuing VACUUM command will not update the database statistics; likewise issuing an ANALYZE command will not recover unused table rows. Both commands can be run against the entire database, a single table or a single column.
Clustering is a PostgreSQL feature that allows the reordering of a table on disk based on an index. This is simple to understand for standard B-Tree indices, such as for numeric or text fields.
The table is essentially completely rewritten from an initial unordered state into an ordered state. This ensures that records with similar attributes have a high likelihood of being found in the same page, reducing the number of pages that must be read into memory for some types of queries. This is harder to visualise for spatial data since the index doesn’t have a simple linear ordering, but the effect is the same. Geometries that are near each other in space are near each other on disk.
The command is simple enough:
CLUSTER jacksonco_streets USING jacksonco_streets_gix;
Since the database statistics include information about table ordering, it is highly recommended to analyze the table after clustering. Vacuuming is unnecessary.