How we used Postgres extended statistics to achieve a 3000x speedup

Jared Rulison
Published in Affinity
Aug 12, 2020 · 7 min read


Go on. Guess.

Much like the DMV, the PostgreSQL query planner is a powerful, mysterious entity to whom we semi-blindly entrust our well-being. It has the crucial responsibility of picking the most efficient execution plan for every query. Here we’ll explore what data Postgres takes into account when creating query plans, and how we used that context to help the query planner come up with more efficient plans for some of our most important query patterns.

Here’s an example slow query issued from our web server, along with the inefficient query plan that Postgres chose. Can you spot the key mistake the query planner made?

By far the most expensive step is the second Nested Loop Join:

Nested Loop Semi Join (cost=1.01..25.07 rows=1 width=4) (actual time=0.079..122074.806 rows=1958 loops=1)

Postgres estimated that this step would return about 1 row, which was a wild underestimate — it actually returned 1958 rows and took about 122 seconds. (See here for more background on how to interpret Postgres query plans.)
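If you want to compare the planner's estimates against reality for your own queries, EXPLAIN ANALYZE is the tool. Here's a quick, generic sketch; the query itself is just an arbitrary runnable example, not the one from our web server:

-- EXPLAIN ANALYZE actually executes the statement and annotates every plan
-- node with the planner's estimate (rows=...) next to what really happened
-- (actual time=... rows=... loops=...).
EXPLAIN (ANALYZE, BUFFERS)
SELECT relname FROM pg_class WHERE relkind = 'r';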

Through informed use of Postgres statistics, we brought the time for this query down from 2 minutes to 42 milliseconds — almost a 3000x speedup! Before we dive into the stats adjustments that we made, let’s make sure we understand how the Postgres planner works.

Postgres Statistics and Query Plans

Basic Statistics

Statistics are data that Postgres collects to inform its choice of query plans. Out of the box, Postgres samples the possible values of each column in each table to build histograms and a list of the most common values (among other things). These are used to estimate how many rows will result from applying some set of filters to a table.

For larger tables, the planner can’t keep track of every single value a column holds. Instead, it samples the values of each column and uses those to make estimations. We can tweak how much sampling Postgres does for each column on each table with

ALTER TABLE table_name ALTER COLUMN column_name SET STATISTICS <target>;

where the target ranges from -1 to 10,000, and -1 reverts the column to the system default, which is 100 unless default_statistics_target has been changed (docs). This number sets how many buckets are used in the histogram and how many of the most common values are stored.

The downsides to increasing the statistics for a column are that more data must be stored in pg_statistic and running ANALYZE on the column's table takes longer.
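For example, a minimal sketch (the table and column names here are hypothetical):

-- Raise the sampling target for a heavily filtered column, then re-sample so
-- the planner sees the larger histogram and most-common-values list.
ALTER TABLE users ALTER COLUMN country SET STATISTICS 1000;
ANALYZE users;

-- The per-column statistics Postgres collected are visible in the pg_stats view.
SELECT attname, n_distinct, most_common_vals, histogram_bounds
FROM pg_stats
WHERE tablename = 'users' AND attname = 'country';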

More details can be found in the Postgres docs.

Extended Statistics

Extended statistics are user-defined objects that tell Postgres to collect certain kinds of data for sets of columns, rather than individual columns.

Without extended statistics, Postgres estimates the impact of filters on a table by considering each filter independently. For example, consider a database containing 10 Artist records, each of which has 10 Album records referencing it, each of which in turn has 10 Song records referencing it. That totals 10 Artists, 100 Albums, and 1,000 Songs. Now, consider running the following query:

SELECT * FROM songs WHERE (artists_id = 1 and album_id = 1);

With perfect sampling, the query plan might look like

Index Scan using songs_artists_id_album_id_index on songs  (cost=0.28..6.05 rows=1 width=159) (actual time=5.555..5.562 rows=10 loops=1)
Index Cond: ((artists_id = 1) AND (album_id = 1))
Planning Time: 311.482 ms
Execution Time: 9.266 ms
(4 rows)

(cost=0.28..6.05 rows=1 width=159) refers to the planner's estimations, while (actual time=5.555..5.562 rows=10 loops=1) refers to the actual results of executing the plan. The planner estimated 1 row would be returned, but there were actually 10.

The planner calculated its row estimate by starting from the total number of Songs (1,000) and then considering the artists_id filter: 10% of Songs have artists_id = 1, which leaves 100 Songs. Next it considered the album_id filter: 1% of Songs have album_id = 1, which leaves 1 Song.
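If you'd like to follow along, here is a minimal sketch that reproduces this data distribution; the DDL and the generate_series trick are our assumptions, since the article only describes the record counts:

-- Hypothetical reproduction of the songs data described above.
CREATE TABLE songs (
    id         serial PRIMARY KEY,
    artists_id integer NOT NULL,
    album_id   integer NOT NULL,
    title      text
);

-- 10 artists × 10 albums each × 10 songs each = 1,000 rows. Album ids are
-- globally unique (1..100), so album_id uniquely determines artists_id.
-- Note: in the article's songs table these columns sit at attribute
-- positions 1 and 5 (hence the "1 5" in the statistics output below);
-- in this sketch the positions differ.
INSERT INTO songs (artists_id, album_id, title)
SELECT a, (a - 1) * 10 + b, 'song ' || s
FROM generate_series(1, 10) AS a,
     generate_series(1, 10) AS b,
     generate_series(1, 10) AS s;

CREATE INDEX songs_artists_id_album_id_index ON songs (artists_id, album_id);
ANALYZE songs;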

The key piece of information Postgres is missing is that artists_id and album_id are strongly correlated. In fact, knowing the album_id uniquely determines the artists_id. Had Postgres known about this, it could have used only the album_id = 1 filter in its estimation and come up with the correct result of 10 Songs.

This kind of correlation can be indicated to Postgres using a dependency statistic. This statistic stores the frequency with which each column uniquely determines the other. A dependency statistic on (artists_id, album_id) might yield the following:

CREATE STATISTICS album_id_artist_id_dep_stt (dependencies)
  ON album_id, artists_id FROM songs;

ANALYZE songs;

SELECT stxname, stxkeys, stxddependencies
FROM pg_statistic_ext JOIN pg_statistic_ext_data ON (oid = stxoid)
WHERE stxname = 'album_id_artist_id_dep_stt';

          stxname           | stxkeys |        stxddependencies
----------------------------+---------+--------------------------------
 album_id_artist_id_dep_stt | 1 5     | {"1 => 5": 0.1, "5 => 1": 1.0}
(1 row)

The 1 and 5 under stxkeys and stxddependencies refer to the 1st and 5th columns of the songs table, which are artists_id and album_id, respectively. The value for "1 => 5" is 0.1 since artists_id determines album_id 10% of the time. The value for "5 => 1" is 1.0 since album_id always determines artists_id. When Postgres is filtering by columns with a matching dependency statistic, it's able to use that statistic to make a more accurate estimate.

There are, of course, other kinds of extended statistics (such as n-distinct counts and multivariate most-common-value lists), but a dependency statistic makes the most sense for this kind of data distribution.

One caveat of extended statistics is that Postgres only knows to use them when filtering on exactly the columns referenced in the statistic, and only for simple equality conditions, e.g. artists_id = 5 but not artists_id IN (5, 6) or artists_id < 10.
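For example, assuming the hypothetical songs data sketched earlier, only the first of these filter shapes is eligible to use the dependency statistic (per the caveat above; behavior can vary between Postgres versions):

-- Both columns, simple equality: the dependency statistic can apply.
EXPLAIN SELECT * FROM songs WHERE artists_id = 1 AND album_id = 1;

-- IN lists and range conditions: estimated with independent per-column
-- statistics instead (per the caveat above).
EXPLAIN SELECT * FROM songs WHERE artists_id = 1 AND album_id IN (1, 2);
EXPLAIN SELECT * FROM songs WHERE artists_id = 1 AND album_id < 10;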

Use of extended statistics can lead to non-intuitive index choices. If a dependency statistic indicates to Postgres that a column filter is redundant, as in the case of artists_id and album_id, it may opt to use an index that only references one of the columns. In the case of songs, it may use an index on only (album_id) instead of an index on (artists_id, album_id) if both are present.

Join Strategies

There are three options Postgres has for joining tables:

  1. Nested Loop Join. Using this join strategy, Postgres loops through each row in the left relation and scans the right relation for rows that satisfy the join condition, ideally using an index. This is an effective strategy when there are very few rows in the left relation.
  2. Merge Join. From the docs: “each relation is sorted on the join attributes before the join starts. Then the two relations are scanned in parallel, and matching rows are combined to form join rows. This kind of join is more attractive because each relation has to be scanned only once. The required sorting might be achieved either by an explicit sort step, or by scanning the relation in the proper order using an index on the join key.”
  3. Hash Join. From the docs: “the right relation is first scanned and loaded into a hash table, using its join attributes as hash keys. Next the left relation is scanned and the appropriate values of every row found are used as hash keys to locate the matching rows in the table.”

For our purposes, the main thing to note here is that the advantage of a Nested Loop Join is that there’s very little overhead compared to the other join strategies. However, this join can go wrong if there are many rows in the left relation. For example, suppose there are 1,000 rows in the left relation and Postgres is using an index to access the right relation. If each index access takes 4ms, the entire join will take 4s, which is too slow in the context of responding to a user request.
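One diagnostic trick (not something this article relies on, just a common way to test the planner's choice) is to temporarily forbid nested loops in a single session and compare the plans. A sketch against the hypothetical songs table from earlier:

-- SET LOCAL only lasts until the end of the transaction, so nothing is
-- changed globally; compare this plan with the one you get without the SET.
BEGIN;
SET LOCAL enable_nestloop = off;
EXPLAIN ANALYZE
SELECT s1.id
FROM songs s1
JOIN songs s2 ON s2.album_id = s1.album_id
WHERE s1.artists_id = 1;
ROLLBACK;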

Applying Statistics In Practice

Now that we understand the different types of joins, let's revisit the Nested Loop Join that struck us as problematic. Without going into too much detail about our data model at Affinity, all you need to know is that on our lists_entries table the column org_id is uniquely determined by list_id, and on our entity_values table it is uniquely determined by entity_attribute_id. That means that, to estimate the selectivity of a set of filters on these columns, the filters should not be considered independently. Our slow queries were the result of Postgres underestimating the number of rows a filter condition would return and opting for a nested loop join because of that underestimation.

Actions Taken

Let’s look back at our original problem query. By far, the most costly step was looping over the index access to entity_values_org_id_entity_attribute_id_company_id_index a whopping 13,769 times.

To encourage the planner to use a different join strategy, we needed to improve its estimates for filters on lists_entries and entity_values. Based on the filters applied, we maxed out the per-column statistics for:

lists_entries:
- org_id
- list_id
entity_values:
- org_id
- entity_attribute_id

among other tables and columns for different query patterns.

We also added dependency statistics on:

lists_entries (list_id, org_id)
entity_values (entity_attribute_id, org_id)

among other dependency statistics for other tables and columns, since both list_id and entity_attribute_id uniquely determine the org_id.
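Concretely, these changes amount to statements along the following lines (the statistic names are ours; the exact DDL isn't shown here):

-- Max out the per-column sampling for the filtered columns...
ALTER TABLE lists_entries ALTER COLUMN org_id SET STATISTICS 10000;
ALTER TABLE lists_entries ALTER COLUMN list_id SET STATISTICS 10000;
ALTER TABLE entity_values ALTER COLUMN org_id SET STATISTICS 10000;
ALTER TABLE entity_values ALTER COLUMN entity_attribute_id SET STATISTICS 10000;

-- ...record that list_id and entity_attribute_id each determine org_id...
CREATE STATISTICS lists_entries_list_id_org_id_dep (dependencies)
  ON list_id, org_id FROM lists_entries;
CREATE STATISTICS entity_values_entity_attribute_id_org_id_dep (dependencies)
  ON entity_attribute_id, org_id FROM entity_values;

-- ...and re-sample so the planner actually sees the new statistics.
ANALYZE lists_entries;
ANALYZE entity_values;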

After we made these adjustments, Postgres chose the following query plan for our original query:

Here, the estimates are much more accurate and the planner opted for a hash join for the inner join — and the query took 42 milliseconds instead of the original 2 minutes.

Conclusion

Increasing the per-column statistics and adding dependency statistics have helped tremendously, but there is still progress to be made. As you may have noticed in the improved query plan, the planner underestimates the number of rows resulting from the inner join. While the outer nested loop join didn’t take long this time, it’s not hard to imagine a query where the inner join results in many rows and the outer join becomes a bottleneck.

We hope this post has given you some ideas about how to improve your query plans, or at the very least taught you something about the magic of Postgres!

