Postgres 12 CTE Enhancements

The Good/Automatic, The Bad/Manual, The Ugly/Refactoring

by Nick Sorrell

When we started work on our custom TMS in 2015, we looked around at the landscape for a database that met the following requirements:

  • Fast and easy dev environment setup. This almost instantly ruled out Microsoft SQL Server at the time. There was no good way to programatically (read Docker/Vagrant) stand up a replica/dev database, and their licensing was beyond our budget at the time.
  • Support for Common Table Expressions (CTEs). This one is small at first blush, but good devs are taught about composability, and reusability, and when you show them CTEs, everything just clicks and makes sense. This ruled out MySQL at the time, as they didn’t support CTEs.
  • Window functions. The Lag/Lead/Rank and Partitioning over columns were more distinct advantages that MySQL lacked. But these are so crucial for readability when you need that functionality.

So, after evaluating those prerequisites, we were left with a clear choice: PostgreSQL. We started out at version 9.5 during development, before they had introduced parallelism, but we were hooked from the start. In fact, without Postgres, I doubt that we would have been able to complete our implementation and data migration (which included an accounting platform migration) in just 9 months.

All of that’s to say I was really excited to see that PG12 now optimizes CTEs out of the box. So I dug right into it, launching a Docker container side-by-side with the exact same specs as my PG10 container.

And to my dismay, PG12 offered no performance improvements out of the box on our heaviest queries. I remembered a post I had read a few months earlier talking about the optimizations and dug in… and I found the NOT MATERIALIZED directive as the key.

After applying this to my CTEs, I saw a 50% reduction in query time. I was floored by this, and started digging deeper into what this was doing.

What I found is that the NOT MATERIALIZED directive means that PG will utilize the existing, underlying tables (and more importantly, their indexes) from within the CTE, rather than creating an unindexed, temporary table (MATERIALIZED) with the results from the CTE. Utilizing the indexes offers a HUGE improvement in query time.

But why did I need to manually add this directive, rather than PG optimizing for it automatically? Well, I was using all of these more than once in my query - I used them both in subsequent CTEs, and in the joins to the main table. By default, PG won’t optimize if a CTE is used more than once. The queries looked something like this:

WITH houses AS (
    SELECT id, street, house_address 
    FROM my_city m
    WHERE m.street = 'My Street Name'
),
fire_hydrants AS (
    SELECT id, street, house_address 
    FROM city_hydrants c
    WHERE c.street IN (SELECT street FROM houses)
)
SELECT id, street, house_address 
FROM telephone_poles t
INNER JOIN houses h
ON h.street = t.street
INNER JOIN fire_hydrants f
ON f.street = t.street;

In the query above, I want to see streets that have houses, telephone poles, and fire hydrants. And because I use the houses CTE twice (in the fire_hydrants CTE and then the subsequent JOIN to telephone_poles), PG12 won’t optimize it, but rather use the previous default of creating a materialized table - meaning it can’t utilize the houses index, which in this case would be street. By adding NOT MATERIALIZED we force PG to take advantage of our underlying indexes.

All of this is to say that with PG12, your mileage may vary (YMMV) per query, and that database tuning hasn’t lost its ‘artform’ even in 2020.