PostgreSQL 14’s enable_memoize For Improved Performance of Nested Loop Joins

I’ve recently discovered a pleasant new addition to PostgreSQL 14: the new enable_memoize flag, which improves the performance of some nested loop joins when statistics hint at this being appropriate. A tweet that made the rounds reported a 1000x query speedup thanks to this feature, and who can resist such a temptation?

Improving query speed by 1000x hints at something very suboptimal having been going on before, and a tool like memoization can be of great help there. But will it also help with an “ordinary” join? I wanted to try it myself.

What’s memoization?

In a perfect world free of side effects (and SQL is such a perfect world, in theory), memoization means that we can substitute y for f(x) in any computation, given that y = f(x). For example, no matter how many times you calculate UPPER('x'), you’ll always get 'X'. If the calculation of such a function is costly, and there are only a few possible input values, then why not just maintain a hash map of all previous input/output pairs, and use it to look up known (or at least frequent) values instead of computing them again?
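As a purely illustrative sketch of the idea in SQL (inputs is a hypothetical table, and expensive_f a hypothetical costly function, neither of which ships with PostgreSQL), we could hand-roll such a memo by evaluating the function only once per distinct input, then joining the results back to the full input set:

WITH distinct_inputs AS MATERIALIZED (
  -- Deduplicate first, so the costly function runs once per value
  SELECT DISTINCT x
  FROM inputs
),
memo AS MATERIALIZED (
  -- expensive_f is a hypothetical costly (but deterministic) function
  SELECT x, expensive_f(x) AS y
  FROM distinct_inputs
)
SELECT i.x, memo.y
FROM inputs AS i
JOIN memo USING (x);

This is exactly the kind of work that memoization does transparently, inside the execution plan.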

As I’ve shown previously on this blog, Oracle 11g introduced a feature called scalar subquery caching, which you can activate in jOOQ to avoid costly PL/SQL context switches.

In the case of PostgreSQL’s enable_memoize, this caching can be particularly useful for nested loop joins in SQL, and as the tweet mentioned above suggests, a lateral join is often executed via a nested loop join.

Turning the feature on and off

I’ve created a schema like this:

CREATE TABLE t AS
SELECT i, i % 5 AS j 
FROM generate_series(1, 100000) AS t(i);

CREATE TABLE u AS
SELECT i, i % 20000 as j 
FROM generate_series(1, 100000) AS t(i);

CREATE INDEX uj ON u(j);
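Since the planner’s memoization decision depends on statistics (in particular, on the estimated number of distinct values), it may help to gather statistics explicitly after loading the data, rather than waiting for autovacuum:

-- Make sure the planner has up to date statistics on both tables
ANALYZE t;
ANALYZE u;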

In summary (see the sanity check queries after this list):

  • Both tables t and u have 100000 rows.
  • t.j has only 5 distinct values, each value appears 20000 times.
  • u.j has 20000 distinct values, each value appears 5 times.
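If you want to double check those cardinalities yourself, a quick sanity check could look like this:

-- Expecting 5 distinct values in t.j, and 20000 in u.j
SELECT count(DISTINCT j) FROM t;
SELECT count(DISTINCT j) FROM u;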

When running this on PostgreSQL 14:

SELECT current_setting('enable_memoize');

I get:

|current_setting|
|---------------|
|on             |

So, the feature is active, which I can also see in an EXPLAIN of the following query:

EXPLAIN
SELECT *
FROM t JOIN u ON t.j = u.j;

The plan is:

|QUERY PLAN                                                            |
|----------------------------------------------------------------------|
|Nested Loop  (cost=0.30..8945.41 rows=496032 width=16)                |
|  ->  Seq Scan on t  (cost=0.00..1443.00 rows=100000 width=8)         |
|  ->  Memoize  (cost=0.30..0.41 rows=5 width=8)                       |
|        Cache Key: t.j                                                |
|        ->  Index Scan using uj on u  (cost=0.29..0.40 rows=5 width=8)|
|              Index Cond: (j = t.j)                                   |

Without memoization, when joining the two tables like that, for each of the 100000 rows in t, I have to look up the 5 matching rows in u, i.e. 100000 index lookups in total. But if memoization kicks in, then I have to perform the lookup only 5 times, because there are only 5 distinct values of t.j.
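To watch the cache at work, we can use EXPLAIN ANALYZE, where the Memoize node reports its cache hits, misses, evictions, and memory usage:

EXPLAIN ANALYZE
SELECT *
FROM t JOIN u ON t.j = u.j;

Given the above value distribution, I’d expect the Memoize node to report something like Hits: 99995 and Misses: 5, i.e. only 5 actual index scans on u.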

We can play around with execution plans by turning the feature on or off:

SET enable_memoize = ON;
SET enable_memoize = OFF;
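Note that SET changes the setting for the rest of the session. If you only want to experiment within a single transaction, SET LOCAL reverts the change automatically at COMMIT or ROLLBACK, e.g.:

BEGIN;

-- Affects only the current transaction
SET LOCAL enable_memoize = OFF;

EXPLAIN
SELECT *
FROM t JOIN u ON t.j = u.j;

ROLLBACK;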

When turned off, PostgreSQL seems to choose a hash join or a merge join instead, on my machine (between multiple executions, the plan might switch):

|QUERY PLAN                                                         |
|-------------------------------------------------------------------|
|Hash Join  (cost=3084.00..11568.51 rows=499351 width=16)           |
|  Hash Cond: (t.j = u.j)                                           |
|  ->  Seq Scan on t  (cost=0.00..1443.00 rows=100000 width=8)      |
|  ->  Hash  (cost=1443.00..1443.00 rows=100000 width=8)            |
|        ->  Seq Scan on u  (cost=0.00..1443.00 rows=100000 width=8)|


|QUERY PLAN                                                              |
|------------------------------------------------------------------------|
|Merge Join  (cost=9748.11..763846.11 rows=50000000 width=16)            |
|  Merge Cond: (u.j = t.j)                                               |
|  ->  Index Scan using uj on u  (cost=0.29..3848.29 rows=100000 width=8)|
|  ->  Sort  (cost=9747.82..9997.82 rows=100000 width=8)                 |
|        Sort Key: t.j                                                   |
|        ->  Seq Scan on t  (cost=0.00..1443.00 rows=100000 width=8)     |

Let’s Benchmark

We’re using the usual benchmark technique from previous posts on this blog:

  • We repeat an operation 25x in mode A and mode B and compare (or more than 25, if it’s a fast operation)
  • We repeat the above 5x to mitigate any warmup and other caching effects

You can run the following benchmark on the above schema yourself, to verify:

DO $$
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT INT := 25;
  rec RECORD;
BEGIN

  -- Repeat the whole benchmark several times to avoid warmup penalty
  FOR r IN 1..5 LOOP
    v_ts := clock_timestamp();
    SET enable_memoize = OFF;
  
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT t.*
        FROM t JOIN u ON t.j = u.j
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    RAISE INFO 'Run %, Statement 1: %', r, (clock_timestamp() - v_ts);
    v_ts := clock_timestamp();
    SET enable_memoize = ON;

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT t.*
        FROM t JOIN u ON t.j = u.j
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    RAISE INFO 'Run %, Statement 2: %', r, (clock_timestamp() - v_ts);
    RAISE INFO '';
  END LOOP;
END$$;

On my machine, the results are consistent, decent, not too impressive, but still significant:

Run 1, Statement 1: 00:00:03.763426
Run 1, Statement 2: 00:00:03.401346

Run 2, Statement 1: 00:00:03.769419
Run 2, Statement 2: 00:00:03.375677

Run 3, Statement 1: 00:00:03.771465
Run 3, Statement 2: 00:00:03.374413

Run 4, Statement 1: 00:00:03.769136
Run 4, Statement 2: 00:00:03.398734

Run 5, Statement 1: 00:00:03.772544
Run 5, Statement 2: 00:00:03.375272

That’s roughly a 10% speedup (about 3.77 vs. 3.39 seconds per run). Across the whole system, that alone would already be worth it.

Optimising LATERAL

Let’s try optimising LATERAL instead. We could run a query like this:

SELECT *
FROM 
  t, 
  LATERAL (
    SELECT count(*) 
    FROM u 
    WHERE t.j = u.j
  ) AS u(j);

The EXPLAIN of the above is:

|QUERY PLAN                                                                       |
|---------------------------------------------------------------------------------|
|Nested Loop  (cost=4.40..3969.47 rows=100000 width=16)                           |
|  ->  Seq Scan on t  (cost=0.00..1443.00 rows=100000 width=8)                    |
|  ->  Memoize  (cost=4.40..4.42 rows=1 width=8)                                  |
|        Cache Key: t.j                                                           |
|        ->  Aggregate  (cost=4.39..4.40 rows=1 width=8)                          |
|              ->  Index Only Scan using uj on u  (cost=0.29..4.38 rows=5 width=0)|
|                    Index Cond: (j = t.j)                                        |

So, we can again cache the computation of the COUNT(*) value for each of the 5 distinct t.j input values, rather than re-calculating it every time. Surely, this must be even better than before?

Benchmark time!

DO $$
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT INT := 25;
  rec RECORD;
BEGIN

  -- Repeat the whole benchmark several times to avoid warmup penalty
  FOR r IN 1..5 LOOP
    v_ts := clock_timestamp();
    SET enable_memoize = OFF;
  
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT *
        FROM 
          t, 
          LATERAL (
            SELECT count(*) 
            FROM u 
            WHERE t.j = u.j
          ) AS u(j)
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    RAISE INFO 'Run %, Statement 1: %', r, (clock_timestamp() - v_ts);
    v_ts := clock_timestamp();
    SET enable_memoize = ON;

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT *
        FROM 
          t, 
          LATERAL (
            SELECT count(*) 
            FROM u 
            WHERE t.j = u.j
          ) AS u(j)
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    RAISE INFO 'Run %, Statement 2: %', r, (clock_timestamp() - v_ts);
    RAISE INFO '';
  END LOOP;
END$$;

And this time, we can see a significant speedup!

Run 1, Statement 1: 00:00:03.419728
Run 1, Statement 2: 00:00:01.083941

Run 2, Statement 1: 00:00:03.404954
Run 2, Statement 2: 00:00:01.098404

Run 3, Statement 1: 00:00:03.425725
Run 3, Statement 2: 00:00:01.093883

Run 4, Statement 1: 00:00:03.441691
Run 4, Statement 2: 00:00:01.127837

Run 5, Statement 1: 00:00:03.420172
Run 5, Statement 2: 00:00:01.097943

That’s great news! But wait, does this also work for ordinary correlated subqueries? Because the above LATERAL correlated subquery could be rewritten as:

SELECT 
  t.*, 
  (
    SELECT count(*)
    FROM u
    WHERE t.j = u.j
  ) j
FROM t;

Regrettably, the plan doesn’t show memoization:

|QUERY PLAN                                                                   |
|-----------------------------------------------------------------------------|
|Seq Scan on t  (cost=0.00..441693.00 rows=100000 width=16)                   |
|  SubPlan 1                                                                  |
|    ->  Aggregate  (cost=4.39..4.40 rows=1 width=8)                          |
|          ->  Index Only Scan using uj on u  (cost=0.29..4.38 rows=5 width=0)|
|                Index Cond: (j = t.j)                                        |

And the benchmark confirms that there’s no memoization effect.
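To reproduce this, simply paste this query into both FOR rec IN (...) loops of the benchmark logic above:

FOR rec IN (
  SELECT 
    t.*, 
    (
      SELECT count(*)
      FROM u
      WHERE t.j = u.j
    ) j
  FROM t
) LOOP
  NULL;
END LOOP;

The results on my machine: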

Run 1, Statement 1: 00:00:03.617562
Run 1, Statement 2: 00:00:03.605765

Run 2, Statement 1: 00:00:03.610084
Run 2, Statement 2: 00:00:03.682064

Run 3, Statement 1: 00:00:03.725952
Run 3, Statement 2: 00:00:03.705622

Run 4, Statement 1: 00:00:03.672669
Run 4, Statement 2: 00:00:03.644612

Run 5, Statement 1: 00:00:03.645741
Run 5, Statement 2: 00:00:03.642717

It seems that with this new feature, the optimiser could rewrite correlated subqueries to nested loop outer joins in the future? Other optimisers already do this, and we would then effectively have the same feature here as Oracle’s scalar subquery caching.
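Until then, nothing stops us from doing that rewrite by hand. Here’s a sketch of turning the above scalar subquery into a LEFT JOIN LATERAL that profits from memoization today (equivalent in this case, because the COUNT(*) subquery always returns exactly one row):

SELECT t.*, u.j
FROM t
LEFT JOIN LATERAL (
  SELECT count(*) 
  FROM u 
  WHERE t.j = u.j
) AS u(j) ON true;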

Conclusion

The feature is turned on by default in PostgreSQL 14. Apart from some additional memory consumption, which might be a small problem if the optimiser is wrong and statistics are off, I don’t see any drawback to this new feature. SQL is a (theoretically) side-effect free 4GL, meaning the optimiser can replace any computation by a cached value that depends only on the computation’s input values.

A correlated subquery is a function whose input parameters are the predicates and other references to the outer query’s columns. As such, the result of a correlated subquery can be cached, or memoized. As shown above, this can have drastic effects on your present-day SQL queries, from which you can profit just by upgrading to PostgreSQL 14.
