How to Calculate Multiple Aggregate Functions in a Single Query

At a customer site, I’ve recently encountered a report where a programmer needed to count quite a bit of stuff from a single table. The counts all differed in the way they used specific predicates. The report looked roughly like this (as always, I’m using the Sakila database for illustration):

-- Total number of films
SELECT count(*)
FROM film

-- Number of films with a given length
SELECT count(*)
FROM film
WHERE length BETWEEN 120 AND 150

-- Number of films with a given language
SELECT count(*)
FROM film
WHERE language_id = 1

-- Number of films for a given rating
SELECT count(*)
FROM film
WHERE rating = 'PG'

And then, unsurprisingly, combinations of these predicates were needed as well, i.e.

-- Number of films with a given length / language_id
SELECT count(*)
FROM film
WHERE length BETWEEN 120 AND 150
AND language_id = 1

-- Number of films with a given length / rating
SELECT count(*)
FROM film
WHERE length BETWEEN 120 AND 150
AND rating = 'PG'

-- Number of films with a given language_id / rating
SELECT count(*)
FROM film
WHERE language_id = 1
AND rating = 'PG'

-- Number of films with a given length / language_id / rating
SELECT count(*)
FROM film
WHERE length BETWEEN 120 AND 150
AND language_id = 1
AND rating = 'PG'

In the end, there were 32 queries in total (or 8 in my example) with all the possible combinations of predicates. Needless to say that running them all took quite a while, because the table had around 200M records and only one predicate could profit from an index.

But in fact, the improvement is really easy. There are several options to calculate all these counts in a single query

Simplest solution works in all databases: Filtered aggregate functions (or manual pivot)

This solution allows for calculating all results in a single query by using 8 different, explicit, filtered aggregate functions and no GROUP BY clause (none in this example. More complex cases where GROUP BY persists are sill imaginable).

This is how it works on all databases:

SELECT 
  count(*),
  count(length),
  count(language_id),
  count(rating),
  count(length + language_id),
  count(length + rating),
  count(language_id + rating),
  count(length + language_id + rating)
FROM (
  SELECT
    CASE WHEN length BETWEEN 120 AND 150 THEN 1 END length,
    CASE WHEN language_id = 1            THEN 1 END language_id,
    CASE WHEN rating = 'PG'              THEN 1 END rating
  FROM film
) film

Which yields:

col1  col2  col3  col4  col5  col6  col7  col8
1000  224   1000  194   224   43    194   43

How to read the above query?

Instead of evaluating the three different predicates in a WHERE clause, we pre-calculate it in a derived table (subquery in the FROM clause) and translate the predicate in some random value (e.g. 1) if TRUE and NULL if FALSE. Note, I omitted the ELSE clause from the CASE expression, which means that we get NULLs per default. Running the nested select on its own…

SELECT
  CASE WHEN length BETWEEN 120 AND 150 THEN 1 END length,
  CASE WHEN language_id = 1            THEN 1 END language_id,
  CASE WHEN rating = 'PG'              THEN 1 END rating
FROM film

… yields something along the lines of:

length  language_id  rating
---------------------------
NULL    1            1
NULL    1            NULL
NULL    1            NULL
NULL    1            NULL
1       1            NULL
NULL    1            1
NULL    1            NULL
...

(Note, of course, we could have used actual BOOLEAN types, e.g. in PostgreSQL, but that wouldn’t work on all databases)

Now, in the outer query, we’re using once COUNT(*), which simply counts all the rows regardless of any predicates in the CASE expressions. The other COUNT(expr) aggregate functions do something that surprisingly few people are aware of (yet a lot of people use this form “by accident”). They count only the number of non-NULL rows. For instance:

SELECT 
  ...
  count(length),
  ...
FROM (
  SELECT
    CASE WHEN length BETWEEN 120 AND 150 THEN 1 END length,
    ...
  FROM film
) film

Or also:

SELECT 
  count(CASE WHEN length BETWEEN 120 AND 150 THEN 1 END)
FROM
  film

These queries will count those films whose length is BETWEEN 120 AND 150 (because those rows produce the value 1, which is non-NULL, and thus counted), whereas all the other films are not being counted.

Finally, I just used a trick to combine nullable values to make sure they’re all non-NULL:

SELECT 
  ...
  count(length + language_id),
FROM (
  SELECT
    CASE WHEN length BETWEEN 120 AND 150 THEN 1 END length,
    CASE WHEN language_id = 1            THEN 1 END language_id,
    ...
  FROM film
) film

This counts those rows whose length BETWEEN 120 AND 150 and whose language_id = 1, because if either predicate was FALSE, the number would be NULL and thus the sum is NULL as well.

PostgreSQL and HSQLDB variant: FILTER

In PostgreSQL and HSQLDB (and in the SQL standard), there’s a special syntax for this. We can use the FILTER clause instead of encoding values in NULL / non-NULL like this:

SELECT 
  count(*),
  count(*) FILTER (WHERE length IS NOT NULL),
  count(*) FILTER (WHERE language_id IS NOT NULL),
  count(*) FILTER (WHERE rating IS NOT NULL),
  count(*) FILTER (WHERE length + language_id IS NOT NULL),
  count(*) FILTER (WHERE length + rating IS NOT NULL),
  count(*) FILTER (WHERE language_id + rating IS NOT NULL),
  count(*) FILTER (
    WHERE length + language_id + rating IS NOT NULL)
FROM (
  SELECT
    CASE WHEN length BETWEEN 120 AND 150 THEN 1 END length,
    CASE WHEN language_id = 1            THEN 1 END language_id,
    CASE WHEN rating = 'PG'              THEN 1 END rating
  FROM film
) film

Or even, writing out the entire predicates again:

SELECT 
  count(*),
  count(*) FILTER (WHERE length BETWEEN 120 AND 150),
  count(*) FILTER (WHERE language_id = 1),
  count(*) FILTER (WHERE rating = 'PG'),
  count(*) FILTER (
    WHERE length BETWEEN 120 AND 150 AND language_id = 1),
  count(*) FILTER (
    WHERE length BETWEEN 120 AND 150 AND rating = 'PG'),
  count(*) FILTER (
    WHERE language_id = 1 AND rating = 'PG'),
  count(*) FILTER (
    WHERE length BETWEEN 120 AND 150 
    AND language_id = 1 AND rating = 'PG')
FROM film

Usually, the FILTER clause is more convenient, but both approaches are equivalent, and we’re running only a single query!

I also call this “manual PIVOT“, because it really works like a PIVOT table. And the good news is… There is a PIVOT syntax!

A more fancy solution: PIVOT

This solution is vendor-specific and only works in Oracle and with a bit less features in SQL Server. Here’s the Oracle version:

SELECT 
  a + b + c + d + e + f + g + h,
                  e + f + g + h,
          c + d         + g + h,
      b     + d     + f     + h,
                          g + h,
                      f     + h,
              d             + h,
                              h
FROM (
  SELECT
    CASE WHEN length BETWEEN 120 AND 150 
         THEN 1 ELSE 0 END length,
    CASE WHEN language_id = 1            
         THEN 1 ELSE 0 END language_id,
    CASE WHEN rating = 'PG'              
         THEN 1 ELSE 0 END rating
  FROM film
) film
PIVOT (
  count(*) FOR (length, language_id, rating) IN (
    (0, 0, 0) AS a,
    (0, 0, 1) AS b,
    (0, 1, 0) AS c,
    (0, 1, 1) AS d,
    (1, 0, 0) AS e,
    (1, 0, 1) AS f,
    (1, 1, 0) AS g,
    (1, 1, 1) AS h
  )
)

How to read this solution? There are 3 steps:

Step 1: The derived table

As in the previous example, we’re translating the desired predicates for our report into three columns that produce values 1 and 0. That’s understood so I won’t repeat the explanation.

Step 2: The PIVOT clause

The PIVOT clause can be applied to a table expression to “pivot” it in a similar way as we’re used from Microsoft Excel’s powerful pivot tables. It takes three parts:

  • A list of aggregate functions
  • An expression (FOR clause)
  • A list of expected values (IN clause)

The resulting table expression groups the PIVOT‘s input table by all the remaining columns (i.e. all the columns that are not part of the FOR clause, in our example, that’s no columns), and aggregates all the aggregate functions (in our case, only one) for all the values in the IN list.

If we SELECT * from this PIVOT table:

SELECT *
FROM (
  SELECT
    CASE WHEN length BETWEEN 120 AND 150 
         THEN 1 ELSE 0 END length,
    CASE WHEN language_id = 1            
         THEN 1 ELSE 0 END language_id,
    CASE WHEN rating = 'PG'              
         THEN 1 ELSE 0 END rating
  FROM film
) film
PIVOT (
  count(*) FOR (length, language_id, rating) IN (
    (0, 0, 0) AS a,
    (0, 0, 1) AS b,
    (0, 1, 0) AS c,
    (0, 1, 1) AS d,
    (1, 0, 0) AS e,
    (1, 0, 1) AS f,
    (1, 1, 0) AS g,
    (1, 1, 1) AS h
  )
)

… we’ll get these values:

a    b    c    d    e    f    g    h
------------------------------------
0    0  625  151    0    0  181   43

As you can see, the column names are generated from the IN list of expected values and the values contained in these columns are aggregations for the different predicates. These aggregations are not exactly the ones we wanted. For instance, column G is all the films whose length BETWEEN 120 AND 150 and whose language_id = 1 and whose RATING != 'PG'.

Step 3: Summing the count values

So, in order to get the expected results, we have to sum all the partial counts as such:

SELECT 
  a + b + c + d + e + f + g + h,
                  e + f + g + h,
          c + d         + g + h,
      b     + d     + f     + h,
                          g + h,
                      f     + h,
              d             + h,
                              h
FROM 
  ...

The result is now the same.

A more fancy solution: GROUPING SETS

GROUPING SETS are a SQL standard and they’re supported in at least:

  • DB2
  • HANA
  • Oracle
  • PostgreSQL
  • SQL Server
  • Sybase SQL Anywhere

Simply put, GROUPING SETS allow for grouping a table several times and creating a UNION of all the results. For example, the following two queries are the same, conceptually, although the GROUPING SETS one is usually faster:

-- Grouping once by language_id, then by rating
SELECT language_id, rating, count(*)
FROM film
GROUP BY GROUPING SETS (
  (language_id),
  (rating)
)

-- Grouping first by language_id
SELECT language_id, NULL, count(*)
FROM film
GROUP BY language_id
UNION ALL
SELECT NULL, rating, count(*)
FROM film
GROUP BY rating

Both queries yield:

language_id   rating   count
          1             1000 -- First grouping set / union subquery
              G          178 \
              PG         194  |
              PG-13      223  | Second grouping set / union subquery
              R          195  |
              NC-17      210 /

Clearly, the GROUPING SETS variant is more concise. Let’s imagine, we’d like to add more combinations of grouping columns, e.g.

SELECT language_id, rating, count(*)
FROM film
GROUP BY GROUPING SETS (
  (),
  (language_id),
  (rating),
  (language_id, rating)
)

Now, we’re grouping by all the combinations of columns, and the result is:

language_id   rating   count
                        1000 -- First grouping set: ()
          1             1000 -- Second grouping set: (language_id)
              G          178 \
              PG         194  |
              PG-13      223  | Third grouping set: (rating)
              R          195  |
              NC-17      210 /
          1   G          178 \
          1   PG         194  |
          1   PG-13      223  | Fourth grouping set: (language_id, rating)
          1   R          195  |
          1   NC-17      210 /

Of course, this would all be more impressive if we had more than one language in the system…

So, how do we solve the original problem with GROUPING SETS? Here’s how:

SELECT 
  GROUPING_ID (length, language_id, rating),
  length,
  language_id,
  rating,
  count(*)
FROM (
  SELECT
    CASE WHEN length BETWEEN 120 AND 150 
         THEN 1 ELSE 0 END length,
    CASE WHEN language_id = 1            
         THEN 1 ELSE 0 END language_id,
    CASE WHEN rating = 'PG'              
         THEN 1 ELSE 0 END rating
  FROM film
) film
GROUP BY GROUPING SETS (
  (),
  (length),
  (language_id),
  (rating),
  (length, language_id),
  (length, rating),
  (rating, language_id),
  (length, language_id, rating)
)
HAVING COALESCE (length, 1) != 0 
AND COALESCE (language_id, 1) != 0 
AND COALESCE (rating, 1) != 0
ORDER BY GROUPING_ID (length, language_id, rating) DESC

Wow. How to read this? In 4 steps:

Step 1: Again, the derived table

This time, we’ll encode FALSE as 0, not NULL, because NULL already has a different meaning in GROUPING SETS. It means that for a given GROUPING SET, we didn’t group by that column. We’ll see that in step 3.

Step 2: The GROUPING SETS

In this section, we’re just listing all the possible combinations of GROUP BY columns that we want to use, which produces 8 distinct GROUPING SETS. I’ve already explained this in the previous introduction to GROUPING SETS, so this is no different.

Step 3: Filter out unwanted groupings

Just like in the PIVOT example, we’re also getting results for which the predicates are FALSE, but we don’t want those in the result. So we’re filtering them out in the HAVING clause:

SELECT 
  ...
HAVING COALESCE (length, 1) != 0 
AND COALESCE (language_id, 1) != 0 
AND COALESCE (rating, 1) != 0
...

How to read this? E.g. LENGTH can be any of:

  • 1: The length predicate was TRUE
  • 0: The length predicate was FALSE
  • NULL: The length column is not considered for a given GROUPING SET, e.g. () or (rating, language_id)

So, using COALESCE, we’re making sure that we include only 1 and NULL lengths, not 0 lengths.

Step 4: Ordering the results

This is optional, but in order to get the same output order as before, we can use the special GROUPING_ID() (or GROUPING() depending on the DB) function which returns an ID for each GROUPING SET. The output is:

grouping   length   language_id   rating   count
------------------------------------------------
       7     NULL          NULL     NULL    1000
       6     NULL          NULL        1     194
       5     NULL             1     NULL    1000
       4     NULL             1        1     194
       3        1          NULL     NULL     224
       2        1          NULL        1      43
       1        1             1     NULL     224
       0        1             1        1      43

Excellent! And hey, there’s even syntax sugar for “special” GROUPING SETS configurations like ours, where we list all the possible column permutations. In this case, we can use CUBE()!

SELECT 
  GROUPING_ID (length, language_id, rating),
  length,
  language_id,
  rating,
  count(*)
FROM (
  SELECT
    CASE WHEN length BETWEEN 120 AND 150 
         THEN 1 ELSE 0 END length,
    CASE WHEN language_id = 1            
         THEN 1 ELSE 0 END language_id,
    CASE WHEN rating = 'PG'              
         THEN 1 ELSE 0 END rating
  FROM film
) film
GROUP BY CUBE (length, language_id, rating)
HAVING COALESCE(length, 1) != 0 
AND COALESCE(language_id, 1) != 0 
AND COALESCE(rating, 1) != 0
ORDER BY GROUPING_ID (length, language_id, rating) DESC

Performance

Such a comparison blog post wouldn’t be complete if we wouldn’t benchmark for performance. This time, I’ll be benchmarking only for Oracle, as PostgreSQL doesn’t support PIVOT and SQL Server’s PIVOT is more limited than Oracle’s.

Here’s the complete benchmark:

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP WITH TIME ZONE;
  v_repeat CONSTANT NUMBER := 2000;
BEGIN

  -- Repeat the whole benchmark several times to avoid warmup penalty
  FOR r IN 1..5 LOOP

    -- Individual statements
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT count(*) FROM film
      ) LOOP
        NULL;
      END LOOP;

      FOR rec IN (
        SELECT count(*) FROM film 
        WHERE length BETWEEN 120 AND 150
      ) LOOP
        NULL;
      END LOOP;

      FOR rec IN (
        SELECT count(*) FROM film 
        WHERE language_id = 1
      ) LOOP
        NULL;
      END LOOP;

      FOR rec IN (
        SELECT count(*) FROM film 
        WHERE rating = 'PG'
      ) LOOP
        NULL;
      END LOOP;

      FOR rec IN (
        SELECT count(*) FROM film 
        WHERE length BETWEEN 120 AND 150
        AND language_id = 1
      ) LOOP
        NULL;
      END LOOP;

      FOR rec IN (
        SELECT count(*) FROM film 
        WHERE length BETWEEN 120 AND 150
        AND rating = 'PG'
      ) LOOP
        NULL;
      END LOOP;

      FOR rec IN (
        SELECT count(*) FROM film 
        WHERE language_id = 1
        AND rating = 'PG'
      ) LOOP
        NULL;
      END LOOP;

      FOR rec IN (
        SELECT count(*) FROM film 
        WHERE length BETWEEN 120 AND 150
        AND language_id = 1
        AND rating = 'PG'
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
      
    dbms_output.put_line('Run ' || r ||', Statement 2 : ' || (SYSTIMESTAMP - v_ts));

    -- Manual PIVOT
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT
          count(*),
          count(length),
          count(language_id),
          count(rating),
          count(length + language_id),
          count(length + rating),
          count(language_id + rating),
          count(length + language_id + rating)
        FROM (
          SELECT
            CASE WHEN length BETWEEN 120 AND 150 THEN 1 END length,
            CASE WHEN language_id = 1            THEN 1 END language_id,
            CASE WHEN rating = 'PG'              THEN 1 END rating
          FROM film
        ) film
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
      
    dbms_output.put_line('Run ' || r ||', Statement 2 : ' || (SYSTIMESTAMP - v_ts));
    
    -- PIVOT
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT 
          a + b + c + d + e + f + g + h,
                          e + f + g + h,
                  c + d         + g + h,
              b     + d     + f     + h,
                                  g + h,
                              f     + h,
                      d             + h,
                                      h
        FROM (
          SELECT
            CASE WHEN length BETWEEN 120 AND 150 THEN 1 ELSE 0 END length,
            CASE WHEN language_id = 1            THEN 1 ELSE 0 END language_id,
            CASE WHEN rating = 'PG'              THEN 1 ELSE 0 END rating
          FROM film
        ) film
        PIVOT (
          count(*) FOR (length, language_id, rating) IN (
            (0, 0, 0) AS a,
            (0, 0, 1) AS b,
            (0, 1, 0) AS c,
            (0, 1, 1) AS d,
            (1, 0, 0) AS e,
            (1, 0, 1) AS f,
            (1, 1, 0) AS g,
            (1, 1, 1) AS h
          )
        )
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
      
    dbms_output.put_line('Run ' || r ||', Statement 3 : ' || (SYSTIMESTAMP - v_ts));

    -- GROUPING SETS
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT 
          GROUPING_ID (length, language_id, rating),
          length,
          language_id,
          rating,
          count(*)
        FROM (
          SELECT
            CASE WHEN length BETWEEN 120 AND 150 THEN 1 ELSE 0 END length,
            CASE WHEN language_id = 1            THEN 1 ELSE 0 END language_id,
            CASE WHEN rating = 'PG'              THEN 1 ELSE 0 END rating
          FROM film
        ) film
        GROUP BY CUBE (length, language_id, rating)
        HAVING COALESCE (length, 1) != 0 
        AND COALESCE (language_id, 1) != 0 
        AND COALESCE (rating, 1) != 0
        ORDER BY GROUPING_ID (length, language_id, rating) DESC
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
      
    dbms_output.put_line('Run ' || r ||', Statement 4 : ' || (SYSTIMESTAMP - v_ts));
  END LOOP;
END;
/

And the results:

Run 1, Statement 1 : +000000000 00:00:01.928497000
Run 1, Statement 2 : +000000000 00:00:01.136341000
Run 1, Statement 3 : +000000000 00:00:02.751679000
Run 1, Statement 4 : +000000000 00:00:00.797529000

Run 2, Statement 1 : +000000000 00:00:01.695543000
Run 2, Statement 2 : +000000000 00:00:01.004073000
Run 2, Statement 3 : +000000000 00:00:02.490895000
Run 2, Statement 4 : +000000000 00:00:00.838979000

Run 3, Statement 1 : +000000000 00:00:01.634047000
Run 3, Statement 2 : +000000000 00:00:01.016266000
Run 3, Statement 3 : +000000000 00:00:02.566895000
Run 3, Statement 4 : +000000000 00:00:00.790159000

Run 4, Statement 1 : +000000000 00:00:01.669844000
Run 4, Statement 2 : +000000000 00:00:01.015502000
Run 4, Statement 3 : +000000000 00:00:02.574646000
Run 4, Statement 4 : +000000000 00:00:00.807804000

Run 5, Statement 1 : +000000000 00:00:01.653498000
Run 5, Statement 2 : +000000000 00:00:00.980375000
Run 5, Statement 3 : +000000000 00:00:02.556186000
Run 5, Statement 4 : +000000000 00:00:00.890283000

Very disappointingly, the PIVOT solution is the slowest every time. I’m assuming there’s some substantial temporary object overhead which wouldn’t be as severe if the table were much larger, but clearly, the manual PIVOT solution (COUNT(CASE ...)) and the GROUPING SETS solution heavily outperform the initial attempt, where we calculate 8 counts individually.

To get back to the original report where 32 counts were calculated: The report was roughly 20x as fast with manual PIVOT on 200M rows and imagine if you need to JOIN – you definitely want to avoid those 32 individual queries and calculate everything in one go.

Cheers!

jOOQ Tuesdays: Richard North Makes Database Testing More Reproducible with Testcontainers

Welcome to the jOOQ Tuesdays series. In this series, we’ll publish an article on the third Tuesday every other month where we interview someone we find exciting in our industry from a jOOQ perspective. This includes people who work with SQL, Java, Open Source, and a variety of other related topics.

I’m very excited to feature today Richard North, creator of Testcontainers, a promising new database and UI testing tool for reproducible tests.

Hi Richard – you work at Skyscanner, which means you have tons of travel data to work with. What’s the most exciting thing about working with this data?

Skyscanner is a really data-led company, using data at all levels in decision-making. The volume of data we gather and process really helps us understand what travellers want and serve them better. For example, our destination recommender system helps with people discover new and interesting places to go, based upon a vast amount of data, algorithms and experiments. But it’s interesting how this varies from recommendations in internet media companies – there are far fewer possible destinations than books, songs and movies, yet users’ reasons for travelling and tastes can be more nuanced and varied.

There’s yet more data-oriented work under the surface, too – for example, the infrastructure needed to gather and analyse such large amounts of data, and the running of experiments to help us improve. There are a lot of smart people working hard to make this all happen, and it’s an exciting place to be!

You’ve created an increasingly popular testing framework, Testcontainers. What made you do it? What itch does it scratch?

Well, I think it’s something that scratches several itches at the same time – things that previously we only had isolated solutions to. The common element, though, is reproducibility of test environments. Of all my time developing and writing tests for JVM-based systems, it’s always been the non-JVM dependencies that caused the most complexity, unreliability and maintenance overhead.

I remember my first day as a developer, years ago: I was given a desktop machine and 2 days’ worth of step-by-step instructions that I needed to follow – just so that I’d be able to develop and run tests with all dependencies in place. A few months later I had to repeat the same task many times over when building new CI servers.

A lot has changed since then in terms of how we deploy and manage our production infrastructure, and thankfully Docker has done a lot to further bring prod-parity to developers’ machines.

Testcontainers started out as my effort to bring the full power of Docker to integrated testing on the JVM in two areas that I’ve experienced the most pain: testing against a clean, representative database, and making browser-based selenium testing more reproducible, both for developers and on CI.

Mostly being curious about testing databases, your documentation mentions Testcontainers as an alternative to using H2 as a test database. What are the disadvantages of emulating a database e.g. with H2? Did you make any personal experience with that?

Yes, definitely – it was one of the tipping point factors that triggered me to create Testcontainers. I do think H2 is a fantastic piece of work in what it manages to deliver, and it’s something I’ve used on a number of projects to good effect.

However, compatibility with real databases has often been a sticking point. Back in 2015, before I started Testcontainers, we were struggling with a few MySQL features that didn’t have equivalents in H2. We were facing the unpleasant prospect of having to constrain our implementation to what H2 would allow us to test against. It became fairly obvious that there was a gap in the market for an H2-like tool that was actually a facade to a Docker-based database container – and Testcontainers was born.

How do you think of mocking the database at any layer, including the DAO layer, service layer, etc.?

I’m all in favour of keeping tests small, light and layered, and using mocks to accomplish this. This might sound strange coming from somebody who has developed an integrated testing tool, but it’s true!

Still, I feel that we need to be pragmatic about how we approach automated tests and how we make sure we’re testing the right thing – especially when crossing boundaries. Are we testing how this code behaves against reality, or are we testing against our own (potentially false) understanding of how external components work?

My feeling is that it’s quite straightforward to mock layers of your system that you yourself wrote, or where you can easily jump into the source code, a spec or documentation. With an external component, you can still produce a mock that behaves how you expect, or how you witness the real thing behaving. But does that mock continue to represent the real thing, especially after accretion of other features, or the additional perils of state that a database entails – schema changes and actual data?

My ideal is to mock the data access layer for consumption by higher layers, but to be quite careful about what the data access layer itself talks to in my tests. It should probably be a real database. Hopefully Testcontainers is one tool that helps make this particular thing a little less painful – so that when you find yourself needing to do this, there’s a way to do it easily.

What’s the biggest challenge you’ve faced when testing databases, or other things?

It’s not databases, but I’d say that by far the hardest testing challenge I’ve faced as a developer is mobile apps, especially iOS. I’ve always enjoyed mobile development  as a whole, but when switching from a Java server-side/web project to mobile, it really feels like you’re going back in time. Some of the challenges are harder – such as asynchronicity and platform APIs that make it harder to structure software in a testable way. But it also feels like the tooling is much further behind, and until quite recently received far less attention. I feel the net result has been that developers have been discouraged from investing in automated tests, which is sad given that we know how valuable they can be.

Things are getting better, but I do greatly prefer the testing aspects of working on server-side JVM projects. For all its difficulties, we are actually quite lucky!

How to Fetch Multiple Oracle Execution Plans in One Nice Query

When looking at execution plans in Oracle, we’ll have to do several steps to be able to call the DBMS_XPLAN package functions. In fact, we have to find out the SQL_ID for a given statement first, and only then we can get its plan. I’ve blogged about this previously, here.

However, thanks to lateral unnesting, we can do the two steps in one go. More than that, we can even fetch several plans in one go this way! Check this out…

Let’s run the queries from the previous benchmarking blog post, and let’s add the /*+GATHER_PLAN_STATISTICS*/ hint to get some actual execution values, not just estimates:

SELECT /*+GATHER_PLAN_STATISTICS*/ 
  first_name, last_name, count(fa.actor_id) AS c
FROM actor a
LEFT JOIN film_actor fa
ON a.actor_id = fa.actor_id
WHERE last_name LIKE 'A%'
GROUP BY a.actor_id, first_name, last_name
ORDER BY c DESC
;

SELECT /*+GATHER_PLAN_STATISTICS*/
  first_name, last_name, (
    SELECT count(*)
    FROM film_actor fa
    WHERE a.actor_id =
    fa.actor_id
  ) AS c
FROM actor a
WHERE last_name LIKE 'A%' 
ORDER BY c DESC
;

Both queries do the same thing. They try to find those actors whose last name starts with the letter A and counts their corresponding films. Now what about the execution plans? Run the following query and you don’t have to know any SQL_ID in advance:

SELECT s.sql_id, p.*
FROM v$sql s, TABLE (
  dbms_xplan.display_cursor (
    s.sql_id, s.child_number, 'ALLSTATS LAST'
  )
) p
WHERE s.sql_text LIKE '%/*+GATHER_PLAN_STATISTICS*/%';

As you can see, with “LATERAL unnesting”, we can unnest a nested table (as returned by DBMS_XPLAN.DISPLAY_CURSOR) into the calling query, and we can pass column values from the V$SQL table to each function call. In other words, the above query reads as:

  • Get all SQL statements from the cursor cache V$SQL
  • Keep only those who have our GATHER_PLAN_STATISTICS hint in them (replace with your own query matching pattern)
  • Cross-join the unnested table from DBMS_XPLAN.DISPLAY_CURSOR where we get the plan per SQL_ID and CHILD_NUMBER

This implicit “LATERAL unnesting” is a bit obscure in my opinion (but its brief). More formally correct would be to use the actual LATERAL keyword, or better the SQL Server style CROSS APPLY

-- LATERAL: A bit verbose
SELECT s.sql_id, p.*
FROM v$sql s CROSS JOIN LATERAL (SELECT * FROM TABLE (
  dbms_xplan.display_cursor (
    s.sql_id, s.child_number, 'ALLSTATS LAST'
  )
)) p
WHERE s.sql_text LIKE '%/*+GATHER_PLAN_STATISTICS*/%';

-- CROSS APPLY: Very neat!
SELECT s.sql_id, p.*
FROM v$sql s CROSS APPLY TABLE (
  dbms_xplan.display_cursor (
    s.sql_id, s.child_number, 'ALLSTATS LAST'
  )
) p
WHERE s.sql_text LIKE '%/*+GATHER_PLAN_STATISTICS*/%';

In any case, the result is the same (I’ve removed some columns for brevity):

SQL_ID          PLAN_TABLE_OUTPUT
3gv1fd3dcj3b0	SQL_ID  3gv1fd3dcj3b0, child number 0
3gv1fd3dcj3b0	-------------------------------------
3gv1fd3dcj3b0	SELECT /*+GATHER_PLAN_STATISTICS*/ first_name, last_name, 
3gv1fd3dcj3b0	count(fa.actor_id) AS c FROM actor a LEFT JOIN film_actor fa ON 
3gv1fd3dcj3b0	a.actor_id = fa.actor_id WHERE last_name LIKE 'A%' GROUP BY a.actor_id, 
3gv1fd3dcj3b0	first_name, last_name ORDER BY c DESC
3gv1fd3dcj3b0	 
3gv1fd3dcj3b0	Plan hash value: 3014447605
3gv1fd3dcj3b0	 
3gv1fd3dcj3b0	-----------------------------------------------------------------------------------------------------
3gv1fd3dcj3b0	| Id  | Operation                              | Name                    | Starts | E-Rows | A-Rows |
3gv1fd3dcj3b0	-----------------------------------------------------------------------------------------------------
3gv1fd3dcj3b0	|   0 | SELECT STATEMENT                       |                         |      1 |        |      7 |
3gv1fd3dcj3b0	|   1 |  SORT ORDER BY                         |                         |      1 |      7 |      7 |
3gv1fd3dcj3b0	|   2 |   HASH GROUP BY                        |                         |      1 |      7 |      7 |
3gv1fd3dcj3b0	|*  3 |    HASH JOIN OUTER                     |                         |      1 |    154 |    196 |
3gv1fd3dcj3b0	|   4 |     TABLE ACCESS BY INDEX ROWID BATCHED| ACTOR                   |      1 |      6 |      7 |
3gv1fd3dcj3b0	|*  5 |      INDEX RANGE SCAN                  | IDX_ACTOR_LAST_NAME     |      1 |      6 |      7 |
3gv1fd3dcj3b0	|   6 |     INDEX FAST FULL SCAN               | IDX_FK_FILM_ACTOR_ACTOR |      1 |   5462 |   5462 |
3gv1fd3dcj3b0	-----------------------------------------------------------------------------------------------------
3gv1fd3dcj3b0	 
3gv1fd3dcj3b0	Predicate Information (identified by operation id):
3gv1fd3dcj3b0	---------------------------------------------------
3gv1fd3dcj3b0	 
3gv1fd3dcj3b0	   3 - access("A"."ACTOR_ID"="FA"."ACTOR_ID")
3gv1fd3dcj3b0	   5 - access("A"."LAST_NAME" LIKE 'A%')
3gv1fd3dcj3b0	       filter("A"."LAST_NAME" LIKE 'A%')
3gv1fd3dcj3b0	 
3gv1fd3dcj3b0	Note
3gv1fd3dcj3b0	-----
3gv1fd3dcj3b0	   - dynamic statistics used: dynamic sampling (level=2)
3gv1fd3dcj3b0	   - this is an adaptive plan
3gv1fd3dcj3b0	   - 1 Sql Plan Directive used for this statement
3gv1fd3dcj3b0	 
6a3nrpcw22avr	SQL_ID  6a3nrpcw22avr, child number 0
6a3nrpcw22avr	-------------------------------------
6a3nrpcw22avr	SELECT /*+GATHER_PLAN_STATISTICS*/ first_name, last_name, (   SELECT 
6a3nrpcw22avr	count(*)   FROM film_actor fa   WHERE a.actor_id =   fa.actor_id ) AS c 
6a3nrpcw22avr	FROM actor a WHERE last_name LIKE 'A%'  ORDER BY c DESC
6a3nrpcw22avr	 
6a3nrpcw22avr	Plan hash value: 3873085786
6a3nrpcw22avr	 
6a3nrpcw22avr	---------------------------------------------------------------------------------------------------
6a3nrpcw22avr	| Id  | Operation                            | Name                    | Starts | E-Rows | A-Rows |
6a3nrpcw22avr	---------------------------------------------------------------------------------------------------
6a3nrpcw22avr	|   0 | SELECT STATEMENT                     |                         |      1 |        |      7 |
6a3nrpcw22avr	|   1 |  SORT AGGREGATE                      |                         |      7 |      1 |      7 |
6a3nrpcw22avr	|*  2 |   INDEX RANGE SCAN                   | IDX_FK_FILM_ACTOR_ACTOR |      7 |     27 |    196 |
6a3nrpcw22avr	|   3 |  SORT ORDER BY                       |                         |      1 |      6 |      7 |
6a3nrpcw22avr	|   4 |   TABLE ACCESS BY INDEX ROWID BATCHED| ACTOR                   |      1 |      6 |      7 |
6a3nrpcw22avr	|*  5 |    INDEX RANGE SCAN                  | IDX_ACTOR_LAST_NAME     |      1 |      6 |      7 |
6a3nrpcw22avr	---------------------------------------------------------------------------------------------------
6a3nrpcw22avr	 
6a3nrpcw22avr	Predicate Information (identified by operation id):
6a3nrpcw22avr	---------------------------------------------------
6a3nrpcw22avr	 
6a3nrpcw22avr	   2 - access("FA"."ACTOR_ID"=:B1)
6a3nrpcw22avr	   5 - access("LAST_NAME" LIKE 'A%')
6a3nrpcw22avr	       filter("LAST_NAME" LIKE 'A%')
6a3nrpcw22avr	 

This is really neat!

SQL IN Predicate: With IN List or With Array? Which is Faster?

Hah! Got nerd-sniped again:

http://stackoverflow.com/questions/43099226/how-to-make-jooq-to-use-arrays-in-the-in-clause/43102102

A jOOQ user was wondering why jOOQ would generate an IN list for a predicate like this:

Java

COLUMN.in(1, 2, 3, 4)

SQL

COLUMN in (?, ?, ?, ?)

… when in fact there could have been the following predicate being generated, instead:

COLUMN = any(?::int[])

In the second case, there would have been only one single bind variable instead of 4, and the SQL generation and parsing work would have been “much” less (maybe not for the IN list of size 4, but let’s imagine a list of 50 values).

A disclaimer

First off, a disclaimer: In databases that have a cursor cache / plan cache (e.g. Oracle or SQL Server), you should be careful with long IN lists, because they will probably trigger a hard parse every time you run them, as by the time you run the exact same predicate (with 371 elements in the list) again, the execution plan will have been purged from the cache. So, you cannot really profit from the cache.

I’m aware of this problem, and it will be topic of another blog post, soon. Let’s stick to PostgreSQL whose “plan cache” isn’t really that sophisticated.

Measure, don’t guess

The question was about improving the speed of parsing a SQL statement. Parsers are really fast, so parsing shouldn’t be a problem. Generating an execution plan certainly does cost more time, but again, since PostgreSQL’s plan cache isn’t very sophisticated, this won’t play into the issue here. So the question is really:

Is an IN list really that bad in PostgreSQL?

Would an array bind variable be much better?

Since our recent post about benchmarking, we now know that we shall never guess, but always measure. I’m using again the Sakila database to run these two queries:

-- IN list
SELECT * 
FROM film 
JOIN film_actor USING (film_id) 
JOIN actor USING (actor_id) 
WHERE film_id IN (?, ?, ?, ?)

-- Array
SELECT * 
FROM film 
JOIN film_actor USING (film_id) 
JOIN actor USING (actor_id) 
WHERE film_id = ANY(?)

Let’s try lists of length 4, first. The benchmark is here:

DO $$
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT INT := 1000;
  rec RECORD;
  v_e1 INT := 1;
  v_e2 INT := 2;
  v_e3 INT := 4;
  v_e4 INT := 8;
  v_any_arr INT[] := ARRAY[v_e1, v_e2, v_e3, v_e4];
BEGIN
  FOR r IN 1..5 LOOP
    v_ts := clock_timestamp();

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT * 
        FROM film 
        JOIN film_actor USING (film_id) 
        JOIN actor USING (actor_id) 
        WHERE film_id IN (v_e1, v_e2, v_e3, v_e4)
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    RAISE INFO 'Run %, Statement 1: %', 
      r, (clock_timestamp() - v_ts); 
    v_ts := clock_timestamp();

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT * 
        FROM film 
        JOIN film_actor USING (film_id) 
        JOIN actor USING (actor_id) 
        WHERE film_id = ANY(v_any_arr)
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    RAISE INFO 'Run %, Statement 2: %', 
      r, (clock_timestamp() - v_ts); 
  END LOOP;
END$$;

The result being:

INFO:  Run 1, Statement 1: 00:00:00.112195
INFO:  Run 1, Statement 2: 00:00:00.450461
INFO:  Run 2, Statement 1: 00:00:00.109792
INFO:  Run 2, Statement 2: 00:00:00.446518
INFO:  Run 3, Statement 1: 00:00:00.105413
INFO:  Run 3, Statement 2: 00:00:00.44298
INFO:  Run 4, Statement 1: 00:00:00.108249
INFO:  Run 4, Statement 2: 00:00:00.476527
INFO:  Run 5, Statement 1: 00:00:00.120229
INFO:  Run 5, Statement 2: 00:00:00.448214

Interesting. So, the IN list outperforms the array bind variable every time by a factor of 4 (which is the size of the array / list!) So, let’s try 8 values, then. Here are the values and the adapted query 1:

-- values
  v_e1 INT := 1;
  v_e2 INT := 2;
  v_e3 INT := 4;
  v_e4 INT := 8;
  v_e5 INT := 16;
  v_e6 INT := 32;
  v_e7 INT := 64;
  v_e8 INT := 128;
  v_any_arr INT[] := ARRAY[v_e1, v_e2, v_e3, v_e4, v_e5, v_e6, v_e7, v_e8];

-- adapted query 1 ...
        WHERE film_id IN (v_e1, v_e2, v_e3, v_e4, v_e5, v_e6, v_e7, v_e8)
-- ...

The result is still impressive:

INFO:  Run 1, Statement 1: 00:00:00.182646
INFO:  Run 1, Statement 2: 00:00:00.63624
INFO:  Run 2, Statement 1: 00:00:00.184814
INFO:  Run 2, Statement 2: 00:00:00.685976
INFO:  Run 3, Statement 1: 00:00:00.188108
INFO:  Run 3, Statement 2: 00:00:00.634903
INFO:  Run 4, Statement 1: 00:00:00.184933
INFO:  Run 4, Statement 2: 00:00:00.626616
INFO:  Run 5, Statement 1: 00:00:00.185879
INFO:  Run 5, Statement 2: 00:00:00.636723

The IN list query now takes almost 2x as long (but not quite 2x), whereas the array query now takes around 1.5x as long. It looks as though arrays become the better choice when their size increases. So, let’s do this! With 32 bind variables in the IN list, or 32 array elements respectively:

INFO:  Run 1, Statement 1: 00:00:00.905064
INFO:  Run 1, Statement 2: 00:00:00.752819
INFO:  Run 2, Statement 1: 00:00:00.760475
INFO:  Run 2, Statement 2: 00:00:00.758247
INFO:  Run 3, Statement 1: 00:00:00.777667
INFO:  Run 3, Statement 2: 00:00:00.895875
INFO:  Run 4, Statement 1: 00:00:01.308167
INFO:  Run 4, Statement 2: 00:00:00.789537
INFO:  Run 5, Statement 1: 00:00:00.788606
INFO:  Run 5, Statement 2: 00:00:00.776159

Both are about equally fast. 64 bind values!

INFO:  Run 1, Statement 1: 00:00:00.915069
INFO:  Run 1, Statement 2: 00:00:01.058966
INFO:  Run 2, Statement 1: 00:00:00.951488
INFO:  Run 2, Statement 2: 00:00:00.906285
INFO:  Run 3, Statement 1: 00:00:00.907489
INFO:  Run 3, Statement 2: 00:00:00.892393
INFO:  Run 4, Statement 1: 00:00:00.900424
INFO:  Run 4, Statement 2: 00:00:00.903447
INFO:  Run 5, Statement 1: 00:00:00.961805
INFO:  Run 5, Statement 2: 00:00:00.951697

Still about the same. OK… INTERN! Get over here. I need you to “generate” 128 bind values on this query.

Yep, as expected. Finally, arrays start to outperform IN lists:

INFO:  Run 1, Statement 1: 00:00:01.122866
INFO:  Run 1, Statement 2: 00:00:01.083816
INFO:  Run 2, Statement 1: 00:00:01.416469
INFO:  Run 2, Statement 2: 00:00:01.134882
INFO:  Run 3, Statement 1: 00:00:01.122723
INFO:  Run 3, Statement 2: 00:00:01.087755
INFO:  Run 4, Statement 1: 00:00:01.143148
INFO:  Run 4, Statement 2: 00:00:01.124902
INFO:  Run 5, Statement 1: 00:00:01.236722
INFO:  Run 5, Statement 2: 00:00:01.113741

Using Oracle

Oracle also has array types (although you have to declare them as nominal types first, but that’s not a problem here).

Here are some benchmark results (as always, not actual benchmark results, but anonymised units of measurement. I.e. these aren’t seconds but… Larrys):

4 bind values

Run 1, Statement 1 : 01.911000000
Run 1, Statement 2 : 02.852000000
Run 2, Statement 1 : 01.659000000
Run 2, Statement 2 : 02.680000000
Run 3, Statement 1 : 01.628000000
Run 3, Statement 2 : 02.664000000
Run 4, Statement 1 : 01.629000000
Run 4, Statement 2 : 02.657000000
Run 5, Statement 1 : 01.636000000
Run 5, Statement 2 : 02.688000000

128 bind values

Run 1, Statement 1 : 04.010000000
Run 1, Statement 2 : 06.275000000
Run 2, Statement 1 : 03.749000000
Run 2, Statement 2 : 05.440000000
Run 3, Statement 1 : 03.985000000
Run 3, Statement 2 : 05.387000000
Run 4, Statement 1 : 03.807000000
Run 4, Statement 2 : 05.688000000
Run 5, Statement 1 : 03.782000000
Run 5, Statement 2 : 05.803000000

The size of the number of bind values doesn’t seem to matter really. There’s always a constant overhead of using the array bind variable compared to the IN list, but that might as well be a benchmarking error. For instance, when I add the /*+GATHER_PLAN_STATISTICS*/ hint to both queries, interestingly, the one with the array got significantly faster, whereas the IN list one was not affected… Weird?

Conclusion

This article doesn’t go into why there’s such a big difference for small lists when the benefit is only apparent for quite large lists.

But it has once again shown, that we must not optimise prematurely in SQL, but measure, measure, measure things. IN lists in dynamic SQL queries can be a big issue in production when they lead to cursor cache / plan cache saturation and a lot of “hard parsing”. So, the benefit of using the array is much more drastic when the content is big, as we can recycle execution plans much more often than with IN lists.

But chances are, that IN lists may be faster for single executions.

In any case: Choose carefully when following advice that you find somewhere on the Internet. Also, when following this advice. I ran the benchmark on PostgreSQL 9.5 and Oracle 11gR2 XE. Both are not the latest database versions. Try to measure things again on your side, to be sure that your “improvement” is really an actual improvement! And if in doubt, don’t optimise, until you’re sure you actually have a problem.

How to Benchmark Alternative SQL Queries to Find the Fastest Query

Tuning SQL isn’t always easy, and it takes a lot of practice to recognise how any given query can be optimised. One of the most important slides of my SQL training is the one summarising “how to be fast”:

How to be fast with SQL. Find out with the Data Geekery SQL Training

Some of these bullets were already covered on this blog. For instance avoiding needless, mandatory work, when client code runs queries or parts of queries that aren’t really necessary (e.g. selecting too many columns: “needless”), but the database cannot prove they’re needless, thus: “mandatory” for the database to execute.

But as with many other performance related topics, one key message is not to guess, but to measure! Or, in other words, not to optimise prematurely, but to optimise actual problems.

SQL is full of myths

SQL is a 4GL (Fourth-generation programming language) and as such, has always been a cool, convenient way to express data related constraints and queries. But the declarative nature of the language also often meant that programmers are really looking into a crystal ball. A lot of people have blogged about a lot of half-true discoveries that might have been correct in some context and at some point of time (this blog is no exception).

For instance:

  • Are correlated subqueries slower than their LEFT JOIN equivalents?
  • Are derived tables faster than views or common table expressions?
  • Is COUNT(*) faster than COUNT(1)?

Tons of myhts!

Measure your queries

To bust a myth, if you have good reasons to think that a differently written, but semantically equivalent query might be faster (on your database), you should measure. Don’t even trust any execution plan, because ultimately, what really counts is the wall clock time in your production system.

If you can measure your queries in production, that’s perfect. But often, you cannot – but you don’t always have to. One way to compare two queries with each other is to benchmark them by executing each query hundreds or even thousands of times in a row.

As any technique, benchmarking has pros and cons. Here is a non-exhaustive list:

Pros

  • Easy to do (see examples below)
  • Easy to reproduce, also on different environments
  • Easy to quickly get an idea in terms of orders of magnitude difference

Cons

  • Not actually measuring productive situations (no one runs the same query thousands of times in a row, without any other queries in parallel)
  • Queries may profit from unrealistic caching due to heavy repetition
  • “Real query” might be dynamic, so the “same query” might really manifest itself in dozens of different productive queries

But if you’re fine with the cons above, the pros might outweigh, for instance, if you want to find out whether a correlated subquery is slower than its LEFT JOIN equivalent for a given query. Note my using italics here, because even if you find out it’s slower for that given query it might be faster for other queries. Never jump to generalised rules before measuring again!

For instance, consider these two equivalent queries that run on the Sakila database. Both versions try to find those actors whose last name starts with the letter A and counts their corresponding films:

LEFT JOIN

SELECT first_name, last_name, count(fa.actor_id) AS c
FROM actor a
LEFT JOIN film_actor fa
ON a.actor_id = fa.actor_id
WHERE last_name LIKE 'A%'
GROUP BY a.actor_id, first_name, last_name
ORDER BY c DESC

Correlated subquery

SELECT first_name, last_name, (
  SELECT count(*)
  FROM film_actor fa
  WHERE a.actor_id =
  fa.actor_id
) AS c
FROM actor a
WHERE last_name LIKE 'A%' 
ORDER BY c DESC

The result is always:

The queries have different execution plans on PostgreSQL, Oracle, SQL Server as can be seen below:

PostgreSQL LEFT JOIN

(Plan looks “better”)

PostgreSQL correlated subquery

(Plan looks “worse”)

Oracle LEFT JOIN

(Plan looks “more complicated”)

Oracle correlated subquery

(Plan looks “simpler”)

SQL Server LEFT JOIN

(Plan looks “reasonable”)

SQL Server correlated subquery

(Plan looks… geez, where’s my correlated subquery? It’s been transformed to a LEFT JOIN!)

Huh, as you can see, in SQL Server, both queries produce the exact same plan (as they should, because the queries are really equivalent). But not all databases recognise this and/or optimise this. At least, that’s what the estimated plans suggest.

Also, don’t jump to the conclusion that if the cost of one plan is lower then it’s a better plan than an alternative. Costs can only really be compared when comparing alternative plans for the same query, e.g. in the Oracle example, we had both HASH JOIN and NESTED LOOP JOIN in a single plan, because Oracle 12c may collect runtime statistics and switch plans in flight thanks to the Oracle 12c Adaptive Query Optimization features.

But let’s ignore all of this and look at actual execution times, instead:

Benchmarking the alternatives

As always, disclaimer: Some commercial databases do not allow for publishing benchmark results without prior written consent. As I never ask for permission, but always ask for forgiveness, I do not have consent, and I’m thus not publishing actual benchmark results.

I have anonymized the benchmark results by introducing hypothetical, non-comparable units of measurement, so you cannot see that PostgreSQL is totally slower than Oracle and/or SQL Server. And you cannot see that SQL Server’s procedural language is totally uglier than PostgreSQL’s and/or Oracle’s.

Legal people.

Solving problems we wouldn’t have without legal people, in the first place

Enough ranting. Some important considerations:

  • Ideally, you’ll run benchmarks directly in the database using a procedural language, rather than, e.g. over JDBC to avoid network latency that incurs with JDBC calls, and other non-desired side-effects.
  • Repeat the benchmarks several times to prevent warmup side-effects and other random issues, as your OS / file system may be busy with accidental Scala compilation, or Slack UI refreshes
  • Be sure to actually consume the entire result set of each query in a loop, rather than just executing the query. Some databases may optimise for lazy cursor consumption (and possibly abortion). It would be unfair not to consume the entire result set

PostgreSQL

DO $$
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT INT := 10000;
  rec RECORD;
BEGIN

  -- Repeat the whole benchmark several times to avoid warmup penalty
  FOR i IN 1..5 LOOP
    v_ts := clock_timestamp();

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT first_name, last_name, count(fa.actor_id) AS c
        FROM actor a
        LEFT JOIN film_actor fa
        ON a.actor_id = fa.actor_id
        WHERE last_name LIKE 'A%'
        GROUP BY a.actor_id, first_name, last_name
        ORDER BY c DESC
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    RAISE INFO 'Run %, Statement 1: %', i, (clock_timestamp() - v_ts); 
    v_ts := clock_timestamp();

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT first_name, last_name, (
          SELECT count(*)
          FROM film_actor fa
          WHERE a.actor_id =
          fa.actor_id
        ) AS c
        FROM actor a
        WHERE last_name LIKE 'A%' 
        ORDER BY c DESC
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    RAISE INFO 'Run %, Statement 2: %', i, (clock_timestamp() - v_ts); 
  END LOOP;
END$$;

The result is:

INFO:  Run 1, Statement 1: 00:00:01.708257
INFO:  Run 1, Statement 2: 00:00:01.252012
INFO:  Run 2, Statement 1: 00:00:02.33151  -- Slack message received here
INFO:  Run 2, Statement 2: 00:00:01.064007
INFO:  Run 3, Statement 1: 00:00:01.638518
INFO:  Run 3, Statement 2: 00:00:01.149005
INFO:  Run 4, Statement 1: 00:00:01.670045
INFO:  Run 4, Statement 2: 00:00:01.230755
INFO:  Run 5, Statement 1: 00:00:01.81718
INFO:  Run 5, Statement 2: 00:00:01.166089

As you can see, in all 5 benchmark executions, the version with the correlated subquery seemed to have outperformed the version with the LEFT JOIN in this case by roughly 60%! As this is PostgreSQL and open source, benchmark results are in actual seconds for 10000 query executions. Neat. Let’s move on to…

Oracle

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP WITH TIME ZONE;
  v_repeat CONSTANT NUMBER := 10000;
BEGIN

  -- Repeat the whole benchmark several times to avoid warmup penalty
  FOR r IN 1..5 LOOP
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT first_name, last_name, count(fa.actor_id) AS c
        FROM actor a
        LEFT JOIN film_actor fa
        ON a.actor_id = fa.actor_id
        WHERE last_name LIKE 'A%'
        GROUP BY a.actor_id, first_name, last_name
        ORDER BY c DESC
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
      
    dbms_output.put_line('Run ' || r || ', Statement 1 : ' || (SYSTIMESTAMP - v_ts));
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT first_name, last_name, (
          SELECT count(*)
          FROM film_actor fa
          WHERE a.actor_id =
          fa.actor_id
        ) AS c
        FROM actor a
        WHERE last_name LIKE 'A%' 
        ORDER BY c DESC
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
      
    dbms_output.put_line('Run ' || r || ', Statement 2 : ' || (SYSTIMESTAMP - v_ts));
  END LOOP;
END;
/

Gee, check out the difference now (and remember, these are totally not seconds, but a hypothetical unit of measurment, let’s call them Newtons. Or Larrys. Let’s call them Larrys (great idea, Axel)):

Run 1, Statement 1 : 07.721731000
Run 1, Statement 2 : 00.622992000
Run 2, Statement 1 : 08.077535000
Run 2, Statement 2 : 00.666481000
Run 3, Statement 1 : 07.756182000
Run 3, Statement 2 : 00.640541000
Run 4, Statement 1 : 07.495021000
Run 4, Statement 2 : 00.731321000
Run 5, Statement 1 : 07.809564000
Run 5, Statement 2 : 00.632615000

Wow, the correlated subquery totally outperformed the LEFT JOIN query by an order of magnitude. This is totally insane. Now, check out…

SQL Server

… beautiful procedural language in SQL Server: Transact-SQL. With nice features like:

  • Needing to cast INT values to VARCHAR when concatenating them.
  • No indexed loop, only WHILE loop
  • No implicit cursor loops (instead: DEALLOCATE!)

Oh well. It’s just for a benchmark. So here goes:

DECLARE @ts DATETIME;
DECLARE @repeat INT = 10000;
DECLARE @r INT;
DECLARE @i INT;
DECLARE @dummy1 VARCHAR;
DECLARE @dummy2 VARCHAR;
DECLARE @dummy3 INT;

DECLARE @s1 CURSOR;
DECLARE @s2 CURSOR;

SET @r = 0;
WHILE @r < 5
BEGIN
  SET @r = @r + 1

  SET @s1 = CURSOR FOR 
    SELECT first_name, last_name, count(fa.actor_id) AS c
    FROM actor a
    LEFT JOIN film_actor fa
    ON a.actor_id = fa.actor_id
    WHERE last_name LIKE 'A%'
    GROUP BY a.actor_id, first_name, last_name
    ORDER BY c DESC

  SET @s2 = CURSOR FOR 
    SELECT first_name, last_name, (
      SELECT count(*)
      FROM film_actor fa
      WHERE a.actor_id =
      fa.actor_id
    ) AS c
    FROM actor a
    WHERE last_name LIKE 'A%' 
    ORDER BY c DESC

  SET @ts = current_timestamp;
  SET @i = 0;
  WHILE @i < @repeat
  BEGIN
    SET @i = @i + 1

    OPEN @s1;
    FETCH NEXT FROM @s1 INTO @dummy1, @dummy2, @dummy3;
    WHILE @@FETCH_STATUS = 0
    BEGIN
      FETCH NEXT FROM @s1 INTO @dummy1, @dummy2, @dummy3;
    END;

    CLOSE @s1;
  END;

  DEALLOCATE @s1;
  PRINT 'Run ' + CAST(@r AS VARCHAR) + ', Statement 1: ' + CAST(DATEDIFF(ms, @ts, current_timestamp) AS VARCHAR) + 'ms';

  SET @ts = current_timestamp;
  SET @i = 0;
  WHILE @i < @repeat
  BEGIN
    SET @i = @i + 1

    OPEN @s2;
    FETCH NEXT FROM @s2 INTO @dummy1, @dummy2, @dummy3;
    WHILE @@FETCH_STATUS = 0
    BEGIN
      FETCH NEXT FROM @s2 INTO @dummy1, @dummy2, @dummy3;
    END;

    CLOSE @s2;
  END;

  DEALLOCATE @s2;
  PRINT 'Run ' + CAST(@r AS VARCHAR) + ', Statement 2: ' + CAST(DATEDIFF(ms, @ts, current_timestamp) AS VARCHAR) + 'ms';
END;

And again, remember, these aren’t seconds. Really. They’re … Kilowatts. Yeah, let’s settle with kilowatts.

Run 1, Statement 1:  2626
Run 1, Statement 2: 20340
Run 2, Statement 1:  2450
Run 2, Statement 2: 17910
Run 3, Statement 1:  2706
Run 3, Statement 2: 18396
Run 4, Statement 1:  2696
Run 4, Statement 2: 19103
Run 5, Statement 1:  2716
Run 5, Statement 2: 20453

Oh my… Wait a second. Now suddenly, the correlated subquery is factor 5… more energy consuming (remember: kilowatts). Who would have thought?

Conclusion

This article won’t explain the differences in execution time between the different databases. There are a lot of reasons why a given execution plan will outperform another. There are also a lot of reasons why the same plan (at least what looks like the same plan) really isn’t because a plan is only a description of an algorithm. Each plan operation can still contain other operations that might still be different.

In summary, we can say that in this case (I can’t stress this enough. This isn’t a general rule. It only explains what happens in this case. Don’t create the next SQL myth!), the correlated subquery and the LEFT JOIN performed in the same order of magnitude on PostgreSQL (subquery being a bit faster), the correlated subquery drastically outperformed the LEFT JOIN in Oracle, whereas the LEFT JOIN drastically outperformed the correlated subquery in SQL Server (despite the plan having been the same!)

This means:

  • Don’t trust your intitial judgment
  • Don’t trust any historic blog posts saying A) is faster than B)
  • Don’t trust execution plans
  • Don’t trust this blog post here, because it is using uncomparable time scales (seconds vs newtons vs kilowatts)
  • Don’t fully trust your own benchmarks, because you’re not measuring things as they happen in production

And sadly:

  • Even for such a simple query, there’s no optimal query for all databases

(and I haven’t even included MySQL in the benchmarks)

BUT

by measuring two alternative, equivalent queries, you may just get an idea what might perform better for your system in case you do have a slow query somewhere. Perhaps this helps.

And now that you’re all hot on the subject, go book our 2 day SQL training, where we have tons of other interesting, myth busting content!

The Open-Closed Principle is Often Not What You Think it Is

jOOQ is a library that loves making everything internal final and package private. We have tons of classes like these:

final class Concat extends AbstractFunction<String> {
    // ...
}

The class implements the semantics of SQL string concatenation. Clearly, you shouldn’t need to tamper with it (or even know about it), because it is “protected” behind the corresponding public API in the DSL class:

// You can see this:
public class DSL {

    // You can see this but not override it:
    public static Field<String> concat(Field<?>... fields) {

        // But you cannot do this, yourself:
        return new Concat(nullSafe(fields));
    }
}

Now, in the past decades, there have been a lot of software design movements that were contrary to the concept of encapsulation in some ways. The driving powers of that were:

A fun to read example of “slightly” (i.e. completely) exaggerated advocacy of extreme application of object orientation is Yegor Bugayenko’s blog:

http://www.yegor256.com

Through exaggeration, he makes some really interesting points that make you think. Of course, you have to be able to accept the hyperboles as non-facts. Not everyone can do that, so don’t get angry reading 😉

Let’s look at the open-closed principle

The open-closed principle claims, according to Wikipedia:

In object-oriented programming, the open/closed principle states “software entities (classes, modules, functions, etc.) should be open for extension, but closed for modification”; that is, such an entity can allow its behaviour to be extended without modifying its source code.

This is a very desireable aspect of some software entities. For instance, it is always true for an SPI (Service Provider Interface), by design, of course. Let’s read the Wikipedia definition of an SPI:

Service Provider Interface (SPI) is an API intended to be implemented or extended by a third party. It can be used to enable framework extension and replaceable components

Perfect. For instance, a jOOQ Converter is a SPI. We’ve just published a recent post about how to use the Converter API in a strategy pattern style with lambdas – the strategy pattern works really well with SPIs.

In fact, the strategy pattern isn’t even strictly an object oriented feature, you can get it for free in functional programming without giving it a fancy name. It’s just any ordinary higher order function.

Another fine example of what could be considered an SPI is an Iterable. While Iterable subtypes like List are more often used as APIs (user is the consumer) rather than SPIs (user is the implementor), the Iterable API itself is more of a way of providing the functionality required to run code inside of a foreach loop. For instance, jOOQ’s ResultQuery implements Iterable, which allows it to be used in a foreach loop:

for (MyTableRecord rec : DSL
    .using(configuration)
    .selectFrom(MY_TABLE)
    .orderBy(MY_TABLE.COLUMN)) { // Automatic execution, fetching
 
    doThingsWithRecord(rec);
}

So, clearly, it can be said that:

  • Iterable follows the open-closed principle as it models an entity that is open for extension (I can produce my own iterable semantics), but closed for modification (I won’t ever modify the Java compiler and/or the foreach loop semantics
  • The Liskov substitution principle is also followed trivially, as the foreach loop doesn’t care at all about how I implement my Iterable, as long as it behaves like one (providing an Iterator)

That was easy

But when does it not apply?

In a lot of situations. For instance, jOOQ is in many ways not designed for object oriented extension. You simply should not:

  • Mock the concat() function.
    You might be tempted to do so, as you might think that you need to unit test everything, including third party libraries, and then you have to mock out the string concatenation feature inside of your database. But it doesn’t work. The DSL.concat() method is static, and the implementation hidden. No way you could replace it with ordinary means (there are some dirty tricks).

    But hold on for a second. Why are you even doing this? Aren’t integration tests the better way here? Do you really have time (and want to spend it) on replacing entire complex implementations with your mocks? I don’t think so. That hardly every works

  • Modify the concatenation behaviour for some use-case.
    While you may think that sometimes, you’d just like to tweak an implementation a little bit to get a quick win, that is certainly not the intent of the authors of the open-closed principle or the Lishkov substitution principle. We as API designers don’t want you to extend all of our functionality. As simple as that. Why? Because we want you to get in touch with us to help us improve our software for everyone, rather than you tweaking something for a quick win.

Let this sink in – especially the latter.

The premise that everything should be object oriented and everything should be extensible is wrong. Object orientation (and all the philosophies connected to it) are a tool. They’re a very powerful tool, for instance, when we as API/SPI designers want to allow users to extend our software. (mostly through SPIs). And we spend a lot of time thinking about really good, generic, useful, powerful SPIs that solve 99% of all extensibility problems in a way that we can control and keep backwards compatible. For some examples, check out these blog posts:

And sometimes, yes, we did not foresee a justified request for extensibility. Nothing is perfect. You have a feature request, and you cannot implement it right away. Then you start exploring. You look into ways how you can inject some behaviour into jOOQ. And as we Java developers like object orientation, we’re looking into writing subclasses to override existing behaviour. That’s what we were taught. That’s what we’re doing all the time. That’s what the combination of the open-closed principle and the Liskov substitution principle suggest.

Let me shock you for a moment.

Haskell (and many other languages) doesn’t support subtype polymorphism

Yes. There are entire ecosystems out there, that don’t have the luxury of bikeshedding the fact that if a class cannot be (easily) extended through subtype polymorphism and overriding of methods, it must be ill-designed. An entire ecosystem that never worries about something being final, and thus “closed for extension” (through subtype polymorphism).

Alternative definitions

Given the historic context, both principles are very interesting things. But their object-oriented context is something we should free our minds of. Here’s a better definition:

  • open-closed principle:
    Systems should strive for openness for extension, but not at any price. Some parts of a system / module / perhaps class should be open for extension. Those parts should be very well designed and kept very backwards compatible. And the vendor of those parts should listen to its consumers to better identify the required extension points. Consumers on the other hand shouldn’t blindly assume that everything can be extended. If they’re extending (through unexpected subtype polymorphism) random parts, then they’re hacking in the same way as if they would be actually modifying the system / parts. There’s no more benefit to extending.
  • Liskov substitution principle:
    Subtype polymorphism is just a tool, and in 2017, we have long started understanding that it’s a very wrong tool for many things. The composition over inheritance concept has shown that we’ve regretted the subtype polymorphism hype from the 90s. So, forget about your mocks through subtype overriding. Start looking for alternative interpretations of this principle. I like Jessica Kerr’s finding:

    Therefore, the Liskov Substition Principle says, “Don’t surprise people.”

    That’s a much better credo to follow, than the one that is strictly related to an aspect of object orientation and in particular to subtype polymorphism.

Conclusion

Yes. Package private, final classes mean, you cannot extend them. The open-closed principle is “violated”. Because that part of the system was not designed for you to know about (it’s encapsulated).

Sometimes, you think that if just you could override such an entity, you might get a quick win and inject your desired behaviour into a third party library / entity / class / module / system. My claim here is that: Mostly, you’ll deeply regret your desire for a quick win later on. You shouldn’t argue about open-closed or Liskov substitution. These principles simply don’t apply here. They do not at all, in particular, apply to badly designed legacy software. Once software is “badly designed”, no principles will help you.

Instead, do get in touch with the vendor if you run into a bump. There’s always an interesting idea for a great new feature hidden in such a limitation. And for the time being, accept that your overriding of what was not meant to be overridden is just the same thing as actually modifying that entity. You’re patching the library. Let’s do that and move on.

Creating Tables Dum and Dee in PostgreSQL

I was nerd-sniped:

So tables dee and dum are two theoretical tables in SQL, and they can be characterised as such:

[Dee] is the relation that has no attributes and a single tuple. It plays the role of True.

[Dum] is the relation that has no attributes and no tuples. It plays the role of False.

Quite academic? Sure. But the awesome PostgreSQL database can model these beasts! Check this out:

-- Creating the tables:
CREATE TABLE dum();
CREATE TABLE dee();
INSERT INTO dee DEFAULT VALUES;

-- Making sure the tables stay this way:
CREATE FUNCTION dum_trg ()
RETURNS trigger
AS $$
BEGIN
  RAISE EXCEPTION 'Dum must be empty';
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER dum_trg
BEFORE INSERT OR UPDATE OR DELETE OR TRUNCATE
ON dum
FOR EACH STATEMENT
EXECUTE PROCEDURE dum_trg();

CREATE FUNCTION dee_trg ()
RETURNS trigger
AS $$
BEGIN
  RAISE EXCEPTION 'Dee must keep one tuple';
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER dee_trg
BEFORE INSERT OR UPDATE OR DELETE OR TRUNCATE
ON dee
FOR EACH STATEMENT
EXECUTE PROCEDURE dee_trg();

And we’re done!

Check this out:

SELECT * FROM dum;

Nothing!

SELECT * FROM dee;

One row with no columns!

Just to be sure:

SELECT 'dum' AS t, count(*) 
FROM dum 
UNION ALL
SELECT 'dee' AS t, count(*) 
FROM dee;

And we’ll get, nicely:

Caveat

Note, it is worth mentioning that there are some flaws / “bugs” (in my opinion). This query:

SELECT DISTINCT * FROM dee;

… yields an error:

ERROR: SELECT DISTINCT must have at least one column
SQL state: 42601

I suspect the author(s) of the DISTINCT operation have overlooked a nice feature here. UNION on the other hand doesn’t work correctly either. It doesn’t remove duplicates (but also doesn’t complain:

SELECT * FROM dee
UNION
SELECT * FROM dee

This yields:

, which is surprising, because when we nest the record with the following useful PostgreSQL specific syntax, we’ll get a single nested empty row:

SELECT dee FROM dee
UNION
SELECT dee FROM dee

When we use EXCEPT or INTERSECT, however, clearly, we’re running in a bug. Both of these queries return the same result:

SELECT * FROM dee
EXCEPT
SELECT * FROM dee

SELECT * FROM dee
INTERSECT
SELECT * FROM dee

The same result as the UNION:

In any case: Every database schema should have these. Much more powerful than Oracle’s DUAL table. With this, have a nice weekend!