Avoid Using COUNT() in SQL When You Could Use EXISTS()

A while ago, I blogged about the importance of avoiding unnecessary COUNT(*) queries:
https://blog.jooq.org/2014/08/08/sql-tip-of-the-day-be-wary-of-select-count

… and how to replace them with equivalent EXISTS queries

exist

As I’m updating the jOOQ SQL Masterclass to show also PostgreSQL performance characteristics in addition to Oracle, I really have to reiterate this topic. Please repeat after me:

Thou shalt not use COUNT(*) when EXISTS sufficeth thy needs

The rationale is simple

COUNT(*) needs to return the exact number of rows. EXISTS only needs to answer a question like:

  “Are there any rows at all?”

In other words, EXISTS can short-circuit after having found the first matching row. If your client code (e.g. written in Java or in PL/SQL, or any other client language) needs to know something like:

  “Did actors called “Wahlberg” play in any films at all?”

Then you have two options to write that query:

Very very bad: Use COUNT(*)

Using PostgreSQL syntax:

SELECT count(*)
FROM actor a
JOIN film_actor fa USING (actor_id)
WHERE a.last_name = 'WAHLBERG'

The above query will return a number > 0 if we any Wahlberg played in a film, or 0 if not. Notice that we don’t care how many films all the Wahlbergs played in, yet we ask the database to calculate the precise number.

Let’s run the above query against the Sakila database. The execution plans for the above query in Oracle:

Oracle Execution Plan for a Query with COUNT(*)

And in PostgreSQL:

mqlgukh1

Much much better: Use EXISTS()

Using PostgreSQL syntax:

SELECT EXISTS (
  SELECT * FROM actor a
  JOIN film_actor fa USING (actor_id)
  WHERE a.last_name = 'WAHLBERG'
)

The execution plans for the above query in Oracle:

Oracle Execution Plan for a Query with EXISTS()

And in PostgreSQL:

plwnqga1

How to read this?

As you can see from the above execution plans, the cost in Oracle is slightly better (going from 3 to 2) when using EXISTS than when using COUNT(*), because of a much better cardinality estimate in the middle of the execution plan. In other words, Oracle “knows” that we’re looking for only one record and as soon as that record has been found, we can stop looking.

In PostgreSQL, things are more drastic (going from 123 to 3.4). The EXISTS version has an associated cost that is almost 30x lower than the version that uses COUNT(*) for the same result.

You can gaze at the plan for a while and figure out what the exact difference is, or you can believe me that this is true:

It is obviously much faster to check for existence rather than to count all results, if what you’re really trying to do is checking for existence

Duh.

Does this apply to me?

Yes. I’m taking bets. Many many code bases out there get this wrong all the time. Checking for sizes to be zero is just too convenient. Not only in SQL, but also in Java. Consider this. Which one is better?

Collection<?> collection = ...

// EXISTS
if (!collection.isEmpty())
    doSomething();

// COUNT(*)
if (collection.size() == 0)
    doSomething();

Sometimes, this doesn’t really matter, e.g. in ArrayList, whose isEmpty() method reads:

public boolean isEmpty() {
    return size == 0;
}

But what if your collection is a lazy loaded Hibernate collection? Not all collections cache this size value, and even if they do, they may still produce overhead in the source system in order to calculate the exact size. In fact, they might even run a completely unnecessary query fetching all the child entities from the database just to check for existence.

Bonus exercise for my Hibernate-aficionado readers out there: Do the exercise with Hibernate. Because at this point, I for one would say: Just use SQL™

OK, costs. But what does it mean?

Let’s benchmark these two statements in Oracle and PostgreSQL.

Oracle

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP WITH TIME ZONE;
  v_repeat CONSTANT NUMBER := 10000;
BEGIN
  v_ts := SYSTIMESTAMP;
    
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT CASE WHEN EXISTS (
        SELECT * FROM actor a
        JOIN film_actor fa USING (actor_id)
        WHERE a.last_name = 'WAHLBERG'
      ) THEN 1 ELSE 0 END
      FROM dual
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
    
  dbms_output.put_line('Statement 1 : ' || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;
    
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT count(*)
      FROM actor a
      JOIN film_actor fa USING (actor_id)
      WHERE a.last_name = 'WAHLBERG'
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
    
  dbms_output.put_line('Statement 2 : ' || (SYSTIMESTAMP - v_ts));
END;
/

We get a slight but significant performance improvement of factor 1.3x:

Statement 1 : 3
Statement 2 : 4

(not actual times, because thank you Oracle legal for prohibiting all sorts of stuff). But you can check out the Sakila database yourself and run the above benchmark on your machine.

PostgreSQL

DO $$
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT INT := 1000;
  rec RECORD;
BEGIN
  v_ts := clock_timestamp();

  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT CASE WHEN EXISTS (
        SELECT * FROM actor a
        JOIN film_actor fa USING (actor_id)
        WHERE a.last_name = 'WAHLBERG'
      ) THEN 1 ELSE 0 END
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;

  RAISE INFO 'Statement 1: %', (clock_timestamp() - v_ts); 
  v_ts := clock_timestamp();

  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT count(*)
      FROM actor a
      JOIN film_actor fa USING (actor_id)
      WHERE a.last_name = 'WAHLBERG'
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;

  RAISE INFO 'Statement 2: %', (clock_timestamp() - v_ts); 
END$$;

A whopping factor 40x in terms of wallclock time gain!

INFO:  Statement 1: 00:00:00.023656
INFO:  Statement 2: 00:00:00.7944

Let me repeat this:

Factor 40x on PostgreSQL

That’s something! It looks as though COUNT(*) is much better optimised on Oracle (e.g. by counting leaf nodes in an index) than on PostgreSQL, but in any case, the amount of extra work is prohibitive in both databases.

Conclusion

I’m repeating myself, but this is important. Print it out and put it on your office wall:

Thou shalt not use COUNT(*) when EXISTS sufficeth thy needs

Thank you.

17 thoughts on “Avoid Using COUNT() in SQL When You Could Use EXISTS()

    • I should’ve been more clear. I meant when entities are navigated. Kind of like

      actor.getFilms().isEmpty();
      

      I bet there are tons of applications out there that get this wrong, materialising the entire films relationship just to check for existence.

      • I see. That’s indeed stupid. I could imagine using it somewhere where speed doesn’t matter at all (something to be run just once) out of laziness (imagine getFilms() uses a complicated query).

        In theory, Hibernate could optimize this easily. Return a list proxy and use EXISTS when isEmpty gets called. This would be wasteful, when the list content is needed later, but that’s something the optimizer could learn from previous queries (I guess I’m repeating myself; sort of JIT for ORM).

        • Speed always matters at some point. Perhaps only in production, though 😉 a.k.a.: “Works on my machine”

          Hibernate already does that with the list proxy. But as far as I know, it calls COUNT(*) for all of these methods, including isEmpty(). But then again, the client will call isEmpty() for every actor in a client loop, and I doubt that JIT for ORM is a thing anytime soon, such that this client side loop will be translated to a JOIN…

          • > Speed always matters at some point. Perhaps only in production, though😉

            😀 Even in production, there are things which hardly matters. A small data fix I wrote last week is a good example: it doesn’t take long (we don’t have much data now), it’s needed just once and it’ll get dropped after a while.

            > JIT for ORM

            We may want to call it AI as it can get pretty complicated. Using EXISTS instead of COUNT is trivial, finding out that there are many DB calls executing the same query with different parameters is easy, replacing them all with a single call is doable. Getting it all work well is probably hard, but JVM JIT isn’t simple either.

  1. COUNT(*) needs to return the exact number of rows. EXISTS only needs to answer a question like: “Are there any rows at all?”

    In other words EXISTS short circuits and returns true on the first occurrence of a match

    • It depends. If your database supports LIMIT without ORDER BY, then this might be as good as EXISTS. But if ORDER BY is required, a LIMIT solution is at best as good as COUNT(*) and perhaps even worse.

      In any case: Measure!

      • Why do some databases require ORDER BY with LIMIT? Do they refuse to return an essentially random subset of the full result? That sound naive, as (I hope that) they don’t require the order to be total.

        Another thing: EXISTS could be also a grouping function like in

        SELECT town, EXISTS(age > 100) FROM someTable GROUP BY town
        

        couldn’t it? This would probably make the misuses of COUNT less common.

        • Yes, unfortunately they do. E.g. SQL Server allows OFFSET .. FETCH only with ORDER BY (although TOP is possible without ORDER BY). I’m pretty sure Oracle does optimisations for ROWNUM = 1 filters, where only the “max” row needs to be fetched (in O(N)) rather than actually performing the entire sort (in O(N log N)). Or, if no ORDER BY clause is present, then ROWNUM = 1 will yield a random row as well.

          PostgreSQL has the bool_or() aggregate function, which I’m sure is short-circuited as soon as the first evaluation of the argument expression yields true.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s