Avoid Using COUNT() in SQL When You Could Use EXISTS()

Posted on September 14, 2016June 25, 2024 by lukaseder

A while ago, I blogged about the importance of avoiding unnecessary COUNT(*) queries:
https://blog.jooq.org/sql-tip-of-the-day-be-wary-of-select-count

… and how to replace them with equivalent EXISTS queries

As I’m updating the SQL training to show also PostgreSQL performance characteristics in addition to Oracle, I really have to reiterate this topic. Please repeat after me:

Thou shalt not use COUNT(*) when EXISTS sufficeth thy needs

The rationale is simple

COUNT(*) needs to return the exact number of rows. EXISTS only needs to answer a question like:

“Are there any rows at all?”

In other words, EXISTS can short-circuit after having found the first matching row. If your client code (e.g. written in Java or in PL/SQL, or any other client language) needs to know something like:

“Did actors called “Wahlberg” play in any films at all?”

Then you have two options to write that query:

Very very bad: Use COUNT(*)

Using PostgreSQL syntax:

SELECT count(*)
FROM actor a
JOIN film_actor fa USING (actor_id)
WHERE a.last_name = 'WAHLBERG'

The above query will return a number > 0 if we any Wahlberg played in a film, or 0 if not. Notice that we don’t care how many films all the Wahlbergs played in, yet we ask the database to calculate the precise number.

Let’s run the above query against the Sakila database. The execution plans for the above query in Oracle:

Oracle Execution Plan for a Query with COUNT(*)

And in PostgreSQL:

Much much better: Use EXISTS()

Using PostgreSQL syntax:

SELECT EXISTS (
  SELECT 1 FROM actor a
  JOIN film_actor fa USING (actor_id)
  WHERE a.last_name = 'WAHLBERG'
)

The execution plans for the above query in Oracle:

Oracle Execution Plan for a Query with EXISTS()

And in PostgreSQL:

How to read this?

As you can see from the above execution plans, the cost in Oracle is slightly better (going from 3 to 2) when using EXISTS than when using COUNT(*), because of a much better cardinality estimate in the middle of the execution plan. In other words, Oracle “knows” that we’re looking for only one record and as soon as that record has been found, we can stop looking.

In PostgreSQL, things are more drastic (going from 123 to 3.4). The EXISTS version has an associated cost that is almost 30x lower than the version that uses COUNT(*) for the same result.

You can gaze at the plan for a while and figure out what the exact difference is, or you can believe me that this is true:

It is obviously much faster to check for existence rather than to count all results, if what you’re really trying to do is checking for existence

Duh.

Does this apply to me?

Yes. I’m taking bets. Many many code bases out there get this wrong all the time. Checking for sizes to be zero is just too convenient. Not only in SQL, but also in Java. Consider this. Which one is better?

Collection<?> collection = ...

// EXISTS
if (!collection.isEmpty())
    doSomething();

// COUNT(*)
if (collection.size() == 0)
    doSomething();

Sometimes, this doesn’t really matter, e.g. in ArrayList, whose isEmpty() method reads:

public boolean isEmpty() {
    return size == 0;
}

But what if your collection is a lazy loaded Hibernate collection? Not all collections cache this size value, and even if they do, they may still produce overhead in the source system in order to calculate the exact size. In fact, they might even run a completely unnecessary query fetching all the child entities from the database just to check for existence.

Bonus exercise for my Hibernate-aficionado readers out there: Do the exercise with Hibernate. Because at this point, I for one would say: Just use SQL™

OK, costs. But what does it mean?

Let’s benchmark these two statements in Oracle and PostgreSQL.

Oracle

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP WITH TIME ZONE;
  v_repeat CONSTANT NUMBER := 10000;
BEGIN
  v_ts := SYSTIMESTAMP;
    
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT CASE WHEN EXISTS (
        SELECT 1 FROM actor a
        JOIN film_actor fa USING (actor_id)
        WHERE a.last_name = 'WAHLBERG'
      ) THEN 1 ELSE 0 END
      FROM dual
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
    
  dbms_output.put_line('Statement 1 : ' || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;
    
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT count(*)
      FROM actor a
      JOIN film_actor fa USING (actor_id)
      WHERE a.last_name = 'WAHLBERG'
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
    
  dbms_output.put_line('Statement 2 : ' || (SYSTIMESTAMP - v_ts));
END;
/

We get a slight but significant performance improvement of factor 1.3x:

Statement 1 : 3
Statement 2 : 4

(not actual times, because those can’t be published for commercial DBs). But you can check out the Sakila database yourself and run the above benchmark on your machine.

PostgreSQL

DO $$
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT INT := 1000;
  rec RECORD;
BEGIN
  v_ts := clock_timestamp();

  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT CASE WHEN EXISTS (
        SELECT 1 FROM actor a
        JOIN film_actor fa USING (actor_id)
        WHERE a.last_name = 'WAHLBERG'
      ) THEN 1 ELSE 0 END
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;

  RAISE INFO 'Statement 1: %', (clock_timestamp() - v_ts); 
  v_ts := clock_timestamp();

  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT count(*)
      FROM actor a
      JOIN film_actor fa USING (actor_id)
      WHERE a.last_name = 'WAHLBERG'
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;

  RAISE INFO 'Statement 2: %', (clock_timestamp() - v_ts); 
END$$;

A whopping factor 40x in terms of wallclock time gain!

INFO:  Statement 1: 00:00:00.023656
INFO:  Statement 2: 00:00:00.7944

Let me repeat this:

Factor 40x on PostgreSQL

That’s something! It looks as though COUNT(*) is much better optimised on Oracle (e.g. by counting leaf nodes in an index) than on PostgreSQL, but in any case, the amount of extra work is prohibitive in both databases.

Conclusion

I’m repeating myself, but this is important. Print it out and put it on your office wall:

Thou shalt not use COUNT(*) when EXISTS sufficeth thy needs

Thank you.

Published by lukaseder

I made jOOQ View all posts by lukaseder

23 thoughts on “Avoid Using COUNT() in SQL When You Could Use EXISTS()”

Maaartinus says:

September 14, 2016 at 19:06

Do the exercise with Hibernate

Should I? There’s
Subqueries.exists, but I don’t know how to use when I have already a plain Criteria instead of DetachedCriteria (rather painful, I know). I’m just asking on SO.

Do you want an exercise, too? In one place I used something like

getSingleQueryResult(“SELECT id FROM … LIMIT 1”) != null

instead of

getSingleQueryResult(“SELECT EXISTS (SELECT 1 FROM …)”)

It works as I know that the id can’t be null. The speed should be about the same, but is it? Fetching the single value from the PK shouldn’t matter.

Loading...

Reply
1. lukaseder says:
  
  September 15, 2016 at 21:06
  I should’ve been more clear. I meant when entities are navigated. Kind of like
```
actor.getFilms().isEmpty();
```
  I bet there are tons of applications out there that get this wrong, materialising the entire films relationship just to check for existence.
  
  Loading...
  Reply
  1. Maaartinus says:
    
    September 16, 2016 at 04:09
    
    I see. That’s indeed stupid. I could imagine using it somewhere where speed doesn’t matter at all (something to be run just once) out of laziness (imagine getFilms() uses a complicated query).
    
    In theory, Hibernate could optimize this easily. Return a list proxy and use EXISTS when isEmpty gets called. This would be wasteful, when the list content is needed later, but that’s something the optimizer could learn from previous queries (I guess I’m repeating myself; sort of JIT for ORM).
    
    Loading...
    
    Reply
    1. lukaseder says:
      
      September 16, 2016 at 08:08
      
      Speed always matters at some point. Perhaps only in production, though ;) a.k.a.: “Works on my machine”
      
      Hibernate already does that with the list proxy. But as far as I know, it calls COUNT(*) for all of these methods, including isEmpty(). But then again, the client will call isEmpty() for every actor in a client loop, and I doubt that JIT for ORM is a thing anytime soon, such that this client side loop will be translated to a JOIN…
      
      Loading...
      
      Reply
      1. Maaartinus says:
        
        September 19, 2016 at 12:22
        
        > Speed always matters at some point. Perhaps only in production, though😉
        
        :D Even in production, there are things which hardly matters. A small data fix I wrote last week is a good example: it doesn’t take long (we don’t have much data now), it’s needed just once and it’ll get dropped after a while.
        
        > JIT for ORM
        
        We may want to call it AI as it can get pretty complicated. Using EXISTS instead of COUNT is trivial, finding out that there are many DB calls executing the same query with different parameters is easy, replacing them all with a single call is doable. Getting it all work well is probably hard, but JVM JIT isn’t simple either.
        
        Loading...
        
        Reply
        
        lukaseder says:
        
        September 19, 2016 at 15:17
        
        I agree. On both. :)
        
        Loading...
        
        Reply
nikosvg says:

September 14, 2016 at 20:57

COUNT(*) needs to return the exact number of rows. EXISTS only needs to answer a question like: “Are there any rows at all?”

In other words EXISTS short circuits and returns true on the first occurrence of a match

Loading...

Reply
1. lukaseder says:
  
  September 15, 2016 at 21:08
  
  Exactly. Thanks for the feedback. Will edit the article to clarify this.
  
  Loading...
  
  Reply
bebop says:

September 15, 2016 at 09:26

what about mysql?

Loading...

Reply
1. lukaseder says:
  
  September 15, 2016 at 21:10
  
  This is also true for MySQL
  
  Loading...
  
  Reply
krab says:

September 25, 2016 at 00:27

what about “select … limit 1” ?

Loading...

Reply
1. lukaseder says:
  
  September 25, 2016 at 21:28
  
  It depends. If your database supports LIMIT without ORDER BY, then this might be as good as EXISTS. But if ORDER BY is required, a LIMIT solution is at best as good as COUNT(*) and perhaps even worse.
  
  In any case: Measure!
  
  Loading...
  
  Reply
  1. Maaartinus says:
    
    September 25, 2016 at 22:41
    Why do some databases require ORDER BY with LIMIT? Do they refuse to return an essentially random subset of the full result? That sound naive, as (I hope that) they don’t require the order to be total.
    
    Another thing: EXISTS could be also a grouping function like in
    
    SELECT town, EXISTS(age > 100) FROM someTable GROUP BY town
    
    couldn’t it? This would probably make the misuses of COUNT less common.
    
    Loading...
    Reply
    1. lukaseder says:
      
      September 25, 2016 at 23:09
      
      Yes, unfortunately they do. E.g. SQL Server allows OFFSET .. FETCH only with ORDER BY (although TOP is possible without ORDER BY). I’m pretty sure Oracle does optimisations for ROWNUM = 1 filters, where only the “max” row needs to be fetched (in O(N)) rather than actually performing the entire sort (in O(N log N)). Or, if no ORDER BY clause is present, then ROWNUM = 1 will yield a random row as well.
      
      PostgreSQL has the bool_or() aggregate function, which I’m sure is short-circuited as soon as the first evaluation of the argument expression yields true.
      
      Loading...
      
      Reply
    2. lukaseder says:
      
      September 26, 2016 at 08:02
      
      Let’s find out for PostgreSQL: https://stackoverflow.com/q/39695647/521799
      
      Loading...
      
      Reply
Robert says:

March 13, 2017 at 19:05

And how to write the SELECT EXIST(…) statement in JOOQ?

Loading...

Reply
1. lukaseder says:
  
  March 13, 2017 at 22:45
  You can use DSL.field(Condition) to wrap a predicate in a boolean column expression:
  https://www.jooq.org/javadoc/latest/org/jooq/impl/DSL.html#field-org.jooq.Condition-
  
  Example:
```
select(field(exists(...)))
```
  A shortcut might be DSLContext.fetchExists(Select), the convenience API:
  https://www.jooq.org/javadoc/latest/org/jooq/DSLContext.html#fetchExists-org.jooq.Select-
  
  Loading...
  Reply
erdiizgi says:

September 12, 2019 at 09:08

You can have
SELECT count(1) …
WHERE …
AND rownum = 1;

this will prevent full table scan.

Loading...

Reply
1. lukaseder says:
  
  September 12, 2019 at 09:10
  
  Yes, I’ve seen that practice occasionally, but I find that really weird and unreadable. But indeed, I haven’t found a drawback compared to the EXISTS() approach.
  
  Loading...
  
  Reply
  1. erdiizgi says:
    
    September 12, 2019 at 09:34
    
    EXISTS is far more meaningful as a word, I agree on this, but if you are using Oracle, there are drawbacks such as you have to use CASE WHEN EXIST with another SELECT statement. You should also specify THEN .. ELSE .. FROM DUAL.
    
    To me, this is really long statement to read to see if a row exists in a table.
    
    For other DBs, EXIST looks more readable, but I would disagree for the Oracle.
    
    Loading...
    
    Reply
    1. lukaseder says:
      
      September 12, 2019 at 09:40
      
      Well, feel free to upvote these feature requests :)
      
      https://community.oracle.com/ideas/13697 (Support for the SQL standard VALUES() table constructor)
      https://community.oracle.com/ideas/2633 (Support for the BOOLEAN type in SQL)
      
      Loading...
      
      Reply
Adrian Parker says:

September 4, 2020 at 15:38

But does exists short circuit the result set? Should we instead use … EXISTS (SELECT top 1 1 FROM actor a JOIN film_actor fa USING (actor_id) WHERE a.last_name = ‘WAHLBERG’)

Loading...

Reply
1. lukaseder says:
  
  September 4, 2020 at 16:45
  
  According to my benchmarks, yes it does. Try it with TOP 1 on your RDBMS, it shouldn’t make a difference.
  
  Loading...
  
  Reply

Avoid Using COUNT() in SQL When You Could Use EXISTS()

The rationale is simple

How to read this?

Does this apply to me?

OK, costs. But what does it mean?

Conclusion

Like this:

Published by lukaseder

23 thoughts on “Avoid Using COUNT() in SQL When You Could Use EXISTS()”

Leave a Reply to lukasederCancel reply

The rationale is simple

How to read this?

Does this apply to me?

OK, costs. But what does it mean?

Conclusion

Like this:

Published by lukaseder

23 thoughts on “Avoid Using COUNT() in SQL When You Could Use EXISTS()”

Leave a Reply to lukasederCancel reply

Discover more from Java, SQL and jOOQ.