A while ago, I blogged about the importance of avoiding unnecessary COUNT(*) queries:
https://blog.jooq.org/sql-tip-of-the-day-be-wary-of-select-count
… and how to replace them with equivalent EXISTS queries

As I’m updating the SQL training to show also PostgreSQL performance characteristics in addition to Oracle, I really have to reiterate this topic. Please repeat after me:
Thou shalt not use COUNT(*) when EXISTS sufficeth thy needs
The rationale is simple
COUNT(*) needs to return the exact number of rows. EXISTS only needs to answer a question like:
“Are there any rows at all?”
In other words, EXISTS can short-circuit after having found the first matching row. If your client code (e.g. written in Java or in PL/SQL, or any other client language) needs to know something like:
“Did actors called “Wahlberg” play in any films at all?”
Then you have two options to write that query:
Very very bad: Use COUNT(*)
Using PostgreSQL syntax:
SELECT count(*)
FROM actor a
JOIN film_actor fa USING (actor_id)
WHERE a.last_name = 'WAHLBERG'
The above query will return a number > 0 if we any Wahlberg played in a film, or 0 if not. Notice that we don’t care how many films all the Wahlbergs played in, yet we ask the database to calculate the precise number.
Let’s run the above query against the Sakila database. The execution plans for the above query in Oracle:

And in PostgreSQL:

Much much better: Use EXISTS()
Using PostgreSQL syntax:
SELECT EXISTS (
SELECT 1 FROM actor a
JOIN film_actor fa USING (actor_id)
WHERE a.last_name = 'WAHLBERG'
)
The execution plans for the above query in Oracle:

And in PostgreSQL:

How to read this?
As you can see from the above execution plans, the cost in Oracle is slightly better (going from 3 to 2) when using EXISTS than when using COUNT(*), because of a much better cardinality estimate in the middle of the execution plan. In other words, Oracle “knows” that we’re looking for only one record and as soon as that record has been found, we can stop looking.
In PostgreSQL, things are more drastic (going from 123 to 3.4). The EXISTS version has an associated cost that is almost 30x lower than the version that uses COUNT(*) for the same result.
You can gaze at the plan for a while and figure out what the exact difference is, or you can believe me that this is true:
It is obviously much faster to check for existence rather than to count all results, if what you’re really trying to do is checking for existence
Duh.
Does this apply to me?
Yes. I’m taking bets. Many many code bases out there get this wrong all the time. Checking for sizes to be zero is just too convenient. Not only in SQL, but also in Java. Consider this. Which one is better?
Collection<?> collection = ...
// EXISTS
if (!collection.isEmpty())
doSomething();
// COUNT(*)
if (collection.size() == 0)
doSomething();
Sometimes, this doesn’t really matter, e.g. in ArrayList
, whose isEmpty()
method reads:
public boolean isEmpty() {
return size == 0;
}
But what if your collection is a lazy loaded Hibernate collection? Not all collections cache this size value, and even if they do, they may still produce overhead in the source system in order to calculate the exact size. In fact, they might even run a completely unnecessary query fetching all the child entities from the database just to check for existence.
Bonus exercise for my Hibernate-aficionado readers out there: Do the exercise with Hibernate. Because at this point, I for one would say: Just use SQL™
OK, costs. But what does it mean?
Let’s benchmark these two statements in Oracle and PostgreSQL.
Oracle
SET SERVEROUTPUT ON
DECLARE
v_ts TIMESTAMP WITH TIME ZONE;
v_repeat CONSTANT NUMBER := 10000;
BEGIN
v_ts := SYSTIMESTAMP;
FOR i IN 1..v_repeat LOOP
FOR rec IN (
SELECT CASE WHEN EXISTS (
SELECT 1 FROM actor a
JOIN film_actor fa USING (actor_id)
WHERE a.last_name = 'WAHLBERG'
) THEN 1 ELSE 0 END
FROM dual
) LOOP
NULL;
END LOOP;
END LOOP;
dbms_output.put_line('Statement 1 : ' || (SYSTIMESTAMP - v_ts));
v_ts := SYSTIMESTAMP;
FOR i IN 1..v_repeat LOOP
FOR rec IN (
SELECT count(*)
FROM actor a
JOIN film_actor fa USING (actor_id)
WHERE a.last_name = 'WAHLBERG'
) LOOP
NULL;
END LOOP;
END LOOP;
dbms_output.put_line('Statement 2 : ' || (SYSTIMESTAMP - v_ts));
END;
/
We get a slight but significant performance improvement of factor 1.3x:
Statement 1 : 3 Statement 2 : 4
(not actual times, because those can’t be published for commercial DBs). But you can check out the Sakila database yourself and run the above benchmark on your machine.
PostgreSQL
DO $$
DECLARE
v_ts TIMESTAMP;
v_repeat CONSTANT INT := 1000;
rec RECORD;
BEGIN
v_ts := clock_timestamp();
FOR i IN 1..v_repeat LOOP
FOR rec IN (
SELECT CASE WHEN EXISTS (
SELECT 1 FROM actor a
JOIN film_actor fa USING (actor_id)
WHERE a.last_name = 'WAHLBERG'
) THEN 1 ELSE 0 END
) LOOP
NULL;
END LOOP;
END LOOP;
RAISE INFO 'Statement 1: %', (clock_timestamp() - v_ts);
v_ts := clock_timestamp();
FOR i IN 1..v_repeat LOOP
FOR rec IN (
SELECT count(*)
FROM actor a
JOIN film_actor fa USING (actor_id)
WHERE a.last_name = 'WAHLBERG'
) LOOP
NULL;
END LOOP;
END LOOP;
RAISE INFO 'Statement 2: %', (clock_timestamp() - v_ts);
END$$;
A whopping factor 40x in terms of wallclock time gain!
INFO: Statement 1: 00:00:00.023656 INFO: Statement 2: 00:00:00.7944
Let me repeat this:
Factor 40x on PostgreSQL
That’s something! It looks as though COUNT(*)
is much better optimised on Oracle (e.g. by counting leaf nodes in an index) than on PostgreSQL, but in any case, the amount of extra work is prohibitive in both databases.
Conclusion
I’m repeating myself, but this is important. Print it out and put it on your office wall:
Thou shalt not use COUNT(*) when EXISTS sufficeth thy needs
Thank you.
Should I? There’s
Subqueries.exists, but I don’t know how to use when I have already a plain Criteria instead of DetachedCriteria (rather painful, I know). I’m just asking on SO.
Do you want an exercise, too? In one place I used something like
getSingleQueryResult(“SELECT id FROM … LIMIT 1”) != null
instead of
getSingleQueryResult(“SELECT EXISTS (SELECT 1 FROM …)”)
It works as I know that the id can’t be null. The speed should be about the same, but is it? Fetching the single value from the PK shouldn’t matter.
I should’ve been more clear. I meant when entities are navigated. Kind of like
I bet there are tons of applications out there that get this wrong, materialising the entire films relationship just to check for existence.
I see. That’s indeed stupid. I could imagine using it somewhere where speed doesn’t matter at all (something to be run just once) out of laziness (imagine getFilms() uses a complicated query).
In theory, Hibernate could optimize this easily. Return a list proxy and use EXISTS when isEmpty gets called. This would be wasteful, when the list content is needed later, but that’s something the optimizer could learn from previous queries (I guess I’m repeating myself; sort of JIT for ORM).
Speed always matters at some point. Perhaps only in production, though ;) a.k.a.: “Works on my machine”
Hibernate already does that with the list proxy. But as far as I know, it calls COUNT(*) for all of these methods, including
isEmpty()
. But then again, the client will callisEmpty()
for every actor in a client loop, and I doubt that JIT for ORM is a thing anytime soon, such that this client side loop will be translated to a JOIN…> Speed always matters at some point. Perhaps only in production, though😉
:D Even in production, there are things which hardly matters. A small data fix I wrote last week is a good example: it doesn’t take long (we don’t have much data now), it’s needed just once and it’ll get dropped after a while.
> JIT for ORM
We may want to call it AI as it can get pretty complicated. Using EXISTS instead of COUNT is trivial, finding out that there are many DB calls executing the same query with different parameters is easy, replacing them all with a single call is doable. Getting it all work well is probably hard, but JVM JIT isn’t simple either.
I agree. On both. :)
In other words EXISTS short circuits and returns true on the first occurrence of a match
Exactly. Thanks for the feedback. Will edit the article to clarify this.
what about mysql?
This is also true for MySQL
what about “select … limit 1” ?
It depends. If your database supports LIMIT without ORDER BY, then this might be as good as EXISTS. But if ORDER BY is required, a LIMIT solution is at best as good as COUNT(*) and perhaps even worse.
In any case: Measure!
Why do some databases require ORDER BY with LIMIT? Do they refuse to return an essentially random subset of the full result? That sound naive, as (I hope that) they don’t require the order to be total.
Another thing: EXISTS could be also a grouping function like in
couldn’t it? This would probably make the misuses of COUNT less common.
Yes, unfortunately they do. E.g. SQL Server allows OFFSET .. FETCH only with ORDER BY (although TOP is possible without ORDER BY). I’m pretty sure Oracle does optimisations for ROWNUM = 1 filters, where only the “max” row needs to be fetched (in O(N)) rather than actually performing the entire sort (in O(N log N)). Or, if no ORDER BY clause is present, then ROWNUM = 1 will yield a random row as well.
PostgreSQL has the
bool_or()
aggregate function, which I’m sure is short-circuited as soon as the first evaluation of the argument expression yields true.Let’s find out for PostgreSQL: https://stackoverflow.com/q/39695647/521799
And how to write the SELECT EXIST(…) statement in JOOQ?
You can use
DSL.field(Condition)
to wrap a predicate in a boolean column expression:https://www.jooq.org/javadoc/latest/org/jooq/impl/DSL.html#field-org.jooq.Condition-
Example:
A shortcut might be
DSLContext.fetchExists(Select)
, the convenience API:https://www.jooq.org/javadoc/latest/org/jooq/DSLContext.html#fetchExists-org.jooq.Select-
You can have
SELECT count(1) …
WHERE …
AND rownum = 1;
this will prevent full table scan.
Yes, I’ve seen that practice occasionally, but I find that really weird and unreadable. But indeed, I haven’t found a drawback compared to the EXISTS() approach.
EXISTS is far more meaningful as a word, I agree on this, but if you are using Oracle, there are drawbacks such as you have to use CASE WHEN EXIST with another SELECT statement. You should also specify THEN .. ELSE .. FROM DUAL.
To me, this is really long statement to read to see if a row exists in a table.
For other DBs, EXIST looks more readable, but I would disagree for the Oracle.
Well, feel free to upvote these feature requests :)
https://community.oracle.com/ideas/13697 (Support for the SQL standard VALUES() table constructor)
https://community.oracle.com/ideas/2633 (Support for the BOOLEAN type in SQL)
But does exists short circuit the result set? Should we instead use … EXISTS (SELECT top 1 1 FROM actor a JOIN film_actor fa USING (actor_id) WHERE a.last_name = ‘WAHLBERG’)
According to my benchmarks, yes it does. Try it with TOP 1 on your RDBMS, it shouldn’t make a difference.