In a recent blog post, I’ve advocated against the use of COUNT(*) in SQL, when a simple EXISTS() would suffice. This is important stuff. I keep tuning productive queries where a customer runs a COUNT(*) query like so:
SELECT count(*) INTO v_any_wahlbergs FROM actor a JOIN film_actor fa USING (actor_id) WHERE a.last_name = 'WAHLBERG'
… where after they discard the exact count to only check for existence:
IF v_any_wahlbergs = 0 THEN something(); ELSE something_else(); END IF;
It doesn’t matter if the client logic is written in PL/SQL (as above), or in any other language like Java, the overhead is significant compared to the following, much simpler EXISTS()
query:
SELECT CASE WHEN EXISTS ( SELECT * FROM actor a JOIN film_actor fa USING (actor_id) WHERE a.last_name = 'WAHLBERG' ) THEN 1 ELSE 0 END INTO v_any_wahlbergs FROM dual
Clearly, you can see the effect is the same, and the database can optimise the existence check by taking a short cut once it has found at least one result (instead of going through the entire result set to get the exact count, which wasn’t needed in the first place).
This is true also for primary key checks
The above is obvious. But then, during a recent execution of my SQL Masterclass training, one of the delegates asked me a very interesting question.
Is this also true for primary key checks?
Now, I personally always prefer the EXISTS()
syntax, because it clearly communicates that I’m after an existence check, not an actual count query. But in principle, the following two queries are exactly the same:
-- Can be 0 or 1 SELECT count(*) FROM film WHERE film_id = 1; -- Can also be 0 or 1 SELECT CASE WHEN EXISTS ( SELECT * FROM film WHERE film_id = 1 ) THEN 1 ELSE 0 END FROM dual;
I always say: Never guess, always measure.
Comparing techniques in Oracle
Here’s a benchmark in Oracle 11g XE running against the Sakila database:
SET SERVEROUTPUT ON DECLARE v_ts TIMESTAMP; v_repeat CONSTANT NUMBER := 1000000; BEGIN v_ts := SYSTIMESTAMP; FOR i IN 1..v_repeat LOOP FOR rec IN ( SELECT count(*) FROM film WHERE film_id = 1 ) LOOP NULL; END LOOP; END LOOP; dbms_output.put_line('Statement 1 : ' || (SYSTIMESTAMP - v_ts)); v_ts := SYSTIMESTAMP; FOR i IN 1..v_repeat LOOP FOR rec IN ( SELECT CASE WHEN EXISTS ( SELECT * FROM film WHERE film_id = 1 ) THEN 1 ELSE 0 END FROM dual ) LOOP NULL; END LOOP; END LOOP; dbms_output.put_line('Statement 2 : ' || (SYSTIMESTAMP - v_ts)); END; /
The result is (only qualitative numbers, not actual seconds because benchmark results comparing Oracle with other DBs must not be published in Oracle :( )
Statement 1 : 0.0000012 slurbs Statement 2 : 0.0000011 slurbs
As you can see, the EXISTS()
query still slightly outperforms the COUNT(*)
query in Oracle.
Comparing techniques in PostgreSQL
Let’s repeat the same in PostgreSQL 9.5:
DO $$ DECLARE v_ts TIMESTAMP; v_repeat CONSTANT INT := 1000000; rec RECORD; BEGIN v_ts := clock_timestamp(); FOR i IN 1..v_repeat LOOP FOR rec IN ( SELECT count(*) FROM film WHERE film_id = 1 ) LOOP NULL; END LOOP; END LOOP; RAISE INFO 'Statement 1: %', (clock_timestamp() - v_ts); v_ts := clock_timestamp(); FOR i IN 1..v_repeat LOOP FOR rec IN ( SELECT EXISTS ( SELECT * FROM film WHERE film_id = 1 ) ) LOOP NULL; END LOOP; END LOOP; RAISE INFO 'Statement 2: %', (clock_timestamp() - v_ts); END$$;
This time, with actual seconds because yay, Open Source can be benchmarked freely:
INFO: Statement 1: 00:00:13.32556 INFO: Statement 2: 00:00:08.350491
Oh wow! That’s a massive improvement in PostgreSQL! OK, we know that COUNT(*)
is slow in PostgreSQL. Here’s tons of excuses why.
Comparing techniques in SQL Server
Let’s repeat on SQL Server 2014. And please, observe the beautiful procedural language called T-SQL, which doesn’t even support implicit cursor loops as Oracle and PostgreSQL:
USE sakila DECLARE @ts DATETIME; DECLARE @repeat INT = 1000000; DECLARE @i INT; DECLARE @dummy VARCHAR; DECLARE @s1 CURSOR; DECLARE @s2 CURSOR; SET @s1 = CURSOR FOR SELECT count(*) FROM film WHERE film_id = 1; SET @s2 = CURSOR FOR SELECT CASE WHEN EXISTS ( SELECT * FROM film WHERE film_id = 1 ) THEN 1 ELSE 0 END; SET @ts = current_timestamp; SET @i = 0; WHILE @i < @repeat BEGIN SET @i = @i + 1 OPEN @s1; FETCH NEXT FROM @s1 INTO @dummy; WHILE @@FETCH_STATUS = 0 BEGIN FETCH NEXT FROM @s1 INTO @dummy; END; CLOSE @s1; END; DEALLOCATE @s1; PRINT 'Statement 1: ' + CAST(DATEDIFF(ms, @ts, current_timestamp) AS VARCHAR) + 'ms'; SET @ts = current_timestamp; SET @i = 0; WHILE @i < @repeat BEGIN SET @i = @i + 1 OPEN @s2; FETCH NEXT FROM @s2 INTO @dummy; WHILE @@FETCH_STATUS = 0 BEGIN FETCH NEXT FROM @s2 INTO @dummy; END; CLOSE @s2; END; DEALLOCATE @s2; PRINT 'Statement 2: ' + CAST(DATEDIFF(ms, @ts, current_timestamp) AS VARCHAR) + 'ms';
Again, I must not publish actual results because benchmarks results must not be published in SQL Server either, so let’s anonymise by introducing another new unit of measurement:
Statement 1: OVER 9000 Statement 2: OVER 8999
Oh excellent! No measurable difference in SQL Server, this time!
Conclusion
While the difference between COUNT(*)
and EXISTS()
queries is drastic for ordinary queries that may possibly return thousands of rows (and thus counts of > 1000), the difference for primary key checks is:
- Marginal but still worth improving in Oracle
- Significant in PostgreSQL
- Non existent in SQL Server
So, there’s nothing wrong with consistently applying EXISTS()
in all of your queries.
If you’re using jOOQ, getting it right is even easier. Just run:
boolean exists = ctx.fetchExists( select() .from(ACTOR) .join(FILM_ACTOR).using(ACTOR.ACTOR_ID) .where(ACTOR.LAST_NAME.eq("WAHLBERG")) );
And jOOQ will wrap your query in that EXISTS()
block for you.