Don’t Even use COUNT(*) For Primary Key Existence Checks

In a recent blog post, I’ve advocated against the use of COUNT(*) in SQL, when a simple EXISTS() would suffice. This is important stuff. I keep tuning productive queries where a customer runs a COUNT(*) query like so:

SELECT count(*)
INTO v_any_wahlbergs
FROM actor a
JOIN film_actor fa USING (actor_id)
WHERE a.last_name = 'WAHLBERG'

… where after they discard the exact count to only check for existence:

IF v_any_wahlbergs = 0 THEN
  something();
ELSE
  something_else();
END IF;

It doesn’t matter if the client logic is written in PL/SQL (as above), or in any other language like Java, the overhead is significant compared to the following, much simpler EXISTS() query:

SELECT CASE WHEN EXISTS (
  SELECT * FROM actor a
  JOIN film_actor fa USING (actor_id)
  WHERE a.last_name = 'WAHLBERG'
) THEN 1 ELSE 0 END
INTO v_any_wahlbergs
FROM dual

Clearly, you can see the effect is the same, and the database can optimise the existence check by taking a short cut once it has found at least one result (instead of going through the entire result set to get the exact count, which wasn’t needed in the first place).

This is true also for primary key checks

The above is obvious. But then, during a recent execution of my SQL Masterclass training, one of the delegates asked me a very interesting question.
Is this also true for primary key checks?
Now, I personally always prefer the EXISTS() syntax, because it clearly communicates that I’m after an existence check, not an actual count query. But in principle, the following two queries are exactly the same:

-- Can be 0 or 1
SELECT count(*) FROM film WHERE film_id = 1;

-- Can also be 0 or 1
SELECT CASE WHEN EXISTS (
  SELECT * FROM film WHERE film_id = 1
) THEN 1 ELSE 0 END
FROM dual;

I always say: Never guess, always measure.

Comparing techniques in Oracle

Here’s a benchmark in Oracle 11g XE running against the Sakila database:

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT NUMBER := 1000000;
BEGIN
  v_ts := SYSTIMESTAMP;
    
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT count(*) FROM film WHERE film_id = 1
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
    
  dbms_output.put_line('Statement 1 : ' || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;
    
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT CASE WHEN EXISTS (
        SELECT * FROM film WHERE film_id = 1
      ) THEN 1 ELSE 0 END
      FROM dual
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
    
  dbms_output.put_line('Statement 2 : ' || (SYSTIMESTAMP - v_ts));
END;
/

The result is (only qualitative numbers, not actual seconds because benchmark results comparing Oracle with other DBs must not be published in Oracle :( )
Statement 1 : 0.0000012 slurbs
Statement 2 : 0.0000011 slurbs
As you can see, the EXISTS() query still slightly outperforms the COUNT(*) query in Oracle.

Comparing techniques in PostgreSQL

Let’s repeat the same in PostgreSQL 9.5:

DO $$
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT INT := 1000000;
  rec RECORD;
BEGIN
  v_ts := clock_timestamp();

  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT count(*) FROM film WHERE film_id = 1
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;

  RAISE INFO 'Statement 1: %', (clock_timestamp() - v_ts); 
  v_ts := clock_timestamp();

  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT EXISTS (
        SELECT * FROM film WHERE film_id = 1
      )
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;

  RAISE INFO 'Statement 2: %', (clock_timestamp() - v_ts); 
END$$;

This time, with actual seconds because yay, Open Source can be benchmarked freely:
INFO:  Statement 1: 00:00:13.32556
INFO:  Statement 2: 00:00:08.350491
Oh wow! That’s a massive improvement in PostgreSQL! OK, we know that COUNT(*) is slow in PostgreSQL. Here’s tons of excuses why.

Comparing techniques in SQL Server

Let’s repeat on SQL Server 2014. And please, observe the beautiful procedural language called T-SQL, which doesn’t even support implicit cursor loops as Oracle and PostgreSQL:

USE sakila
DECLARE @ts DATETIME;
DECLARE @repeat INT = 1000000;
DECLARE @i INT;
DECLARE @dummy VARCHAR;

DECLARE @s1 CURSOR;
DECLARE @s2 CURSOR;

SET @s1 = CURSOR FOR 
SELECT count(*) FROM film WHERE film_id = 1;

SET @s2 = CURSOR FOR 
SELECT CASE WHEN EXISTS (
  SELECT * FROM film WHERE film_id = 1
) THEN 1 ELSE 0 END;


SET @ts = current_timestamp;
SET @i = 0;
WHILE @i < @repeat
BEGIN
  SET @i = @i + 1

  OPEN @s1;
  FETCH NEXT FROM @s1 INTO @dummy;
  WHILE @@FETCH_STATUS = 0
  BEGIN
    FETCH NEXT FROM @s1 INTO @dummy;
  END;

  CLOSE @s1;
END;

DEALLOCATE @s1;
PRINT 'Statement 1: ' + CAST(DATEDIFF(ms, @ts, current_timestamp) AS VARCHAR) + 'ms';

SET @ts = current_timestamp;
SET @i = 0;
WHILE @i < @repeat
BEGIN
  SET @i = @i + 1

  OPEN @s2;
  FETCH NEXT FROM @s2 INTO @dummy;
  WHILE @@FETCH_STATUS = 0
  BEGIN
    FETCH NEXT FROM @s2 INTO @dummy;
  END;

  CLOSE @s2;
END;

DEALLOCATE @s2;
PRINT 'Statement 2: ' + CAST(DATEDIFF(ms, @ts, current_timestamp) AS VARCHAR) + 'ms';

Again, I must not publish actual results because benchmarks results must not be published in SQL Server either, so let’s anonymise by introducing another new unit of measurement:
Statement 1: OVER 9000
Statement 2: OVER 8999
Oh excellent! No measurable difference in SQL Server, this time!

Conclusion

While the difference between COUNT(*) and EXISTS() queries is drastic for ordinary queries that may possibly return thousands of rows (and thus counts of > 1000), the difference for primary key checks is:
  • Marginal but still worth improving in Oracle
  • Significant in PostgreSQL
  • Non existent in SQL Server
So, there’s nothing wrong with consistently applying EXISTS() in all of your queries. If you’re using jOOQ, getting it right is even easier. Just run:

boolean exists = ctx.fetchExists(
  select()
  .from(ACTOR)
  .join(FILM_ACTOR).using(ACTOR.ACTOR_ID)
  .where(ACTOR.LAST_NAME.eq("WAHLBERG"))
);

And jOOQ will wrap your query in that EXISTS() block for you.

Leave a Reply