One of the biggest and undead myths in SQL is that COUNT(*) is faster than COUNT(1). Or was it that COUNT(1) is faster than COUNT(*)? Impossible to remember, because there’s really no reason at all why one should be faster than the other. But is the myth justified?
Let’s measure!
How does COUNT(…) work?
But first, let’s look into some theory. The two ways to count things are not exactly the same thing. Why?
COUNT(*)
counts all the tuples in a groupCOUNT(<expr>)
counts all the tuples in a group for which<expr>
evaluates to something thatIS NOT NULL
This distinction can be quite useful. Most of the time, we’ll simply COUNT(*)
for convenience, but there are (at least) two cases where we don’t want that, for example:
When outer joining
Imagine that in the Sakila database, we have some actors that did not play in any films. Making sure such an actor actually exists:
INSERT INTO actor (actor_id, first_name, last_name)
VALUES (201, 'SUSAN', 'DAVIS');
When inner joining, we might write the following (using PostgreSQL syntax):
SELECT actor_id, a.first_name, a.last_name, count(*) AS c
FROM actor AS a
JOIN film_actor AS fa USING (actor_id)
JOIN film AS f USING (film_id)
GROUP BY actor_id
ORDER BY c ASC, actor_id ASC;
And we won’t get the newly added SUSAN DAVIS, because of the nature of inner join:
actor_id|first_name |last_name |c | --------|-----------|------------|--| 148|EMILY |DEE |14| 35|JUDY |DEAN |15| 199|JULIA |FAWCETT |15| 186|JULIA |ZELLWEGER |16| 31|SISSY |SOBIESKI |18| 71|ADAM |GRANT |18| 1|PENELOPE |GUINESS |19| 30|SANDRA |PECK |19|
So we might change our query to use LEFT JOIN
instead
SELECT actor_id, a.first_name, a.last_name, count(*) AS c
FROM actor AS a
LEFT JOIN film_actor AS fa USING (actor_id)
LEFT JOIN film AS f USING (film_id)
GROUP BY actor_id
ORDER BY c ASC, actor_id ASC;
There she is now, but oops, wrong count! She doesn’t have any films, which we have proven before with the INNER JOIN
query. Yet we get 1
:
actor_id|first_name |last_name |c | --------|-----------|------------|--| 201|SUSAN |DAVIS | 1| 148|EMILY |DEE |14| 35|JUDY |DEAN |15| 199|JULIA |FAWCETT |15| 186|JULIA |ZELLWEGER |16| 31|SISSY |SOBIESKI |18| 71|ADAM |GRANT |18| 1|PENELOPE |GUINESS |19| 30|SANDRA |PECK |19|
Her COUNT(*)
value is 1, because we do get 1 film tuple for her in the group, with all columns being NULL
. The solution is to count the FILM_ID
instead, which cannot be NULL
in the table (being a primary key), but only because of the LEFT JOIN
:
SELECT actor_id, a.first_name, a.last_name, count(film_id) AS c
FROM actor AS a
LEFT JOIN film_actor AS fa USING (actor_id)
LEFT JOIN film AS f USING (film_id)
GROUP BY actor_id
ORDER BY c ASC, actor_id ASC;
Notice, we could count other things than the primary key, but with the primary key, we’re quite certain we don’t get any other “accidental” nulls in our groups, which we did not want to exclude from the count value.
Now, we’re getting the correct result:
actor_id|first_name |last_name |c | --------|-----------|------------|--| 201|SUSAN |DAVIS | 0| 148|EMILY |DEE |14| 35|JUDY |DEAN |15| 199|JULIA |FAWCETT |15| 186|JULIA |ZELLWEGER |16| 31|SISSY |SOBIESKI |18| 71|ADAM |GRANT |18| 1|PENELOPE |GUINESS |19| 30|SANDRA |PECK |19|
When counting subsets of a group
An even more powerful application of counting only non-null evaluations of an expression is counting only subsets of a group. We’ve already blogged about this technique in our previous post about aggregating several expressions in one single query.
For example, counting in a single query:
- All actors
- Actors with their first_name starting with A
- Actors with their first_name ending with A
- Actors with their first_name containing A
In SQL:
SELECT
count(*),
count(CASE WHEN first_name LIKE 'A%' THEN 1 END),
count(CASE WHEN first_name LIKE '%A' THEN 1 END),
count(CASE WHEN first_name LIKE '%A%' THEN 1 END)
FROM actor;
This yields:
count|count|count|count| -----|-----|-----|-----| 201| 13| 30| 105|
This is very useful when pivoting data sets (see also Oracle/SQL Server PIVOT clause).
Notice that PostgreSQL supports the SQL standard FILTER
clause for this, which is more convenient and more readable. The above query can be written like this, in PostgreSQL:
SELECT
count(*),
count(*) FILTER (WHERE first_name LIKE 'A%'),
count(*) FILTER (WHERE first_name LIKE '%A'),
count(*) FILTER (WHERE first_name LIKE '%A%')
FROM actor;
Back to COUNT(*) vs COUNT(1)
Now that we know the theory behind these COUNT
expressions, what’s the difference between COUNT(*)
and COUNT(1)
. There is none, effectively. The 1
expression in COUNT(1)
evaluates a constant expression for each row in the group, and it can be proven that this constant expression will never evaluate to NULL
, so effectively, we’re running COUNT(*)
, counting ALL the rows in the group again.
There should be no difference, and parsers / optimisers should be able to recognise this and not do the extra work of checking every expression evaluation for NULL
-ness.
I recently saw this discussion on Twitter, though, where Vik Fearing looked up the PostgreSQL sources, showing that PostgreSQL does do the extra work instead of optimising this:
So, I was curious to see if it mattered. I ran a benchmark on the 4 most popular RDBMS, with these results:
- MySQL: Doesn’t matter. Sometimes
COUNT(1)
was faster, sometimesCOUNT(*)
was faster, so all differences were only benchmark artifacts - Oracle: Doesn’t matter. Like MySQL
- PostgreSQL: Does matter (!).
COUNT(*)
was consistently faster by around 10% on 1M rows, that’s much more than I had expected - SQL Server: Doesn’t matter. Like MySQL
The benchmark code can be found in the following gists:
The results are below. Each benchmark run repeated SELECT COUNT(*) FROM t
or SELECT COUNT(1) FROM t
100 times on a 1M row table, and then the benchmark was repeated 5 times to mitigate any warmup penalties and be fair with respect to caching.
The times displayed are relative to the fastest run per database product. This removes any distraction that may be caused by interpreting actual execution times as we do not want to compare database products against each other.
The database versions I’ve used are:
- MySQL 8.0.16 (in Docker)
- Oracle 18c XE (in Docker)
- PostgreSQL 11.3 (in Docker)
- SQL Server 2017 Express (in Windows)
MySQL
No relevant difference, nor a clear winner:
RUN STMT RELATIVE_TIME ----------------------------- 0 1 1.0079 0 2 1.0212 1 1 1.0229 1 2 1.0256 2 1 1.0009 2 2 1.0031 3 1 1.0291 3 2 1.0256 4 1 1.0618 4 2 1.0000
Oracle
No relevant difference, nor a clear winner
Run 1, Statement 1 : 1.06874 Run 1, Statement 2 : 1.01982 Run 2, Statement 1 : 1.09175 Run 2, Statement 2 : 1.0301 Run 3, Statement 1 : 1.00308 Run 3, Statement 2 : 1.02499 Run 4, Statement 1 : 1.02503 Run 4, Statement 2 : 1 Run 5, Statement 1 : 1.01259 Run 5, Statement 2 : 1.05828
PostgreSQL
A significant, consistent difference of almost 10%:
RUN 1, Statement 1: 1.00134 RUN 1, Statement 2: 1.09538 RUN 2, Statement 1: 1.00190 RUN 2, Statement 2: 1.09115 RUN 3, Statement 1: 1.00000 RUN 3, Statement 2: 1.09858 RUN 4, Statement 1: 1.00266 RUN 4, Statement 2: 1.09260 RUN 5, Statement 1: 1.00454 RUN 5, Statement 2: 1.09694
Again, I’m surprised by the order of magnitude of this difference. I would have expected it to be less. Curious to hear about your own results in the comments, or further ideas why this is so significant in PostgreSQL.
SQL Server
No relevant difference, nor a clear winner
Run 1, Statement 1: 1.00442 Run 1, Statement 2: 1.00702 Run 2, Statement 1: 1.00468 Run 2, Statement 2: 1.00000 Run 3, Statement 1: 1.00208 Run 3, Statement 2: 1.00624 Run 4, Statement 1: 1.00780 Run 4, Statement 2: 1.00364 Run 5, Statement 1: 1.00468 Run 5, Statement 2: 1.00702
Conclusion
As it is now in 2019, given the database versions mentioned above, unfortunately, there is a significant difference between COUNT(*)
and COUNT(1)
in PostgreSQL. Luckily (and this is rare in SQL), all the other dialects don’t care and thus, consistently using COUNT(*)
, rather than COUNT(1)
is a slightly better choice for ALL measured database products from this article.
Do note that the benchmark only tried a very simple query! The results might be different when using joins, unions, or any other SQL constructs, or in other edge cases, e.g. when using COUNT()
in HAVING
or ORDER BY
or with window functions, etc.
In any case, there shouldn’t be any difference, and I’m sure that a future PostgreSQL version will optimise the constant expression in the COUNT(<expr>)
aggregate function directly in the parser to avoid the extra work.
For other interesting optimisations that do not depend on the cost model, see this article here.
This is what I get on Postgres 11.5, Windows 10 Laptop with a SSD:
RUN 1, Statement 1: 1.06690
RUN 1, Statement 2: 1.24424
RUN 2, Statement 1: 1.00000
RUN 2, Statement 2: 1.16920
RUN 3, Statement 1: 1.13591
RUN 3, Statement 2: 1.28373
RUN 4, Statement 1: 1.01390
RUN 4, Statement 2: 1.26324
RUN 5, Statement 1: 1.08433
RUN 5, Statement 2: 1.18853
Postgres was using a Parallel Seq Scan with 2 workers
Thanks for the confirmation! So, also significant on your machine.
And what happens if you use count(75697897) or even count(‘dummy’) instead of count(1) ?
(i.e. are the “smarter” RDMS really capable of detecting that the expression is not nullable in an efficient way or have they “hardcoded” the count(1) case ?)
Worth a try ;)
I tried counting the ‘oracle is so cool’ expression.
MySQL: Not impressed
Oracle: Not impressed
PostgreSQL: Same result
SQL Server: Not impressed
I use count(*) all over the place and never really considered if there was a performance hit. ill test it versus Oracle and hsql (use it for embedded database) and see if it makes much difference for me also!
Thanks for the feedback. Will be very curious to learn how HSQL performs here. I suspect it will be like PostgreSQL, without any optimisations on COUNT(1)
Not sure if the mysql test case is relevant in this case. Using a cursor forces the use of a temporary table and this has a huge penalty in performance compared with the size of the dataset.
Very interesting, thanks for sharing. Historically, I’ve been able to detect significant performance differences in MySQL, using this approach, but I’m very open to learning how to improve it. Do you have a suggestion that allows for averaging out hundreds of executions?
Well, in Oracle it’s quite simple – when we have count(1) oracle rewrite this into count() so there is no way to count(1) be faster than count(). It’s the same really. You can check this on your on :)
https://pretius.com/oracle-count-or-count1-that-is-the-question/
https://pretius.com/oracle-count-vs-count1-the-final-battle/
Yes indeed, Oracle gets it right. Thanks for your comment, Monika
The myth I always heard, especially in the early 2000s, was that COUNT(1) was faster than the more popular COUNT(). Sometimes some smarty-pants would try to get people to edit their code to change COUNT() to COUNT(1) – the (incorrect) justification usually being something like COUNT(*) has to check if all columns are null whereas COUNT(1) is just checking if a single constant is null.
Tom Kyte (now-retired Oracle guru and the guy formerly behind AskTom) had to disprove this myth a few times by posting test cases and showing the plans as well as timings. Some people just want to repeat performance tidbits they heard without ever verifying them. :(
The SQL ecosystem is full of people who make spurious claims about SQL performance based on some syntax, without the least bit of verification. I mean, it’s ridiculous in the first place, given SQL’s 4GL promise, to think that one syntax should be faster than another. They should be equally fast, if run by a perfect optimiser. So, all differences are really just implementation details, or if you will, “bugs” of specific vendors and versions.
For example, almost at the same time as you made your comment, this comment came in, just to prove the point :)
https://blog.jooq.org/2016/05/27/correlated-subqueries-are-evil-and-slow-or-are-they/#comment-329222
Recently needed to change count(*) to count(1) within some exists/not exists sub queries in views because in PostgreSQL count(*) created dependencies on the columns in the queried tables, so I was dropping columns and needing to recreate those views, which I did not expect. Very much appreciate your blog content as SQL development is my main job.
Interesting, I can’t reproduce this naively even on PostgreSQL 9.3, let alone 15. I’ve tried:
May be an edge case / bug worth reporting to the community, if you can find a minimal reproducer? https://www.postgresql.org/docs/current/bug-reporting.html
I don’t see why you should have experienced the behaviour you did