Counting Distinct Records in SQL

The SQL language and its depths… Some of you readers might be aware of MySQL’s capability of counting distinct records through the COUNT() aggregate function. The MySQL documentation reads:

COUNT(DISTINCT expr,[expr...])

Returns a count of the number of rows with different non-NULL expr values.

In other words, you can count distinct first and last names very easily:

SELECT COUNT(DISTINCT FIRST_NAME, LAST_NAME)
FROM CUSTOMERS

That’s quite useful, but MySQL-specific (although HSQLDB also supports this particular MySQL syntax). Most other databases, however, do not offer such a feature, even if the SQL-99 standard has specified it long ago:

6.16  <set function specification>

<set function specification> ::=
    COUNT <left paren> <asterisk> <right paren>
  | <general set function>

<general set function> ::=
    <set function type>
        <left paren> [ <set quantifier> ] 
            <value expression> <right paren>

<set function type> ::=
    <computational operation>

<computational operation> ::=
    AVG | MAX | MIN | SUM
  | EVERY | ANY | SOME
  | COUNT

<set quantifier> ::= DISTINCT | ALL

The above was later on merged into 10.9 <aggregate function>, in SQL:2011, with lots of other aggregate function types and features. Now, let’s have a look at 6.23 <value expression>:

6.23 <value expression>

<value expression> ::=
<numeric value expression>
| <string value expression>
| <datetime value expression>
| <interval value expression>
| <boolean value expression>
| <user-defined type value expression>
| <row value expression>                <-- RVE!
| <reference value expression>
| <collection value expression>

Interestingly, you can put a row value expression in some of your aggregate functions. There are additional restrictions, as you cannot SUM() or AVG() records, for instance. But with COUNT() and COUNT(DISTINCT ...) this makes perfect sense. So, according to the SQL standard (and according to HSQLDB’s alternative, and PostgreSQL’s main implementation, which unfortunately isn’t really documented), the following would be the correct way to count distinct first and last names in SQL:

SELECT COUNT(DISTINCT (FIRST_NAME, LAST_NAME))
FROM CUSTOMERS

… which makes perfect sense. Watch out when reading this. This isn’t a DISTINCT() function! This is the DISTINCT keyword applied to a row value expression. See some previous blog posts about row value expressions here:

jOOQ standardises various SQL dialects and hides the above behind the DSL.countDistinct() method.

When you’re not using HSQLDB, MySQL, or PostgreSQL

In other databases, you might resort to some manual SQL transformation. Our example is easy to transform:

SELECT COUNT(*)
FROM (
  SELECT DISTINCT FIRST_NAME, LAST_NAME
  FROM CUSTOMERS
) t

But such query transformation can prove to be tricky, once you have GROUP BY clauses or other aggregate functions involved. Future versions of jOOQ might do that sort of transformation for you.

Row value expressions and the BETWEEN predicate

Now this is a simple example of how SQL clause simulation can get nasty if you want to make use of some more advanced SQL clauses that aren’t supported in all databases. Consider the following predicate and equivalent transformations thereof:

The BETWEEN predicate

The BETWEEN predicate is a convenient form of expressing the fact that one expression A should be in BETWEEN two other expressions B and C. This predicate was defined already in §8.4 of SQL-1992, and then refined in SQL-1999 (adding ASYMMETRIC/SYMMETRIC):

8.3 <between predicate>

Function
    Specify a range comparison.

Format
    <between predicate> ::=
        <row value expression> [ NOT ] BETWEEN 
          [ ASYMMETRIC | SYMMETRIC ]
          <row value expression> AND <row value expression>

While ASYMMETRIC is just a verbose way of expressing the default behaviour of the BETWEEN predicate, SYMMETRIC has the useful property of indicating that the order of B and C is irrelevant. Knowing this, the following transformations can be established:

BETWEEN predicate transformations

The following statements are all equivalent:

A BETWEEN SYMMETRIC B AND C
(A BETWEEN B AND C) OR (A BETWEEN C AND B)
(A >= B AND A <= C) OR (A >= C AND A <= B)

While this is still somewhat readable, try adding row value expressions:

-- The original statement
(A1, A2) BETWEEN SYMMETRIC (B1, B2) AND (C1, C2)

-- Transforming away BETWEEN SYMMETRIC
   (     (A1, A2) >= (B1, B2) 
     AND (A1, A2) <= (C1, C2) )
OR (     (A1, A2) >= (C1, C2) 
     AND (A1, A2) <= (B1, B2) )

-- Transforming away the row value expressions
   (     ((A1 > B1) OR (A1 = B1 AND A2 > B2) OR (A1 = B1 AND A2 = B2))
     AND ((A1 < C1) OR (A1 = C1 AND A2 < C2) OR (A1 = C1 AND A2 = C2)) )
OR (     ((A1 > C1) OR (A1 = C1 AND A2 > C2) OR (A1 = C1 AND A2 = C2))
     AND ((A1 < B1) OR (A1 = B1 AND A2 < B2) OR (A1 = B1 AND A2 = B2)) )

In the lowest expression, some parts could’ve been factored out for “simplicity”. The example is just to give you a picture of what the BETWEEN [SYMMETRIC] predicate really does to row value expressions.

Native SQL support for row value expressions and BETWEEN SYMMETRIC

Here’s a comprehensive list of the 14 SQL dialects supported by jOOQ, and what is natively supported by them:

Database BETWEEN SYMMETRIC RVE = RVE RVE < RVE RVE BETWEEN
CUBRID [1] no yes no no
DB2 no yes yes yes
Derby no no no no
Firebird no no no no
H2 [2] no yes yes yes
HSQLDB yes yes yes yes
Ingres yes no no no
MySQL no yes yes no
Oracle no yes no no
Postgres yes yes yes yes
SQL Server no no no no
SQLite no no no no
Sybase ASE no no no no
Sybase SQL Anywhere no no no no

Explanation:

  • The BETWEEN SYMMETRIC column indicates, whether the database supports the SYMMETRIC keyword in general
  • The RVE = RVE column indicates, whether the database supports row value expressions in general (e.g. in equal comparison predicates)
  • The RVE < RVE column indicates, whether the database supports “ordering” comparison predicates (<, <=, >, >=) along with row value expressions
  • The RVE BETWEEN column indicates, whether the database supports the BETWEEN predicates along with row value expressions

Footnotes:

  • [1]: CUBRID doesn’t really support row value expressions. What looks like a RVE is in fact a SET in CUBRID
  • [2]: H2 doesn’t really support row value expressions. What looks like a RVE is in fact an ARRAY in H2

The IS DISTINCT FROM predicate

The SQL-1999 standard specifies a useful IS DISTINCT FROM predicate, that behaves slightly different from the regular not equal comparison predicate. Here is its simple definition:

8.13 <distinct predicate>

Format
<distinct predicate> ::=
  <row value expression 3> IS DISTINCT FROM <row value expression 4>
  <row value expression 3> ::= <row value expression>
  <row value expression 4> ::= <row value expression>

Purpose of the DISTINCT predicate

The purpose of this DISTINCT predicate is to alleviate handling NULL / UNKNOWN values in predicates. This is best explained using the following truth tables:

Not equal comparison predicate

To many SQL beginners, the following truth table might appear a bit awkward to memorise, as it is not really intuitive if you know NULL from other languages.

  • NULL != NULL yields NULL (not FALSE!)
  • NULL != [ANY] yields NULL (not TRUE!)
  • [ANY] != NULL yields NULL (not TRUE!)
  • [ANY] != [ANY] yields TRUE/FALSE

DISTINCT predicate

The DISTINCT predicate treats NULL like most other languages do

  • NULL IS DISTINCT FROM NULL yields FALSE
  • NULL IS DISTINCT FROM [ANY] yields TRUE
  • [ANY] IS DISTINCT FROM NULL yields TRUE
  • [ANY] IS DISTINCT FROM [ANY] yields TRUE/FALSE
  • NULL IS NOT DISTINCT FROM NULL yields TRUE
  • NULL IS NOT DISTINCT FROM [ANY] yields FALSE
  • [ANY] IS NOT DISTINCT FROM NULL yields FALSE
  • [ANY] IS NOT DISTINCT FROM [ANY] yields TRUE/FALSE

Support of the DISTINCT predicate in various SQL dialects

Not all SQL dialects natively support this predicate. As a matter of fact, only these few open source ones do:

  • Firebird
  • H2
  • HSQLDB
  • Postgres

MySQL

MySQL doesn’t support the SQL standard DISTINCT predicate, but it knows a special equal-to (<=>) operator that behaves the same way

The others

The other databases currently supported by jOOQ do not implement the DISTINCT predicate. These include

  • CUBRID
  • DB2
  • Derby
  • Ingres
  • Oracle
  • SQL Server
  • SQLite
  • Sybase ASE
  • Sybase SQL Anywhere

However, the DISTINCT predicate is simulated by jOOQ as follows:

CASE WHEN [this] IS     NULL AND [field] IS     NULL THEN FALSE
     WHEN [this] IS     NULL AND [field] IS NOT NULL THEN TRUE
     WHEN [this] IS NOT NULL AND [field] IS     NULL THEN TRUE
     WHEN [this] =               [field]             THEN FALSE
     ELSE                                                 TRUE
END

Publicly available SQL standard documents

For your reference, here is a list of publicly available SQL standard documents. Some are very late draft versions, where the final version is closed source and has to be purchased from ISO or ANSI directly:

The links were taken from here:

http://wiki.postgresql.org/wiki/Developer_FAQ