Do You Really Understand SQL’s GROUP BY and HAVING clauses?

There are some things in SQL that we simply take for granted without thinking about them properly.

One of these things are the GROUP BY and the less popular HAVING clauses. Let’s look at a simple example. For this example, we’ll reiterate the example database we’ve seen in this previous article about the awesome LEAD(), LAG(), FIRST_VALUE(), LAST_VALUE() functions:

CREATE TABLE countries (
  code CHAR(2) NOT NULL,
  year INT NOT NULL,
  gdp_per_capita DECIMAL(10, 2) NOT NULL,
  govt_debt DECIMAL(10, 2) NOT NULL
);

Before there were window functions, aggregations were made only with GROUP BY. A typical question that we could ask our database using SQL is:

What are the top 3 average government debts in percent of the GDP for those countries whose GDP per capita was over 40’000 dollars in every year in the last four years

Whew. Some (academic) business requirements.

In SQL (PostgreSQL dialect), we would write:

select code, avg(govt_debt)
from countries
where year > 2010
group by code
having min(gdp_per_capita) >= 40000
order by 2 desc
limit 3

Or, with inline comments

-- The average government debt
select code, avg(govt_debt)

-- for those countries
from countries

-- in the last four years
where year > 2010

-- yepp, for the countries
group by code

-- whose GDP p.c. was over 40'000 in every year
having min(gdp_per_capita) >= 40000

-- The top 3
order by 2 desc
limit 3

The result being:

code     avg
------------
JP    193.00
US     91.95
DE     56.00

Remember the 10 easy steps to a complete understanding of SQL:

  1. FROM generates the data set
  2. WHERE filters the generated data set
  3. GROUP BY aggregates the filtered data set
  4. HAVING filters the aggregated data set
  5. SELECT transforms the filters aggregated data set
  6. ORDER BY sorts the transformed data set
  7. LIMIT .. OFFSET frames the sorted data set

… where LIMIT .. OFFSET may come in very different flavours.

The empty GROUP BY clause

A very special case of GROUP BY is the explicit or implicit empty GROUP BY clause. Here’s a question that we could ask our database:

Are there any countries at all with a GDP per capita of more than 50’000 dollars?

And in SQL, we’d write:

select true answer
from countries
having max(gdp_per_capita) >= 50000

The result being

answer
------
t

You could of course have used the EXISTS clause instead (please don’t use COUNT(*) in these cases):

select exists(
  select 1 
  from countries 
  where gdp_per_capita >= 50000
);

And we would get, again:

answer
------
t

… but let’s focus on the plain HAVING clause.

Not everyone knows that HAVING can be used all by itself, or what it even means to have HAVING all by itself. Already the SQL 1992 standard allowed for the use of HAVING without GROUP BY, but it wasn’t until the introduction of GROUPING SETS in SQL:1999, when the semantics of this syntax was retroactively unambiguously defined:

7.10 <having clause>

<having clause> ::= HAVING <search condition>

Syntax Rules

1) Let HC be the <having clause>. Let TE be the <table expression> that immediately contains
HC. If TE does not immediately contain a <group by clause>, then GROUP BY ( ) is implicit.

That’s interesting. There is an implicit GROUP BY ( ), if we leave out the explicit GROUP BY clause. If you’re willing to delve into the SQL standard a bit more, you’ll find:

<group by clause> ::=
    GROUP BY <grouping specification>

<grouping specification> ::=
    <grouping column reference>
  | <rollup list>
  | <cube list>
  | <grouping sets list>
  | <grand total>
  | <concatenated grouping>

<grouping set> ::=
    <ordinary grouping set>
  | <rollup list>
  | <cube list>
  | <grand total>

<grand total> ::= <left paren> <right paren>

So, GROUP BY ( ) is essentially grouping by a “grand total”, which is what’s intuitively happening, if we just look for the highest ever GDP per capita:

select max(gdp_per_capita)
from countries;

Which yields:

     max
--------
52409.00

The above query is also implicitly the same as this one (which isn’t supported by PostgreSQL):

select max(gdp_per_capita)
from countries
group by ();

(but beware, this isn’t always the case – read this interesting article by Glenn Paulley for details)

The awesome GROUPING SETs

In this section of the article, we’ll be leaving PostgreSQL land, entering SQL Server land, as PostgreSQL shamefully doesn’t implement any of the following (yet).

Now, we cannot understand the grand total (empty GROUP BY ( ) clause), without having a short look at the SQL:1999 standard GROUPING SETS. Some of you may have heard of CUBE() or ROLLUP() grouping functions, which are just syntactic sugar for commonly used GROUPING SETS. Let’s try to answer this question in a single query:

What are the highest GDP per capita values per year OR per country

In SQL, we’ll write:

select code, year, max(gdp_per_capita)
from countries
group by grouping sets ((code), (year))

Which yields two concatenated sets of records:

code    year    max
------------------------
NULL    2009    46999.00 <- grouped by year
NULL    2010    48358.00
NULL    2011    51791.00
NULL    2012    52409.00

CA      NULL    52409.00 <- grouped by code
DE      NULL    44355.00
FR      NULL    42578.00
GB      NULL    38927.00
IT      NULL    36988.00
JP      NULL    46548.00
RU      NULL    14091.00
US      NULL    51755.00

That’s kind of nice, isn’t it? It’s essentially just the same thing as this query with UNION ALL

select code, null, max(gdp_per_capita)
from countries
group by code
union all
select null, year, max(gdp_per_capita)
from countries
group by year;

In fact, it’s exactly the same thing, as the latter explicitly concatenates two sets of grouped records… i.e. two GROUPING SETS. This SQL Server documentation page also explains it very nicely.

And the most powerful of them all: CUBE()

Now, imagine, you’d like to add the “grand total”, and also the highest value per country AND year, producing four different concatenated sets. To limit the results, we’ll also filter out GDPs of less than 48000 for this example:

select 
  code, year, max(gdp_per_capita), 
  grouping_id(code, year) grp
from countries
where gdp_per_capita >= 48000
group by grouping sets (
  (), 
  (code), 
  (year), 
  (code, year)
)
order by grp desc;

This nice-looking query will now produce all the possible grouping combinations that we can imagine, including the grand total, in order to produce:

code    year    max         grp
---------------------------------
NULL    NULL    52409.00    3 <- grand total

NULL    2012    52409.00    2 <- group by year
NULL    2010    48358.00    2
NULL    2011    51791.00    2

CA      NULL    52409.00    1 <- group by code
US      NULL    51755.00    1

US      2010    48358.00    0 <- group by code and year
CA      2012    52409.00    0
US      2012    51755.00    0
CA      2011    51791.00    0
US      2011    49855.00    0

And because this is quite a common operation in reporting and in OLAP, we can simply write the same by using the CUBE() function:

select 
  code, year, max(gdp_per_capita), 
  grouping_id(code, year) grp
from countries
where gdp_per_capita >= 48000
group by cube(code, year)
order by grp desc;

Compatibility

While the first couple of queries also worked on PostgreSQL, the ones that are using GROUPING SETS will work only on 4 out of 17 RDBMS currently supported by jOOQ. These are:

  • DB2
  • Oracle
  • SQL Server
  • Sybase SQL Anywhere

jOOQ also fully supports the previously mentioned syntaxes. The GROUPING SETS variant can be written as such:

// Countries is an object generated by the jOOQ
// code generator for the COUNTRIES table.
Countries c = COUNTRIES;

ctx.select(
       c.CODE,
       c.YEAR,
       max(c.GDP_PER_CAPITA),
       groupingId(c.CODE, c.YEAR).as("grp"))
   .from(c)
   .where(c.GDP_PER_CAPITA.ge(new BigDecimal("48000")))
   .groupBy(groupingSets(new Field[][] {
       {},
       { c.CODE },
       { c.YEAR },
       { c.CODE, c.YEAR }
   }))
   .orderBy(fieldByName("grp").desc())
   .fetch();

… or the CUBE() version:

ctx.select(
       c.CODE,
       c.YEAR,
       max(c.GDP_PER_CAPITA),
       groupingId(c.CODE, c.YEAR).as("grp"))
   .from(c)
   .where(c.GDP_PER_CAPITA.ge(new BigDecimal("48000")))
   .groupBy(cube(c.CODE, c.YEAR))
   .orderBy(fieldByName("grp").desc())
   .fetch();

jOOQ is the best way to write SQL in Java

… and in the future, we’ll emulate GROUPING SETS by their equivalent UNION ALL queries in those databases that do not natively support GROUPING SETS.

Try it out for yourself and download your free jOOQ trial now!

Row value expressions and the BETWEEN predicate

Now this is a simple example of how SQL clause simulation can get nasty if you want to make use of some more advanced SQL clauses that aren’t supported in all databases. Consider the following predicate and equivalent transformations thereof:

The BETWEEN predicate

The BETWEEN predicate is a convenient form of expressing the fact that one expression A should be in BETWEEN two other expressions B and C. This predicate was defined already in §8.4 of SQL-1992, and then refined in SQL-1999 (adding ASYMMETRIC/SYMMETRIC):

8.3 <between predicate>

Function
    Specify a range comparison.

Format
    <between predicate> ::=
        <row value expression> [ NOT ] BETWEEN 
          [ ASYMMETRIC | SYMMETRIC ]
          <row value expression> AND <row value expression>

While ASYMMETRIC is just a verbose way of expressing the default behaviour of the BETWEEN predicate, SYMMETRIC has the useful property of indicating that the order of B and C is irrelevant. Knowing this, the following transformations can be established:

BETWEEN predicate transformations

The following statements are all equivalent:

A BETWEEN SYMMETRIC B AND C
(A BETWEEN B AND C) OR (A BETWEEN C AND B)
(A >= B AND A <= C) OR (A >= C AND A <= B)

While this is still somewhat readable, try adding row value expressions:

-- The original statement
(A1, A2) BETWEEN SYMMETRIC (B1, B2) AND (C1, C2)

-- Transforming away BETWEEN SYMMETRIC
   (     (A1, A2) >= (B1, B2) 
     AND (A1, A2) <= (C1, C2) )
OR (     (A1, A2) >= (C1, C2) 
     AND (A1, A2) <= (B1, B2) )

-- Transforming away the row value expressions
   (     ((A1 > B1) OR (A1 = B1 AND A2 > B2) OR (A1 = B1 AND A2 = B2))
     AND ((A1 < C1) OR (A1 = C1 AND A2 < C2) OR (A1 = C1 AND A2 = C2)) )
OR (     ((A1 > C1) OR (A1 = C1 AND A2 > C2) OR (A1 = C1 AND A2 = C2))
     AND ((A1 < B1) OR (A1 = B1 AND A2 < B2) OR (A1 = B1 AND A2 = B2)) )

In the lowest expression, some parts could’ve been factored out for “simplicity”. The example is just to give you a picture of what the BETWEEN [SYMMETRIC] predicate really does to row value expressions.

Native SQL support for row value expressions and BETWEEN SYMMETRIC

Here’s a comprehensive list of the 14 SQL dialects supported by jOOQ, and what is natively supported by them:

Database BETWEEN SYMMETRIC RVE = RVE RVE < RVE RVE BETWEEN
CUBRID [1] no yes no no
DB2 no yes yes yes
Derby no no no no
Firebird no no no no
H2 [2] no yes yes yes
HSQLDB yes yes yes yes
Ingres yes no no no
MySQL no yes yes no
Oracle no yes no no
Postgres yes yes yes yes
SQL Server no no no no
SQLite no no no no
Sybase ASE no no no no
Sybase SQL Anywhere no no no no

Explanation:

  • The BETWEEN SYMMETRIC column indicates, whether the database supports the SYMMETRIC keyword in general
  • The RVE = RVE column indicates, whether the database supports row value expressions in general (e.g. in equal comparison predicates)
  • The RVE < RVE column indicates, whether the database supports “ordering” comparison predicates (<, <=, >, >=) along with row value expressions
  • The RVE BETWEEN column indicates, whether the database supports the BETWEEN predicates along with row value expressions

Footnotes:

  • [1]: CUBRID doesn’t really support row value expressions. What looks like a RVE is in fact a SET in CUBRID
  • [2]: H2 doesn’t really support row value expressions. What looks like a RVE is in fact an ARRAY in H2

Row value expressions and the NULL predicate

Row value expressions are something very powerful in SQL. They have been around since the early days of standard SQL, e.g. in SQL 1992, even if not all databases implement them correctly, still today.

As always, the NULL predicate is one that is a bit tricky to understand, also in the context of row value expressions. Have a look at the following expressions:

(A, B) IS NULL
(A, B) IS NOT NULL

The SQL 1992 standard defines that:

8.6  

General Rules

1) Let R be the value of the <row value constructor>.

2) If all the values in R are the null value, then "R IS NULL" is
   true; otherwise, it is false.

3) If none of the values in R are the null value, then "R IS NOT
   NULL" is true; otherwise, it is false.

   Note: For all R, "R IS NOT NULL" has the same result as "NOT
   R IS NULL" if and only if R is of degree 1. Table 12, "<null
   predicate> semantics", specifies this behavior.

Pay some special attention to paragraph 3). Yes, the following two predicates are NOT equivalent!

    (A, B) IS NOT NULL
NOT((A, B) IS     NULL)

This is easy to understand, when factoring out the rules of equivalency:

(A, B) IS NOT NULL                -- equivalent to...
A IS NOT NULL AND B IS NOT NULL   -- equivalent to...
NOT(A IS NULL) AND NOT(B IS NULL) -- equivalent to...
NOT(A IS NULL OR B IS NULL)

-- whereas...
NOT((A, B) IS NULL)               -- equivalent to...
NOT(A IS NULL AND B IS NULL)

The truth table also nicely documents this:

+----------------+-------+-------------+------------+--------------+
|                | R IS  | R IS NOT    | NOT R IS   | NOT R IS NOT |
| Expression     | NULL  | NULL        | NULL       | NULL         |
+----------------+-------+-------------+------------+--------------+
| degree 1: null | true  | false       | false      |  true        |
| degree 1: not  | false | true        | true       |  false       |
| null           |       |             |            |              |
| degree > 1:    | true  | false       | false      |  true        |
| all null       |       |             |            |              |
| degree > 1:    | false | false       | true       |  true        |
| some null      |       |             |            |              |
| degree > 1:    | false | true        | true       |  false       |
| none null      |       |             |            |              |
+----------------+-------+-------------+------------+--------------+

More on row value expressions

jOOQ 3.0 will introduce formal and typesafe support for row value expressions and predicates based thereupon. Stay tuned for more interesting insight and articles about row vlaue expressions and how they’re supported (and/or simulated) in various SQL dialects

Publicly available SQL standard documents

For your reference, here is a list of publicly available SQL standard documents. Some are very late draft versions, where the final version is closed source and has to be purchased from ISO or ANSI directly:

The links were taken from here:

http://wiki.postgresql.org/wiki/Developer_FAQ

SQL incompatibilities: NOT IN and NULL values

This is something where many hours of debugging have been spent in the lives of many SQL developers. The various situations where you can have NULL values in NOT IN predicates and anti-joins. Here’s a typical situation:

with data as (
  select 1 as id from dual union all
  select 2 as id from dual union all
  select 3 as id from dual
)
select * from data
where id not in (1, null)

What do you think this will return? Well, since “dual” indicates an Oracle database, you might say: “an empty result set”. And you would be right for Oracle. In fact, you would be right for any of these databases:

  • DB2
  • Derby
  • H2
  • Ingres
  • Oracle
  • Postgres
  • SQL Server
  • SQLite
  • Sybase

BUT! You would be wrong for any of these ones:

  • HSQLDB
  • MySQL
  • Sybase ASE

Why the discrepancy?

Intuitively, you’d say that all the big ones treat NULL specially in NOT IN predicates, and it is easy to understand, why:

-- This predicate here...
id not in (1, null)

-- Could be seen as equivalent to this one:
id != 1 and id != null

There’s no id that fulfills the above predicate id != null (not even null itself), hence an empty result set. MySQL is known for some strong abuse of SQL standards compliance, so it’s not surprising that they tweaked this syntax as well.

But wait!

HSQLDB 2.0 is one of the most standards-compliant databases out there, could they really have gotten it wrong? Let’s consider the standard: SQL 1992, chapter 8.4 <in predicate>:

<in predicate> ::=
   <row value constructor>
      [ NOT ] IN <in predicate value>

<in predicate value> ::=
   <table subquery>
      | <left paren> <in value list> <right paren>

<in value list> ::=
   <value expression> { <comma> <value expression> }...

 

And then, further down:

2) Let RVC be the <row value constructor> and 
   let IPV be the <in predicate value>.

3) The expression
     RVC NOT IN IPV

   is equivalent to
     NOT ( RVC IN IPV )

4) The expression
     RVC IN IPV

   is equivalent to
     RVC = ANY IPV

 

So in fact, this can be said:

ID NOT IN (1, NULL) is equivalent to
NOT (ID IN (1, NULL)), equivalent to
NOT (ID = ANY(1, NULL)), equivalent to
NOT (ID = 1 OR ID = NULL), equivalent to
NOT (ID = 1) AND NOT (ID = NULL), which is always UNKNOWN

Conclusion

It looks for once, that HSQLDB 2.0 is not standards-compliant in that evaluating the expression inside NOT() before applying NOT() has a different outcome from transforming NOT() into a normalised boolean expression, and then evaluating the expression. For SQL developers, all of this can just mean:

Keep NULL out of NOT IN predicates or be doomed!