Do You Really Understand SQL’s GROUP BY and HAVING clauses?

Posted on December 4, 2014September 29, 2016 by lukaseder

There are some things in SQL that we simply take for granted without thinking about them properly. One of these things are the GROUP BY and the less popular HAVING clauses. Let’s look at a simple example. For this example, we’ll reiterate the example database we’ve seen in this previous article about the awesome LEAD(), LAG(), FIRST_VALUE(), LAST_VALUE() functions:


CREATE TABLE countries (
  code CHAR(2) NOT NULL,
  year INT NOT NULL,
  gdp_per_capita DECIMAL(10, 2) NOT NULL,
  govt_debt DECIMAL(10, 2) NOT NULL
);

Before there were window functions, aggregations were made only with GROUP BY. A typical question that we could ask our database using SQL is:

What are the top 3 average government debts in percent of the GDP for those countries whose GDP per capita was over 40’000 dollars in every year in the last four years

Whew. Some (academic) business requirements. In SQL (PostgreSQL dialect), we would write:


select code, avg(govt_debt)
from countries
where year > 2010
group by code
having min(gdp_per_capita) >= 40000
order by 2 desc
limit 3

Or, with inline comments


-- The average government debt
select code, avg(govt_debt)

-- for those countries
from countries

-- in the last four years
where year > 2010

-- yepp, for the countries
group by code

-- whose GDP p.c. was over 40'000 in every year
having min(gdp_per_capita) >= 40000

-- The top 3
order by 2 desc
limit 3

The result being:

code     avg
------------
JP    193.00
US     91.95
DE     56.00

Remember the 10 easy steps to a complete understanding of SQL:

FROM generates the data set
WHERE filters the generated data set
GROUP BY aggregates the filtered data set
HAVING filters the aggregated data set
SELECT transforms the filters aggregated data set
ORDER BY sorts the transformed data set
LIMIT .. OFFSET frames the sorted data set

… where LIMIT .. OFFSET may come in very different flavours.

The empty GROUP BY clause

A very special case of GROUP BY is the explicit or implicit empty GROUP BY clause. Here’s a question that we could ask our database:

Are there any countries at all with a GDP per capita of more than 50’000 dollars?

And in SQL, we’d write:


select true answer
from countries
having max(gdp_per_capita) >= 50000

The result being

answer
------
t

You could of course have used the EXISTS clause instead (please don’t use COUNT(*) in these cases):


select exists(
  select 1 
  from countries 
  where gdp_per_capita >= 50000
);

And we would get, again:

answer
------
t

… but let’s focus on the plain HAVING clause. Not everyone knows that HAVING can be used all by itself, or what it even means to have HAVING all by itself. Already the SQL 1992 standard allowed for the use of HAVING without GROUP BY, but it wasn’t until the introduction of GROUPING SETS in SQL:1999, when the semantics of this syntax was retroactively unambiguously defined:

7.10 <having clause>
<having clause> ::= HAVING <search condition>
Syntax Rules 1) Let HC be the <having clause>. Let TE be the <table expression> that immediately contains HC. If TE does not immediately contain a <group by clause>, then GROUP BY ( ) is implicit.

That’s interesting. There is an implicit GROUP BY ( ), if we leave out the explicit GROUP BY clause. If you’re willing to delve into the SQL standard a bit more, you’ll find:

<group by clause> ::=
    GROUP BY <grouping specification>

<grouping specification> ::=
    <grouping column reference>
  | <rollup list>
  | <cube list>
  | <grouping sets list>
  | <grand total>
  | <concatenated grouping>

<grouping set> ::=
    <ordinary grouping set>
  | <rollup list>
  | <cube list>
  | <grand total>

<grand total> ::= <left paren> <right paren>

So, GROUP BY ( ) is essentially grouping by a “grand total”, which is what’s intuitively happening, if we just look for the highest ever GDP per capita:


select max(gdp_per_capita)
from countries;

Which yields:

     max
--------
52409.00

The above query is also implicitly the same as this one (which isn’t supported by PostgreSQL):


select max(gdp_per_capita)
from countries
group by ();

(but beware, this isn’t always the case – read this interesting article by Glenn Paulley for details)

The awesome GROUPING SETs

In this section of the article, we’ll be leaving PostgreSQL land, entering SQL Server land, as PostgreSQL shamefully doesn’t implement any of the following (yet). Now, we cannot understand the grand total (empty GROUP BY ( ) clause), without having a short look at the SQL:1999 standard GROUPING SETS. Some of you may have heard of CUBE() or ROLLUP() grouping functions, which are just syntactic sugar for commonly used GROUPING SETS. Let’s try to answer this question in a single query:

What are the highest GDP per capita values per year OR per country

In SQL, we’ll write:


select code, year, max(gdp_per_capita)
from countries
group by grouping sets ((code), (year))

Which yields two concatenated sets of records:

code    year    max
------------------------
NULL    2009    46999.00 <- grouped by year
NULL    2010    48358.00
NULL    2011    51791.00
NULL    2012    52409.00

CA      NULL    52409.00 <- grouped by code
DE      NULL    44355.00
FR      NULL    42578.00
GB      NULL    38927.00
IT      NULL    36988.00
JP      NULL    46548.00
RU      NULL    14091.00
US      NULL    51755.00

That’s kind of nice, isn’t it? It’s essentially just the same thing as this query with UNION ALL


select code, null, max(gdp_per_capita)
from countries
group by code
union all
select null, year, max(gdp_per_capita)
from countries
group by year;

In fact, it’s exactly the same thing, as the latter explicitly concatenates two sets of grouped records… i.e. two GROUPING SETS. This SQL Server documentation page also explains it very nicely.

And the most powerful of them all: CUBE()

Now, imagine, you’d like to add the “grand total”, and also the highest value per country AND year, producing four different concatenated sets. To limit the results, we’ll also filter out GDPs of less than 48000 for this example:


select 
  code, year, max(gdp_per_capita), 
  grouping_id(code, year) grp
from countries
where gdp_per_capita >= 48000
group by grouping sets (
  (), 
  (code), 
  (year), 
  (code, year)
)
order by grp desc;

This nice-looking query will now produce all the possible grouping combinations that we can imagine, including the grand total, in order to produce:

code    year    max         grp
---------------------------------
NULL    NULL    52409.00    3 <- grand total

NULL    2012    52409.00    2 <- group by year
NULL    2010    48358.00    2
NULL    2011    51791.00    2

CA      NULL    52409.00    1 <- group by code
US      NULL    51755.00    1

US      2010    48358.00    0 <- group by code and year
CA      2012    52409.00    0
US      2012    51755.00    0
CA      2011    51791.00    0
US      2011    49855.00    0

And because this is quite a common operation in reporting and in OLAP, we can simply write the same by using the CUBE() function:


select 
  code, year, max(gdp_per_capita), 
  grouping_id(code, year) grp
from countries
where gdp_per_capita >= 48000
group by cube(code, year)
order by grp desc;

Compatibility

While the first couple of queries also worked on PostgreSQL, the ones that are using GROUPING SETS will work only on 4 out of 17 RDBMS currently supported by jOOQ. These are:

DB2
Oracle
SQL Server
Sybase SQL Anywhere

jOOQ also fully supports the previously mentioned syntaxes. The GROUPING SETS variant can be written as such:


// Countries is an object generated by the jOOQ
// code generator for the COUNTRIES table.
Countries c = COUNTRIES;

ctx.select(
       c.CODE,
       c.YEAR,
       max(c.GDP_PER_CAPITA),
       groupingId(c.CODE, c.YEAR).as("grp"))
   .from(c)
   .where(c.GDP_PER_CAPITA.ge(new BigDecimal("48000")))
   .groupBy(groupingSets(new Field[][] {
       {},
       { c.CODE },
       { c.YEAR },
       { c.CODE, c.YEAR }
   }))
   .orderBy(fieldByName("grp").desc())
   .fetch();

… or the CUBE() version:


ctx.select(
       c.CODE,
       c.YEAR,
       max(c.GDP_PER_CAPITA),
       groupingId(c.CODE, c.YEAR).as("grp"))
   .from(c)
   .where(c.GDP_PER_CAPITA.ge(new BigDecimal("48000")))
   .groupBy(cube(c.CODE, c.YEAR))
   .orderBy(fieldByName("grp").desc())
   .fetch();

jOOQ is the best way to write SQL in Java

… and in the future, we’ll emulate GROUPING SETS by their equivalent UNION ALL queries in those databases that do not natively support GROUPING SETS. Try it out for yourself and download your free jOOQ trial now!

Published by lukaseder

I made jOOQ View all posts by lukaseder

23 thoughts on “Do You Really Understand SQL’s GROUP BY and HAVING clauses?”

javarevisited says:

December 4, 2014 at 17:45

Nice post Lukas. Add a WHERE clause together with HAVING and it will be more confusing :) By the way nice to know about GROUPING SETS though, thanks.

Loading...

Reply
1. lukaseder says:
  
  December 4, 2014 at 17:57
  
  Well, the distinction between WHERE and HAVING is actually quite clear. One is applied before GROUP BY, the other after GROUP BY…
  
  Loading...
  
  Reply
  1. javarevisited says:
    
    December 5, 2014 at 18:54
    
    Indeed
    
    Loading...
    
    Reply
Alexander Shopov says:

December 7, 2014 at 12:13

You may mention that rows produced by cube, rollup and grouping sets are called superaggregates and additionally stress on the importance of proper ‘ORDER BY’ clause.

Loading...

Reply
1. lukaseder says:
  
  December 7, 2014 at 20:07
  
  Interesting. I haven’t found a reference to the term “superaggregate” in the SQL:2011 standard document, but some people do indeed seem to refer to those rows as super aggregates. Do you know anything about the origins of that term?
  
  Loading...
  
  Reply
Bubumuk says:

December 12, 2014 at 17:37

Nice post! I learned a lot of new things.

Loading...

Reply
Sandeep (@testanalystat) says:

May 13, 2015 at 13:11
Hey, Very nice post. Clear and concise. When I read the scenario

What are the top 3 average government debts in percent of the GDP for those countries whose GDP per capita was over 40’000 dollars in every year in the last four years

I started creating my own query (before looking at yours) and here is mine.

Data I inserted:
```
1      IN	2009	3000	279.22
2	XL	2014	96000	979.22	
3	US	2010	30000	569.22	
4	BE	2011	300000	479.22	
5	AU	2012	69000	779.22	
6	EU	2013	36000	879.22	
7	AU	2014	69000	1779.22	
8	BC	2015	690000	2779.22	
9	XL	2014	45000	3779.22
```
My query which prepared and ran against the data
```
select code, avg(govt_debt) ReturnedValue 
from countries 
where gdp_per_capita > 40000 
and year in (2011,2012,2013,2014,2015) 
group by code 
order by ReturnedValue desc
```
Output is
```
BC	2779.22
XL	2379.22
AU	1279.22
BE	479.22
```
When I used your query
```
select code, avg(govt_debt)
from countries
where year > 2010
group by code
having min(gdp_per_capita) >= 40000
order by 2 desc
```
Output is
```
BC	2779.22
XL	2379.22
AU	1279.22
BE	479.22
```
So no difference. But I am not saying my query is correct (it may be false positive).

Q1. Actually, I find it bit difficult to think about involving Having clause while creating the query May you please clarify where I am wrong?

Q2. What is the ‘In percent of the GDP’ means?

Please advise

Loading...
Reply
1. lukaseder says:
  
  May 13, 2015 at 16:52
  Hi there.
  
  Q1: Your query returns the same by accident:
  - In your case (predicate in WHERE clause), the filtering takes place BEFORE the grouping. I.e. if there are entries with gdp_per_capita <= 40000, only those would be filtered out, but the rest would still be grouped.
  - In the case of the article (predicate in HAVING clause), the filtering takes place AFTER the grouping. I.e. if there are any entries with gdp_per_capita <= 40000, the whole group would be removed from the result.
  Q2: “In percent of the GDP” doesn’t really mean anything specific. The data used here (govt_debt) is simply the debt in percent of the GDP, i.e that percentage is already in the govt_debt column. I guess that’s not really clear in the blog post… For more details, consider this blog post: https://blog.jooq.org/2014/11/07/dont-miss-out-on-awesome-sql-power-with-first_value-last_value-lead-and-lag
  
  Loading...
  Reply
  1. Vaishali says:
    
    October 9, 2015 at 21:33
    hii.. this is a very nice post..
    I am a new-bee in sql..
    I am a little bit confused in group by clause.
    
    qsn- what is the difference overall in following 2 queries?
    
    select code, avg(govt_debt) ReturnedValue from countries where min(gdp per capita)>40000 and year in (2011,2012,2013,2014,2015) group by code order by ReturnedValue desc;
    
    and
    
    select code, avg(govt_debt) from countries where year > 2010 group by code having min(gdp_per_capita) >= 40000 order by 2 desc;
    
    Loading...
    Reply
    1. lukaseder says:
      
      October 10, 2015 at 08:52
      
      The first one doesn’t work. You cannot have aggregate functions in your WHERE clause
      
      Loading...
      
      Reply
  2. nsawaya says:
    
    March 10, 2016 at 14:38
    
    Hey, first thanks for the great blog posts.
    I have a question regarding your answer,
    does the first point mean that Sandeep’s query does not fulfill the “every year” condition??
    
    Loading...
    
    Reply
    1. lukaseder says:
      
      March 10, 2016 at 14:40
      
      More or less, yes
      
      Loading...
      
      Reply
Robert Calhouno says:

May 27, 2015 at 17:53

Filters the dataset, not reduce.

Loading...

Reply
1. lukaseder says:
  
  May 27, 2015 at 18:05
  
  You’re right, that’s more idiomatic wording… Specifically as “reduce” is often used as synonym for “aggregate”
  
  Loading...
  
  Reply
Heikki Rauhala says:

May 30, 2016 at 12:04

PostgreSQL 9.5 now includes the CUBE syntax.

Loading...

Reply
Brenda Walker says:

June 24, 2016 at 03:36
Not sure if i am posting in the right place … if not, please bear withme :)

I’m trying to use a WHERE NOT EXIST where both tables include a GROUP BY …. I think I’m lost… please help.
```
select  a.FGBTRNH_DOC_CODE
        from FGBTRNH a
        where a.FGBTRNH_DOC_CODE LIKE 'J16%'
        group by a.FGBTRNH_DOC_CODE, a.FGBTRNH_RUCL_CODE
        order by FGBTRNH_DOC_CODE

WHERE NOT EXISTS

        (
        select  distinct b.FGBTRNH_DOC_CODE
        from FGBTRNH b
        where b.FGBTRNH_DOC_CODE LIKE 'J16%'
        and   b.FGBTRNH_SEQ_NUM = 1
        group by b.FGBTRNH_DOC_CODE, a.FGBTRNH_RUCL_CODE  
        order by b.FGBTRNH_DOC_CODE
        )
 ;
```
Loading...
Reply
1. lukaseder says:
  
  June 24, 2016 at 22:22
  
  Not sure if i am posting in the right place
  
  Perhaps you are, although Stack Overflow would almost certainly be a better place. In any case, I haven’t seen any question. Hard to help withouth that…?
  
  Loading...
  
  Reply
brendadwalker says:

June 24, 2016 at 22:32

Obviously … I am new to SQL … How do I properly code this …

Question 1: Find SET A where not exist in SET B

Set A:
select distinct a.FGBTRNH_DOC_CODE
from FGBTRNH a
where a.FGBTRNH_DOC_CODE LIKE ‘J16%’
group by a.FGBTRNH_DOC_CODE, a.FGBTRNH_RUCL_CODE
order by FGBTRNH_DOC_CODE

Set B:
select distinct b.FGBTRNH_DOC_CODE
from FGBTRNH b
where b.FGBTRNH_DOC_CODE LIKE ‘J16%’
and b.FGBTRNH_SEQ_NUM = 1
group by b.FGBTRNH_DOC_CODE, a.FGBTRNH_RUCL_CODE
order by b.FGBTRNH_DOC_CODE

Question 2: How do i distinguish WHEN to use: NOT EXIST IN vs. NOT IN vs. MINUS

Loading...

Reply
1. lukaseder says:
  
  June 24, 2016 at 22:53
  
  It’s true, it is a question best suited for https://stackoverflow.com. I highly recommend asking it there.
  
  Loading...
  
  Reply
p0nd says:

September 29, 2016 at 08:45

There is a dead link in this article:

10-easy-steps-to-a-complete-understanding-of-sql

Loading...

Reply
1. lukaseder says:
  
  September 29, 2016 at 19:09
  
  Thank you very much for letting us know. Indeed, we’ve migrated that post to our own blog here:
  
  https://blog.jooq.org/2016/03/17/10-easy-steps-to-a-complete-understanding-of-sql/
  
  Loading...
  
  Reply
Andy says:

January 31, 2018 at 08:34

Great article! Can you add INSERT statements with test data to this article. It’s so much earlier to learn this stuff when you can create the database and run the sql queries discussed in the article. But without data it’s more difficult…

Thanks

Loading...

Reply
1. lukaseder says:
  
  January 31, 2018 at 08:53
  
  Thanks. Newer articles all use the https://www.jooq.org/sakila schema, but the old ones don’t. But you’re lucky, the schema (and data) for this article can be found here: https://github.com/jOOQ/jOOQ/blob/master/jOOQ-examples/jOOQ-javafx-example/src/main/resources/db-h2.sql
  
  Loading...
  
  Reply

The empty GROUP BY clause

The awesome GROUPING SETs

And the most powerful of them all: CUBE()

Compatibility

Like this:

Published by lukaseder

23 thoughts on “Do You Really Understand SQL’s GROUP BY and HAVING clauses?”

Leave a ReplyCancel reply

Discover more from Java, SQL and jOOQ.