Do You Really Understand SQL’s GROUP BY and HAVING clauses?

There are some things in SQL that we simply take for granted without thinking about them properly.

One of these things is the GROUP BY clause, along with the less popular HAVING clause. Let’s look at a simple example. For this example, we’ll reuse the example database we’ve seen in this previous article about the awesome LEAD(), LAG(), FIRST_VALUE(), LAST_VALUE() functions:

CREATE TABLE countries (
  code CHAR(2) NOT NULL,
  year INT NOT NULL,
  gdp_per_capita DECIMAL(10, 2) NOT NULL,
  govt_debt DECIMAL(10, 2) NOT NULL
);

Before there were window functions, aggregations were made only with GROUP BY. A typical question that we could ask our database using SQL is:

What are the top 3 average government debts in percent of the GDP for those countries whose GDP per capita was over 40’000 dollars in every year in the last four years?

Whew. Some (academic) business requirements.

In SQL (PostgreSQL dialect), we would write:

select code, avg(govt_debt)
from countries
where year > 2010
group by code
having min(gdp_per_capita) >= 40000
order by 2 desc
limit 3

Or, with inline comments:

-- The average government debt
select code, avg(govt_debt)

-- for those countries
from countries

-- in the last four years
where year > 2010

-- yepp, for the countries
group by code

-- whose GDP p.c. was over 40'000 in every year
having min(gdp_per_capita) >= 40000

-- The top 3
order by 2 desc
limit 3

The result being:

code     avg
------------
JP    193.00
US     91.95
DE     56.00

Remember the 10 easy steps to a complete understanding of SQL:

  1. FROM generates the data set
  2. WHERE filters the generated data set
  3. GROUP BY aggregates the filtered data set
  4. HAVING filters the aggregated data set
  5. SELECT transforms the filtered aggregated data set
  6. ORDER BY sorts the transformed data set
  7. LIMIT .. OFFSET frames the sorted data set

… where LIMIT .. OFFSET may come in very different flavours.
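
The logical order listed above can be tried out with a small script. What follows is only a sketch: it uses Python's bundled sqlite3 module as a stand-in for PostgreSQL, and the rows in ROWS are invented for illustration (they are not the data behind the result shown earlier). The helper name top_debts is likewise hypothetical.

```python
import sqlite3

# Hypothetical figures, invented for illustration only.
ROWS = [
    ("AA", 2011, 45000.0, 60.0),
    ("AA", 2012, 47000.0, 70.0),
    ("BB", 2011, 39000.0, 90.0),  # one year below the GDP threshold
    ("BB", 2012, 42000.0, 95.0),
    ("CC", 2010, 30000.0, 10.0),  # removed by WHERE before grouping
    ("CC", 2011, 50000.0, 20.0),
    ("CC", 2012, 52000.0, 30.0),
]


def top_debts(rows):
    """Run the article's query against an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE countries ("
        " code TEXT NOT NULL, year INT NOT NULL,"
        " gdp_per_capita REAL NOT NULL, govt_debt REAL NOT NULL)"
    )
    conn.executemany("INSERT INTO countries VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT code, avg(govt_debt)          -- 5. transform each group
        FROM countries                       -- 1. generate the data set
        WHERE year > 2010                    -- 2. filter rows
        GROUP BY code                        -- 3. aggregate
        HAVING min(gdp_per_capita) >= 40000  -- 4. filter groups
        ORDER BY 2 DESC                      -- 6. sort
        LIMIT 3                              -- 7. frame
        """
    ).fetchall()
    conn.close()
    return result


print(top_debts(ROWS))
```

Note how the CC row for 2010 never reaches the HAVING clause: WHERE has already removed it, so CC's minimum is computed over 2011 and 2012 only. BB, in contrast, survives WHERE but is dropped as a whole group by HAVING.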

The empty GROUP BY clause

A very special case of GROUP BY is the explicit or implicit empty GROUP BY clause. Here’s a question that we could ask our database:

Are there any countries at all with a GDP per capita of more than 50’000 dollars?

And in SQL, we’d write:

select true answer
from countries
having max(gdp_per_capita) >= 50000

The result being:

answer
------
t

You could of course have used the EXISTS clause instead (please don’t use COUNT(*) in these cases):

select exists(
  select 1 
  from countries 
  where gdp_per_capita >= 50000
);

And we would get, again:

answer
------
t
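
One practical difference between the two forms is worth seeing in action: EXISTS always returns exactly one row, whereas the HAVING form returns an empty result when the answer is "no". The sketch below covers only the EXISTS variant, using Python's sqlite3 module as a stand-in for PostgreSQL. The function name any_rich_country is hypothetical, and SQLite reports the boolean as 1 or 0, which the code converts.

```python
import sqlite3


def any_rich_country(gdp_values, threshold=50000):
    """Return True if any stored GDP per capita reaches the threshold."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE countries (gdp_per_capita REAL NOT NULL)")
    conn.executemany(
        "INSERT INTO countries VALUES (?)", [(g,) for g in gdp_values]
    )
    # EXISTS yields a single row holding 1 or 0, even for an empty table.
    (answer,) = conn.execute(
        "SELECT EXISTS (SELECT 1 FROM countries WHERE gdp_per_capita >= ?)",
        (threshold,),
    ).fetchone()
    conn.close()
    return bool(answer)


print(any_rich_country([46999.0, 52409.0]))
print(any_rich_country([]))
```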

… but let’s focus on the plain HAVING clause.

Not everyone knows that HAVING can be used all by itself, or what it even means to have HAVING all by itself. The SQL 1992 standard already allowed the use of HAVING without GROUP BY, but it wasn’t until the introduction of GROUPING SETS in SQL:1999 that the semantics of this syntax were retroactively and unambiguously defined:

7.10 <having clause>

<having clause> ::= HAVING <search condition>

Syntax Rules

1) Let HC be the <having clause>. Let TE be the <table expression> that immediately contains
HC. If TE does not immediately contain a <group by clause>, then GROUP BY ( ) is implicit.

That’s interesting. There is an implicit GROUP BY ( ), if we leave out the explicit GROUP BY clause. If you’re willing to delve into the SQL standard a bit more, you’ll find:

<group by clause> ::=
    GROUP BY <grouping specification>

<grouping specification> ::=
    <grouping column reference>
  | <rollup list>
  | <cube list>
  | <grouping sets list>
  | <grand total>
  | <concatenated grouping>

<grouping set> ::=
    <ordinary grouping set>
  | <rollup list>
  | <cube list>
  | <grand total>

<grand total> ::= <left paren> <right paren>

So, GROUP BY ( ) is essentially grouping by a “grand total”, which is what’s intuitively happening, if we just look for the highest ever GDP per capita:

select max(gdp_per_capita)
from countries;

Which yields:

     max
--------
52409.00

The above query is also implicitly the same as this one (which isn’t supported by PostgreSQL):

select max(gdp_per_capita)
from countries
group by ();

(but beware, this isn’t always the case – read this interesting article by Glenn Paulley for details)
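
One property of the grand total that is easy to verify: an aggregation without GROUP BY always produces exactly one row, even on an empty table, while grouping by a column produces one row per group and therefore no rows at all on an empty table. This is not necessarily the subtlety the linked article discusses. It is a minimal sketch using Python's sqlite3 module, with a hypothetical helper name and a reduced two-column table.

```python
import sqlite3


def grand_total_vs_grouped(rows):
    """Return (grand total rows, per-code rows) for the given sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE countries (code TEXT NOT NULL, gdp_per_capita REAL NOT NULL)"
    )
    conn.executemany("INSERT INTO countries VALUES (?, ?)", rows)
    # No GROUP BY: the whole table forms a single group, the grand total.
    grand_total = conn.execute(
        "SELECT max(gdp_per_capita) FROM countries"
    ).fetchall()
    # GROUP BY code: one row per distinct code.
    per_code = conn.execute(
        "SELECT code, max(gdp_per_capita) FROM countries "
        "GROUP BY code ORDER BY code"
    ).fetchall()
    conn.close()
    return grand_total, per_code


print(grand_total_vs_grouped([("CA", 52409.0), ("CA", 51791.0), ("US", 51755.0)]))
print(grand_total_vs_grouped([]))
```

On the empty table, the grand total is a single row containing NULL (None in Python), and the grouped query returns nothing.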

The awesome GROUPING SETS

In this section of the article, we’ll be leaving PostgreSQL land, entering SQL Server land, as PostgreSQL shamefully doesn’t implement any of the following (yet).

Now, we cannot understand the grand total (empty GROUP BY ( ) clause), without having a short look at the SQL:1999 standard GROUPING SETS. Some of you may have heard of CUBE() or ROLLUP() grouping functions, which are just syntactic sugar for commonly used GROUPING SETS. Let’s try to answer this question in a single query:

What are the highest GDP per capita values per year OR per country?

In SQL, we’ll write:

select code, year, max(gdp_per_capita)
from countries
group by grouping sets ((code), (year))

Which yields two concatenated sets of records:

code    year    max
------------------------
NULL    2009    46999.00 <- grouped by year
NULL    2010    48358.00
NULL    2011    51791.00
NULL    2012    52409.00

CA      NULL    52409.00 <- grouped by code
DE      NULL    44355.00
FR      NULL    42578.00
GB      NULL    38927.00
IT      NULL    36988.00
JP      NULL    46548.00
RU      NULL    14091.00
US      NULL    51755.00

That’s kind of nice, isn’t it? It’s essentially just the same thing as this query with UNION ALL:

select code, null, max(gdp_per_capita)
from countries
group by code
union all
select null, year, max(gdp_per_capita)
from countries
group by year;

In fact, it’s exactly the same thing, as the latter explicitly concatenates two sets of grouped records… i.e. two GROUPING SETS. This SQL Server documentation page also explains it very nicely.
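
The UNION ALL form has the advantage of running on databases without GROUPING SETS. As a minimal sketch, here it is executed on SQLite (which has no GROUPING SETS) through Python's sqlite3 module. The four sample rows are a small subset of figures that appear in this article's result listings, and the function name is hypothetical. The ORDER BY relies on SQLite sorting NULL before other values.

```python
import sqlite3

# A small subset of (code, year, gdp_per_capita) figures from the article.
ROWS = [
    ("CA", 2011, 51791.0),
    ("CA", 2012, 52409.0),
    ("US", 2011, 49855.0),
    ("US", 2012, 51755.0),
]


def max_per_code_or_year(rows):
    """Emulate GROUPING SETS ((code), (year)) with two grouped queries."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE countries ("
        " code TEXT NOT NULL, year INT NOT NULL, gdp_per_capita REAL NOT NULL)"
    )
    conn.executemany("INSERT INTO countries VALUES (?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT code, NULL AS year, max(gdp_per_capita)
        FROM countries
        GROUP BY code
        UNION ALL
        SELECT NULL, year, max(gdp_per_capita)
        FROM countries
        GROUP BY year
        ORDER BY 1, 2
        """
    ).fetchall()
    conn.close()
    return result


for row in max_per_code_or_year(ROWS):
    print(row)
```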

And the most powerful of them all: CUBE()

Now, imagine, you’d like to add the “grand total”, and also the highest value per country AND year, producing four different concatenated sets. To limit the results, we’ll also filter out GDPs of less than 48000 for this example:

select 
  code, year, max(gdp_per_capita), 
  grouping_id(code, year) grp
from countries
where gdp_per_capita >= 48000
group by grouping sets (
  (), 
  (code), 
  (year), 
  (code, year)
)
order by grp desc;

This nice-looking query will now produce all the possible grouping combinations that we can imagine, including the grand total, in order to produce:

code    year    max         grp
---------------------------------
NULL    NULL    52409.00    3 <- grand total

NULL    2012    52409.00    2 <- group by year
NULL    2010    48358.00    2
NULL    2011    51791.00    2

CA      NULL    52409.00    1 <- group by code
US      NULL    51755.00    1

US      2010    48358.00    0 <- group by code and year
CA      2012    52409.00    0
US      2012    51755.00    0
CA      2011    51791.00    0
US      2011    49855.00    0

And because this is quite a common operation in reporting and in OLAP, we can simply write the same by using the CUBE() function:

select 
  code, year, max(gdp_per_capita), 
  grouping_id(code, year) grp
from countries
where gdp_per_capita >= 48000
group by cube(code, year)
order by grp desc;
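
To make the relationship between CUBE() and the explicit GROUPING SETS list tangible, the following sketch expands a column list into all of its grouping sets and computes the bitmask that GROUPING_ID() reports for each of them. It is plain Python with hypothetical function names, and it models only the expansion, not the aggregation itself.

```python
from itertools import combinations


def cube(columns):
    """Return every grouping set that CUBE(columns) expands to, largest first."""
    cols = list(columns)
    return [
        subset
        for size in range(len(cols), -1, -1)
        for subset in combinations(cols, size)
    ]


def grouping_id(columns, grouping_set):
    """Bitmask in the style of GROUPING_ID(): a 1 bit marks a column that
    is NOT part of the grouping set, i.e. one that was aggregated away."""
    value = 0
    for col in columns:
        value = (value << 1) | (0 if col in grouping_set else 1)
    return value


columns = ["code", "year"]
for grouping_set in cube(columns):
    print(grouping_set, grouping_id(columns, grouping_set))
```

For two columns this yields the four sets from the GROUPING SETS query above, with the ids 0 for (code, year), 1 for (code), 2 for (year) and 3 for the grand total, matching the grp column in the result.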

Compatibility

While the first couple of queries also worked on PostgreSQL, the ones that are using GROUPING SETS will work only on 4 out of 17 RDBMS currently supported by jOOQ. These are:

  • DB2
  • Oracle
  • SQL Server
  • Sybase SQL Anywhere

jOOQ also fully supports the previously mentioned syntaxes. The GROUPING SETS variant can be written as such:

// Countries is an object generated by the jOOQ
// code generator for the COUNTRIES table.
Countries c = COUNTRIES;

ctx.select(
       c.CODE,
       c.YEAR,
       max(c.GDP_PER_CAPITA),
       groupingId(c.CODE, c.YEAR).as("grp"))
   .from(c)
   .where(c.GDP_PER_CAPITA.ge(new BigDecimal("48000")))
   .groupBy(groupingSets(new Field[][] {
       {},
       { c.CODE },
       { c.YEAR },
       { c.CODE, c.YEAR }
   }))
   .orderBy(fieldByName("grp").desc())
   .fetch();

… or the CUBE() version:

ctx.select(
       c.CODE,
       c.YEAR,
       max(c.GDP_PER_CAPITA),
       groupingId(c.CODE, c.YEAR).as("grp"))
   .from(c)
   .where(c.GDP_PER_CAPITA.ge(new BigDecimal("48000")))
   .groupBy(cube(c.CODE, c.YEAR))
   .orderBy(fieldByName("grp").desc())
   .fetch();

… and in the future, we’ll emulate GROUPING SETS by their equivalent UNION ALL queries in those databases that do not natively support GROUPING SETS.

Try it out for yourself and download your free jOOQ trial now!

Awesome SQL Trick: Constraints on Views

CHECK constraints are already pretty great when you want to sanitize your data. But there are some limitations to CHECK constraints, including the fact that they are applied to the table itself, when sometimes, you want to specify constraints that only apply in certain situations.

This can be done with the SQL standard WITH CHECK OPTION clause, which is implemented by at least Oracle and SQL Server. Here’s how to do that:

CREATE TABLE books (
  id    NUMBER(10)         NOT NULL,
  title VARCHAR2(100 CHAR) NOT NULL,
  price NUMBER(10, 2)      NOT NULL,
  
  CONSTRAINT pk_book PRIMARY KEY (id)
);
/

CREATE VIEW expensive_books
AS
SELECT id, title, price
FROM books
WHERE price > 100
WITH CHECK OPTION;
/

INSERT INTO books 
VALUES (1, '1984', 35.90);

INSERT INTO books 
VALUES (
  2, 
  'The Answer to Life, the Universe, and Everything',
  999.90
);

As you can see, expensive_books are all those books whose price is more than 100.00. This view will only report the second book:

SELECT * FROM expensive_books;

The above query yields:

ID TITLE                                       PRICE
-- ----------------------------------------- -------
 2 The Answer to Life, the Universe, and ...   999.9 

But now that we have the CHECK OPTION, we can also prevent users from inserting “expensive books” that aren’t really expensive. For instance, let’s run this query:

INSERT INTO expensive_books 
VALUES (3, '10 Reasons why jOOQ is Awesome', 9.99);

This query won’t work now. We’re getting:

ORA-01402: view WITH CHECK OPTION where-clause violation

We also cannot update any of the “expensive books” to be non-expensive:

UPDATE expensive_books
SET price = 9.99;

This query results in the same ORA-01402 error message.
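
For databases that lack WITH CHECK OPTION, a similar guard can sometimes be approximated by other means. The following sketch is an assumption-laden illustration, not an equivalent of the Oracle feature: it uses SQLite via Python's sqlite3 module, where views are read-only unless an INSTEAD OF trigger is defined, and lets such a trigger reject rows the view would not show. The trigger name, the error text and the helper functions are hypothetical, and only INSERT is covered.

```python
import sqlite3


def make_db():
    """Create the books table, the view, and a trigger guarding inserts."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE books (
          id    INTEGER PRIMARY KEY,
          title TEXT    NOT NULL,
          price NUMERIC NOT NULL
        );

        CREATE VIEW expensive_books AS
        SELECT id, title, price FROM books WHERE price > 100;

        -- Stand-in for WITH CHECK OPTION: abort if the new row would not
        -- be visible through the view, otherwise forward it to the table.
        CREATE TRIGGER expensive_books_insert
        INSTEAD OF INSERT ON expensive_books
        BEGIN
          SELECT RAISE(ABORT, 'view WHERE clause violation')
          WHERE NOT (NEW.price > 100);
          INSERT INTO books (id, title, price)
          VALUES (NEW.id, NEW.title, NEW.price);
        END;
        """
    )
    return conn


def try_insert(conn, book):
    """Insert through the view; return False if the database rejects the row."""
    try:
        conn.execute("INSERT INTO expensive_books VALUES (?, ?, ?)", book)
        return True
    except sqlite3.DatabaseError:
        return False


conn = make_db()
print(try_insert(conn, (2, "The Answer to Life, the Universe, and Everything", 999.90)))
print(try_insert(conn, (3, "10 Reasons why jOOQ is Awesome", 9.99)))
print(conn.execute("SELECT id FROM books").fetchall())
```

The first insert passes through to the books table, while the second one is aborted by the trigger, which is roughly the behaviour the ORA-01402 error enforces in Oracle.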

Inline WITH CHECK OPTION

In case you need to locally prevent bogus data from being inserted into a table, you can also use inline WITH CHECK OPTION clauses like so:

INSERT INTO (
  SELECT *
  FROM expensive_books
  WHERE price > 1000
  WITH CHECK OPTION
) really_expensive_books
VALUES (3, 'Modern Enterprise Software', 999.99);

And the above query again results in an ORA-01402 error.

Using SQL transformation to generate ad-hoc constraints

While CHECK OPTION is very useful for stored views, which can have proper grants for those users that may not access the underlying table directly, the inline CHECK OPTION is mainly useful when you transform dynamic SQL in an intermediate SQL transformation layer in your application.

This can be done with jOOQ’s SQL transformation capabilities, for instance, where you can watch out for a certain table in your SQL statements, and then centrally prevent bogus DML from being executed. This is a great way to implement multi-tenancy, if your database doesn’t natively support row-level security.

Stay tuned for a future blog post explaining how to transform your SQL with jOOQ to implement row-level security for any database.

Are You Using SQL PIVOT Yet? You Should!

Every once in a while, we run into these rare SQL issues where we’d like to do something that seems out of the ordinary. One of these things is pivoting rows to columns.

A recent question on Stack Overflow by Valiante asked for precisely this. Going from this table:

+------+------------+----------------+-------------------+
| dnId | propNameId |  propertyName  |   propertyValue   |
+------+------------+----------------+-------------------+
|    1 |         10 | objectsid      | S-1-5-32-548      |
|    1 |         19 | _objectclass   | group             |
|    1 |         80 | cn             | Account Operators |
|    1 |         82 | samaccountname | Account Operators |
|    1 |         85 | name           | Account Operators |
|    2 |         10 | objectsid      | S-1-5-32-544      |
|    2 |         19 | _objectclass   | group             |
|    2 |         80 | cn             | Administrators    |
|    2 |         82 | samaccountname | Administrators    |
|    2 |         85 | name           | Administrators    |
|    3 |         10 | objectsid      | S-1-5-32-551      |
|    3 |         19 | _objectclass   | group             |
|    3 |         80 | cn             | Backup Operators  |
|    3 |         82 | samaccountname | Backup Operators  |
|    3 |         85 | name           | Backup Operators  |
+------+------------+----------------+-------------------+

… we’d like to transform rows into columns as such:

+------+--------------+--------------+-------------------+-------------------+-------------------+
| dnId |  objectsid   | _objectclass |        cn         |  samaccountname   |       name        |
+------+--------------+--------------+-------------------+-------------------+-------------------+
|    1 | S-1-5-32-548 | group        | Account Operators | Account Operators | Account Operators |
|    2 | S-1-5-32-544 | group        | Administrators    | Administrators    | Administrators    |
|    3 | S-1-5-32-551 | group        | Backup Operators  | Backup Operators  | Backup Operators  |
+------+--------------+--------------+-------------------+-------------------+-------------------+

The idea is that we only want one row per distinct dnId, and then we’d like to transform the property-name-value pairs into columns, one column per property name.

Using Oracle or SQL Server PIVOT

The above transformation is actually quite easy with Oracle and SQL Server, which both support the PIVOT keyword on table expressions.

Here is how the desired result can be produced with SQL Server:

SELECT p.*
FROM (
  SELECT dnId, propertyName, propertyValue
  FROM myTable
) AS t
PIVOT(
  MAX(propertyValue)
  FOR propertyName IN (
    objectsid, 
    _objectclass, 
    cn, 
    samaccountname, 
    name
  )
) AS p;

(SQLFiddle here)

And the same query with a slightly different syntax in Oracle:

SELECT p.*
FROM (
  SELECT dnId, propertyName, propertyValue
  FROM myTable
) t
PIVOT(
  MAX(propertyValue)
  FOR propertyName IN (
    'objectsid'      as "objectsid", 
    '_objectclass'   as "_objectclass", 
    'cn'             as "cn", 
    'samaccountname' as "samaccountname", 
    'name'           as "name"
  )
) p;

(SQLFiddle here)

How does it work?

It is important to understand that PIVOT (much like JOIN) is a keyword that is applied to a table reference in order to transform it. In the above example, we’re essentially transforming the derived table t to form the pivot table p. We could take this further and join p to another derived table like so:

SELECT *
FROM (
  SELECT dnId, propertyName, propertyValue
  FROM myTable
) t
PIVOT(
  MAX(propertyValue)
  FOR propertyName IN (
    'objectsid'      as "objectsid", 
    '_objectclass'   as "_objectclass", 
    'cn'             as "cn", 
    'samaccountname' as "samaccountname", 
    'name'           as "name"
  )
) p
JOIN (
  SELECT dnId, COUNT(*) availableAttributes
  FROM myTable
  GROUP BY dnId
) q USING (dnId);

The above query now makes it possible to find those rows for which there isn’t a name / value pair in every column. If we remove one of the entries from the original table, the above query might return:

| DNID |    OBJECTSID | _OBJECTCLASS |                CN |    SAMACCOUNTNAME |              NAME | AVAILABLEATTRIBUTES |
|------|--------------|--------------|-------------------|-------------------|-------------------|---------------------|
|    1 | S-1-5-32-548 |        group | Account Operators | Account Operators | Account Operators |                   5 |
|    2 | S-1-5-32-544 |        group |    Administrators |            (null) |    Administrators |                   4 |
|    3 | S-1-5-32-551 |        group |  Backup Operators |  Backup Operators |  Backup Operators |                   5 |

jOOQ also supports the SQL PIVOT clause through its API.

What if I don’t have PIVOT?

In simple PIVOT scenarios, users of other databases than Oracle or SQL Server can write an equivalent query that uses GROUP BY and MAX(CASE ...) expressions as documented in this answer here.
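
As a rough illustration of that GROUP BY and MAX(CASE ...) approach, here is a minimal sketch that runs on SQLite through Python's sqlite3 module. It uses a subset of the sample rows and only three of the five property names to stay short. The function name pivot is hypothetical, and the query is one possible formulation rather than the one from the linked answer.

```python
import sqlite3

# A subset of the sample rows: (dnId, propertyName, propertyValue)
ROWS = [
    (1, "objectsid", "S-1-5-32-548"),
    (1, "_objectclass", "group"),
    (1, "cn", "Account Operators"),
    (2, "objectsid", "S-1-5-32-544"),
    (2, "_objectclass", "group"),
    (2, "cn", "Administrators"),
]


def pivot(rows):
    """Turn property name / value rows into one row per dnId."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE myTable (dnId INT, propertyName TEXT, propertyValue TEXT)"
    )
    conn.executemany("INSERT INTO myTable VALUES (?, ?, ?)", rows)
    # Each CASE keeps the value only for "its" property name and is NULL
    # otherwise; MAX ignores the NULLs and picks the remaining value.
    result = conn.execute(
        """
        SELECT dnId,
               MAX(CASE WHEN propertyName = 'objectsid'
                        THEN propertyValue END) AS objectsid,
               MAX(CASE WHEN propertyName = '_objectclass'
                        THEN propertyValue END) AS _objectclass,
               MAX(CASE WHEN propertyName = 'cn'
                        THEN propertyValue END) AS cn
        FROM myTable
        GROUP BY dnId
        ORDER BY dnId
        """
    ).fetchall()
    conn.close()
    return result


for row in pivot(ROWS):
    print(row)
```

A property that is missing for a given dnId simply shows up as NULL in its column, just like the (null) cell in the PIVOT result above.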

SQL Server ALTER TABLE SET DEFAULT

In most databases that support default values in their column DDL, it is also possible to alter that default. An Oracle example:

CREATE TABLE t (
  val NUMBER(7) DEFAULT 1 NOT NULL
);

-- Oops, wrong default, let us change it
ALTER TABLE t MODIFY val DEFAULT -1;

-- Now that is better

Unfortunately, this isn’t possible in SQL Server, where the DEFAULT column property is really a constraint, and probably a constraint whose name you don’t know because it was system generated.

But luckily, jOOQ 3.4 now supports DDL and can abstract this information away from you by generating the following Transact-SQL program:

DECLARE @constraint NVARCHAR(max);
DECLARE @command NVARCHAR(max);

SELECT @constraint = name
FROM sys.default_constraints
WHERE parent_object_id = object_id('t')
AND parent_column_id = columnproperty(
    object_id('t'), 'val', 'ColumnId');

IF @constraint IS NOT NULL
BEGIN
  SET @command = 'ALTER TABLE t DROP CONSTRAINT ' 
    + @constraint;
  EXECUTE sp_executeSQL @command

  SET @command = 'ALTER TABLE t ADD CONSTRAINT ' 
    + @constraint + ' DEFAULT -1 FOR val';
  EXECUTE sp_executeSQL @command
END
ELSE
BEGIN
  SET @command = 'ALTER TABLE t ADD DEFAULT -1 FOR val';
  EXECUTE sp_executeSQL @command
END

This program will either drop the existing constraint and create a new one with the same name, or create an entirely new constraint with a system-generated name.

With jOOQ, you can execute this statement as such:

DSL.using(configuration)
   .alterTable(T)
   .alter(T.VAL)
   .defaultValue(-1)
   .execute();

SQL Server Trick: Circumvent Missing ORDER BY Clause

SQL Server is known to have a very strict interpretation of the SQL standard. For instance, the following expressions or statements are not possible in SQL Server:

-- Get arbitrarily numbered row_numbers
SELECT ROW_NUMBER() OVER ()

-- Skip arbitrary rows
SELECT a
FROM (VALUES (1), (2), (3), (4)) t(a)
OFFSET 3 ROWS

Strictly speaking, that limitation makes sense because the above ROW_NUMBER() or OFFSET expressions are non-deterministic. Two subsequent executions of the same query might produce different results. But then again, any ORDER BY clause is non-deterministic, if you do not order by a strictly UNIQUE expression, such as a primary key.

So, that’s a bit of a pain, because other databases aren’t that strict and after all, you might just not care about explicit ordering for a quick, ad-hoc query, so a “reasonable”, lenient default would be useful.

Constant ORDER BY clauses don’t work

You cannot add a constant ORDER BY clause to window functions either. I.e.:

-- This doesn't work:
SELECT ROW_NUMBER() OVER (ORDER BY 'a')

-- But this does!
SELECT a
FROM (VALUES (1), (2), (3), (4)) t(a)
ORDER BY 'a'
OFFSET 3 ROWS

Note that ORDER BY 'a' uses a constant VARCHAR expression, not a numeric one, as a numeric constant would be interpreted as a column reference by index, which would be non-constant in the second example.

Random column references don’t work

So you’re thinking that you can just add a random column reference? Sometimes you can, but often you cannot:

-- This doesn't work:
SELECT ROW_NUMBER() OVER (
  ORDER BY [no-column-available-here]
)

-- But this does!
SELECT a
FROM (VALUES (1), (2), (3), (4)) t(a)
ORDER BY a
OFFSET 3 ROWS

The above examples show that you do not always have a column reference available in any given SQL expression. There is no useful column that you could refer to from the ROW_NUMBER() function. At the same time, you can write ORDER BY a in the second example, but only if a is a “comparable” value, i.e. not a LOB, such as text or image.

Besides, as we don’t really care about the actual ordering, is it worth ordering the result set by anything at all? Do you happen to have an index on a?

Quasi-constant ORDER BY expressions do work

So, to stay on the safe side, if ever you need a dummy ORDER BY expression in SQL Server, use a quasi-constant expression, like @@version (or @@language, or any of these). The following will always work:

-- This always works:
SELECT ROW_NUMBER() OVER (ORDER BY @@version)

-- So does this:
SELECT a
FROM (VALUES (1), (2), (3), (4)) t(a)
ORDER BY @@version
OFFSET 3 ROWS

From the upcoming jOOQ 3.4, we’ll also generate such synthetic ORDER BY clauses that will help you simplify writing vendor-agnostic SQL in these edge-cases, as we believe that you simply shouldn’t think of these things all the time.

The DBMS of the Year 2013

We have recently blogged about the DB-Engines Ranking and how MongoDB was the only NoSQL store to make it into that ranking’s top 10. Today, this marketing platform offered by solid IT has announced MongoDB to be the DBMS of the year 2013, with PostgreSQL being a close runner-up, followed by Cassandra.

solid IT as a company is slightly biased towards NoSQL, so it’s not surprising that two NoSQL databases are in their top ranking, and the only successful ORDBMS in the market is number two. As we ourselves are “slightly” biased towards SQL, we would like to announce our own DBMS of the year 2013:

SQL Server is the DBMS of the year 2013

… because its SQL dialect Transact-SQL (which Microsoft “shares” with Sybase), is the first SQL-based programming language to make it into TIOBE’s top 10 programming languages.

Congratulations to SQL Server from the jOOQ team!

Add LATERAL Joins or CROSS APPLY to Your SQL Tool Chain

The T-SQL dialect has known the powerful CROSS APPLY and OUTER APPLY JOIN syntaxes for ages. The SQL:1999 standard had introduced almost equivalent “lateral derived tables”, which are finally supported by PostgreSQL 9.3 and by Oracle 12c, the latter having adopted both the SQL standard LATERAL syntax and the T-SQL vendor-specific CROSS APPLY and OUTER APPLY syntaxes.

But what are we even talking about?

SQL features have a unique trait that few other languages have. They are obscure to those who don’t know them, as every language feature introduces a new syntax with new keywords. In this case: APPLY and LATERAL. But it really isn’t so hard to understand. All you do with a CROSS APPLY is a CROSS JOIN between two tables where the right-hand side of the join expression can reference columns from the left-hand side of the join expression. Consider the following example by Martin Smith on Stack Overflow:

Reusing column aliases

SELECT number,
       doubled_number,
       doubled_number_plus_one
FROM master..spt_values
CROSS APPLY (
  SELECT 2 * CAST(number AS BIGINT)
) CA1(doubled_number)
CROSS APPLY (
  SELECT doubled_number + 1
) CA2(doubled_number_plus_one)

See a SQLFiddle of the above example

In this example, we’re selecting numbers from a system table and cross applying a scalar subselect that multiplies each number by two. Then, to the whole table product, we cross apply another scalar subselect, adding one to the doubled number.

This particular example could also be implemented using subqueries in the SELECT clause. But as you can see in the above example, doubled_number_plus_one can be calculated from a previously calculated column in one go. That wouldn’t be so “simple” with subqueries.

Applying table-valued functions to each record

SELECT *
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle)

This example may be even more useful when you want to join a table-valued function to each record of another table.

PostgreSQL’s LATERAL derived tables

In PostgreSQL, this can be done somewhat magically by putting table-valued functions in the SELECT clause:

SELECT x, GENERATE_SERIES(0, x)
FROM (VALUES(0), (1), (2)) t(x)

See a SQLFiddle of the above example

The above yields

| X | GENERATE_SERIES |
|---|-----------------|
| 0 |               0 |
| 1 |               0 |
| 1 |               1 |
| 2 |               0 |
| 2 |               1 |
| 2 |               2 |

Alternatively, since PostgreSQL 9.3, you can use an explicit lateral derived table as such:

SELECT x, y
FROM (VALUES(0), (1), (2)) t(x),
LATERAL GENERATE_SERIES(0, t.x) u(y)

See a SQLFiddle of the above example

Yielding again

| X | Y |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 1 | 1 |
| 2 | 0 |
| 2 | 1 |
| 2 | 2 |
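
If the SQL syntax still feels obscure, it may help to think of a lateral join as a nested loop in which the inner collection is computed from the current outer row. The following plain Python sketch models just that idea. The function names are hypothetical, and the outer flag is only a loose analogy to OUTER APPLY, keeping left rows that have no right-hand match.

```python
def generate_series(start, stop):
    """Inclusive integer series, mirroring GENERATE_SERIES(start, stop)."""
    return range(start, stop + 1)


def lateral_join(left_rows, right_for, outer=False):
    """Pair each left row with every row of a right side computed from it.

    right_for is a function of the left row, which is what distinguishes
    a lateral join from an ordinary cross join.
    """
    result = []
    for x in left_rows:
        matches = list(right_for(x))
        if not matches and outer:
            matches = [None]  # keep the left row, like OUTER APPLY
        result.extend((x, y) for y in matches)
    return result


# Same shape as: FROM (VALUES(0), (1), (2)) t(x), LATERAL GENERATE_SERIES(0, t.x) u(y)
for pair in lateral_join([0, 1, 2], lambda x: generate_series(0, x)):
    print(pair)
```

The printed pairs correspond to the X / Y table above.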

CROSS APPLY and OUTER APPLY in jOOQ 3.3

The above clauses will also be supported in the upcoming editions of jOOQ 3.3 where you can write queries like this one here:

DSL.using(configuration)
   .select()
   .from(AUTHOR)
   .crossApply(
        select(count().as("c"))
       .from(BOOK)
       .where(BOOK.AUTHOR_ID.eq(AUTHOR.ID)))
   .fetch();

Or lateral joins:

DSL.using(configuration)
   .select()
   .from(
        values(row(0), row(1), row(2))
            .as("t", "x"),
        lateral(generateSeries(0,
                fieldByName("t", "x"))
            .as("u", "y")))
   .fetch();

No matter if you’re using jOOQ or native SQL, lateral derived tables or CROSS APPLY should definitely be part of your awesome SQL tool chain!