How to Reduce Syntactic Overhead Using the SQL WINDOW Clause

SQL is a verbose language, and one of the most verbose features are window functions.

In a stack overflow question that I’ve encountered recently, someone asked to calculate the difference between the first and the last value in a time series for any given day:

Input

volume  tstamp
---------------------------
29011   2012-12-28 09:00:00
28701   2012-12-28 10:00:00
28830   2012-12-28 11:00:00
28353   2012-12-28 12:00:00
28642   2012-12-28 13:00:00
28583   2012-12-28 14:00:00
28800   2012-12-29 09:00:00
28751   2012-12-29 10:00:00
28670   2012-12-29 11:00:00
28621   2012-12-29 12:00:00
28599   2012-12-29 13:00:00
28278   2012-12-29 14:00:00

Desired output

first  last   difference  date
------------------------------------
29011  28583  428         2012-12-28
28800  28278  522         2012-12-29

How to write the query?

Notice that the value and timestamp progression do not correlate as it may appear. So, there is not a rule that if Timestamp2 > Timestamp1 then Value2 < Value1. Otherwise, this simple query would work (using PostgreSQL syntax):

SELECT 
  max(volume)               AS first,
  min(volume)               AS last,
  max(volume) - min(volume) AS difference,
  CAST(tstamp AS DATE)      AS date
FROM t
GROUP BY CAST(tstamp AS DATE);

There are several ways to find the first and last values within a group that do not involve window functions. For example:

  • In Oracle, you can use the FIRST and LAST functions, which for some arcane reason are not written FIRST(...) WITHIN GROUP (ORDER BY ...) or LAST(...) WITHIN GROUP (ORDER BY ...), like other sorted set aggregate functions, but some_aggregate_function(...) KEEP (DENSE_RANK FIRST ORDER BY ...). Go figure
  • In PostgreSQL, you could use the DISTINCT ON syntax along with ORDER BY and LIMIT

More details about the various approaches can be found here:
https://blog.jooq.org/2017/09/22/how-to-write-efficient-top-n-queries-in-sql

The best performing approach would be to use an aggregate function like Oracle’s, but few databases have this function. So, we’ll resort to using the FIRST_VALUE and LAST_VALUE window functions:

SELECT DISTINCT
  first_value(volume) OVER (
    PARTITION BY CAST(tstamp AS DATE) 
    ORDER BY tstamp
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS first,
  last_value(volume) OVER (
    PARTITION BY CAST(tstamp AS DATE) 
    ORDER BY tstamp
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS last,
  first_value(volume) OVER (
    PARTITION BY CAST(tstamp AS DATE) 
    ORDER BY tstamp
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) 
  - last_value(volume) OVER (
    PARTITION BY CAST(tstamp AS DATE) 
    ORDER BY tstamp
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS diff,
  CAST(tstamp AS DATE) AS date
FROM t
ORDER BY CAST(tstamp AS DATE)

Oops 🤔

That doesn’t look too readable. But it will yield the correct result. Granted, we could wrap the definition for the columns FIRST and LAST in a derived table, but that would still leave us with two repetitions of the window definition:

PARTITION BY CAST(tstamp AS DATE) 
ORDER BY tstamp
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

WINDOW clause to the rescue

Luckily, at least 3 databases have implemented the SQL standard WINDOW clause:

  • MySQL
  • PostgreSQL
  • Sybase SQL Anywhere

The above query can be refactored to this one:

SELECT DISTINCT
  first_value(volume) OVER w AS first,
  last_value(volume) OVER w AS last,
  first_value(volume) OVER w 
    - last_value(volume) OVER w AS diff,
  CAST(tstamp AS DATE) AS date
FROM t
WINDOW w AS (
  PARTITION BY CAST(tstamp AS DATE) 
  ORDER BY tstamp
  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)
ORDER BY CAST(tstamp AS DATE)

Notice how I can specify a window name with a window specification in a similar way as I can define a common table expression (WITH clause):

WINDOW 
    <window-name> AS (<window-specification>)
{  ,<window-name> AS (<window-specification>)... }

Not only can I reuse entire specifications, I could also build a specification from a partial specification, and reuse only parts. My previous query could have been rewritten as such:

SELECT DISTINCT
  first_value(volume) OVER w3 AS first,
  last_value(volume) OVER w3 AS last,
  first_value(volume) OVER w3 
    - last_value(volume) OVER w3 AS diff,
  CAST(tstamp AS DATE) AS date
FROM t
WINDOW 
  w1 AS (PARTITION BY CAST(tstamp AS DATE)),
  w2 AS (w1 ORDER BY tstamp),
  w3 AS (w2 ROWS BETWEEN UNBOUNDED PRECEDING 
                     AND UNBOUNDED FOLLOWING)
ORDER BY CAST(tstamp AS DATE)

Each window specification can be created from scratch, or be based on a previously defined window specification. Note this is also true when referencing the window definition. If I wanted to reuse the PARTITION BY clause and the ORDER BY clause, but change the FRAME clause (ROWS ...), then I could have written this:

SELECT DISTINCT
  first_value(volume) OVER (
    w2 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS first,
  last_value(volume) OVER (
    w2 ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
  ) AS last,
  first_value(volume) OVER (
    w2 ROWS UNBOUNDED PRECEDING
  ) - last_value(volume) OVER (
    w2 ROWS BETWEEN 1 PRECEDING AND UNBOUNDED FOLLOWING
  ) AS diff,
  CAST(tstamp AS DATE) AS date
FROM t
WINDOW 
  w1 AS (PARTITION BY CAST(tstamp AS DATE)),
  w2 AS (w1 ORDER BY tstamp)
ORDER BY CAST(tstamp AS DATE)

What if my database doesn’t support the WINDOW clause?

In that case, you have to either manually write the window specification on each window function, or you use a SQL builder like jOOQ, which can emulate the window clause:

You can try this translation online on our website: https://www.jooq.org/translate

SAP’s Hilarious SQL Whitepaper(s)

While looking for some authoritative information about Sybase SQL Anywhere 12’s TOP .. START AT clause, I stumbled upon this hilarious white paper here, which I do not want to keep from you:
http://www.sybase.com/files/White_Papers/Sybase_Top_10_Features_In_SQL_Anywhere_12.pdf

I will take advantage of “fair use policy” and cite parts from section 7:

Feature number 7: improved support for DaffySQL syntax

If I told you that RowGenerator.row_num contains the values 1 through 255, what would you say this query returned?

[Query example]

Give up? OK, how about this one?

[Query example]

Still stumped? If I told you they both returned exactly the same result set as the following query, what would you say?

[Query example]

Yes, the LIMIT clause is new to SQL Anywhere 12, exactly the same as TOP START AT except it uses zero as the starting point for numbering rows instead of 1.

An “offset”, get it?

As in “Here’s ten dollars, let me count it for you: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9.”

Why implement LIMIT? And why include it in a list of cool features?

Because there are a lot of MySQL users out there who don’t have TOP START AT, and they’ve written zillions of queries using LIMIT, and they’d like to migrate their apps to SQL Anywhere without rewriting everything. And PostgreSQL users too… welcome aboard!

Migrating to SQL Anywhere is definitely cool.

So be cool and migrate to SQL Anywhere already! :-) I’m now going through the rest of this fun document.

GROUP BY ROLLUP / CUBE

Every now and then, you come across a requirement that will bring you to your SQL limits. Many of us probably give up early and calculate stuff in Java / [or your language]. Instead, it might’ve been so easy and fast to do with SQL. If you’re working with an advanced database, such as DB2, Oracle, SQL Server, Sybase SQL Anywhere, (and MySQL in this case, which supports the WITH ROLLUP clause), you can take advantage of the ROLLUP / CUBE / GROUPING SETS grouping functions.

Lets have a look at my fictional salary progression compared to that of a fictional friend, who has chosen a different career path (observe the salary boost in 2011):

select 'Lukas'      as employee, 
       'SoftSkills' as company, 
	   80000        as salary, 
	   2007         as year 
from dual
union all select 'Lukas', 'SoftSkills', 80000,  2008 from dual
union all select 'Lukas', 'SmartSoft',  90000,  2009 from dual
union all select 'Lukas', 'SmartSoft',  95000,  2010 from dual
union all select 'Lukas', 'jOOQ',       200000, 2011 from dual
union all select 'Lukas', 'jOOQ',       250000, 2012 from dual
union all select 'Tom',   'SoftSkills', 89000,  2007 from dual
union all select 'Tom',   'SoftSkills', 90000,  2008 from dual
union all select 'Tom',   'SoftSkills', 91000,  2009 from dual
union all select 'Tom',   'SmartSoft',  92000,  2010 from dual
union all select 'Tom',   'SmartSoft',  93000,  2011 from dual
union all select 'Tom',   'SmartSoft',  94000,  2012 from dual

Now we’re used to gathering statistics using simple grouping and simple aggregate functions. For instance, let’s calculate how much Lukas and Tom earned on average over the past few years:

with data as ([above select])
select employee, avg(salary)
from data
group by employee

This will show that Lukas has earned more:

+--------+-----------+
|EMPLOYEE|AVG(SALARY)|
+--------+-----------+
|Lukas   |     132500|
|Tom     |      91500|
+--------+-----------+

So it’s probably interesting to find out what they have earned on average in which company:

with data as (...)
select company, employee, avg(salary)
from data
group by company, employee
order by company, employee

And immediately, it becomes clear where the big bucks are and that Tom has made a bad decision ;-)

+----------+--------+-----------+
|COMPANY   |EMPLOYEE|AVG(SALARY)|
+----------+--------+-----------+
|jOOQ      |Lukas   |     225000|
|SmartSoft |Lukas   |      92500|
|SmartSoft |Tom     |      93000|
|SoftSkills|Lukas   |      80000|
|SoftSkills|Tom     |      90000|
+----------+--------+-----------+

ROLLUP

By adding grouping fields, we “lose” some aggregation information. In the above examples, the overall average salary per employee is no longer available directly from the result. That’s obvious, considering the grouping algorithm. But in nice-looking reports, we often want to display those grouping headers as well. This is where ROLLUP, CUBE (and GROUPING SETS) come into play. Consider the following query:

with data as (...)
select company, employee, avg(salary)
from data
group by rollup(company), employee

The above rollup function will now add additional rows to the grouping result set, holding useful aggregated values. In this case, when we “roll up the salaries of the company”, we will get the average of the remaining grouping fields, i.e. the average per employee:

+----------+--------+-----------+
|COMPANY   |EMPLOYEE|AVG(SALARY)|
+----------+--------+-----------+
|SmartSoft |Tom     |      93000|
|SoftSkills|Tom     |      90000|
|{null}    |Tom     |      91500|
|jOOQ      |Lukas   |     225000|
|SmartSoft |Lukas   |      92500|
|SoftSkills|Lukas   |      80000|
|{null}    |Lukas   |     132500|
+----------+--------+-----------+

Note how these rows hold the same information as the ones from the first query, where we were only grouping by employee… This becomes even more interesting, when we put more grouping fields into the rollup function:

with data as (...)
select company, employee, avg(salary)
from data
group by rollup(employee, company)

As you can see, the order of grouping fields is important in the rollup function. The result from this query now also adds the overall average salary paid to all employees in all companies

+----------+--------+-----------+
|COMPANY   |EMPLOYEE|AVG(SALARY)|
+----------+--------+-----------+
|SmartSoft |Tom     |      93000|
|SoftSkills|Tom     |      90000|
|{null}    |Tom     |      91500|
|jOOQ      |Lukas   |     225000|
|SmartSoft |Lukas   |      92500|
|SoftSkills|Lukas   |      80000|
|{null}    |Lukas   |     132500|
|{null}    |{null}  |     112000|
+----------+--------+-----------+

In order to identify the totals rows for reporting, you can use the GROUPING() function in DB2, Oracle, SQL Server and Sybase SQL Anywhere. In Oracle and SQL Server, there’s the even more useful GROUPING_ID() function:

with data as (...)
select grouping_id(employee, company) id, company, employee, avg(salary)
from data
group by rollup(employee, company)

It documents on what “grouping level” of the rollup function the current row was produced:

+----+----------+--------+-----------+
|  ID|COMPANY   |EMPLOYEE|AVG(SALARY)|
+----+----------+--------+-----------+
|   0|SmartSoft |Tom     |      93000|
|   0|SoftSkills|Tom     |      90000|
|   1|{null}    |Tom     |      91500|
|   0|jOOQ      |Lukas   |     225000|
|   0|SmartSoft |Lukas   |      92500|
|   0|SoftSkills|Lukas   |      80000|
|   1|{null}    |Lukas   |     132500|
|   3|{null}    |{null}  |     112000|
+----+----------+--------+-----------+

CUBE

The cube function works similar, except that the order of cube grouping fields becomes irrelevant, as all combinations of grouping are combined. This is a bit tricky to put in words, so lets put it in action:

with data as (...)
select grouping_id(employee, company) id, company, employee, avg(salary)
from data
group by cube(employee, company)

In the following result, you will get:

  • GROUPING_ID() = 0: Average per company and employee. This is the normal grouping result
  • GROUPING_ID() = 1: Average per employee
  • GROUPING_ID() = 2: Average per company
  • GROUPING_ID() = 3: Overall average
+----+----------+--------+-----------+
|  ID|COMPANY   |EMPLOYEE|AVG(SALARY)|
+----+----------+--------+-----------+
|   3|{null}    |{null}  |     112000|
|   2|jOOQ      |{null}  |     225000|
|   2|SmartSoft |{null}  |      92800|
|   2|SoftSkills|{null}  |      86000|
|   1|{null}    |Tom     |      91500|
|   0|SmartSoft |Tom     |      93000|
|   0|SoftSkills|Tom     |      90000|
|   1|{null}    |Lukas   |     132500|
|   0|jOOQ      |Lukas   |     225000|
|   0|SmartSoft |Lukas   |      92500|
|   0|SoftSkills|Lukas   |      80000|
+----+----------+--------+-----------+

In other words, using the CUBE() function, you will get grouping results for every possible combination of the grouping fields supplied to the CUBE() function, which results in 2^n GROUPING_ID()’s for n “cubed” grouping fields

Support in jOOQ

jOOQ 2.0 introduces support for these functions. If you want to translate the last select into jOOQ, you’d roughly get this Java code:

// assuming that DATA is an actual table...
create.select(
         groupingId(DATA.EMPLOYEE, DATA.COMPANY).as("id"),
         DATA.COMPANY, DATA.EMPLOYEE, avg(SALARY))
      .from(DATA)
      .groupBy(cube(DATA.EMPLOYEE, DATA.COMPANY));

With this powerful tool, you’re ready for all of those fancy reports and data overviews. For more details, read on about ROLLUP(), CUBE(), and GROUPING SETS() functions on the SQL Server documentation page, which explains it quite nicely:

http://msdn.microsoft.com/en-us/library/bb522495.aspx