How to Reduce Syntactic Overhead Using the SQL WINDOW Clause

SQL is a verbose language, and one of the most verbose features are window functions.

In a stack overflow question that I’ve encountered recently, someone asked to calculate the difference between the first and the last value in a time series for any given day:

Input

volume  tstamp
---------------------------
29011   2012-12-28 09:00:00
28701   2012-12-28 10:00:00
28830   2012-12-28 11:00:00
28353   2012-12-28 12:00:00
28642   2012-12-28 13:00:00
28583   2012-12-28 14:00:00
28800   2012-12-29 09:00:00
28751   2012-12-29 10:00:00
28670   2012-12-29 11:00:00
28621   2012-12-29 12:00:00
28599   2012-12-29 13:00:00
28278   2012-12-29 14:00:00

Desired output

first  last   difference  date
------------------------------------
29011  28583  428         2012-12-28
28800  28278  522         2012-12-29

How to write the query?

Notice that the value and timestamp progression do not correlate as it may appear. So, there is not a rule that if Timestamp2 > Timestamp1 then Value2 < Value1. Otherwise, this simple query would work (using PostgreSQL syntax):

SELECT 
  max(volume)               AS first,
  min(volume)               AS last,
  max(volume) - min(volume) AS difference,
  CAST(tstamp AS DATE)      AS date
FROM t
GROUP BY CAST(tstamp AS DATE);

There are several ways to find the first and last values within a group that do not involve window functions. For example:

  • In Oracle, you can use the FIRST and LAST functions, which for some arcane reason are not written FIRST(...) WITHIN GROUP (ORDER BY ...) or LAST(...) WITHIN GROUP (ORDER BY ...), like other sorted set aggregate functions, but some_aggregate_function(...) KEEP (DENSE_RANK FIRST ORDER BY ...). Go figure
  • In PostgreSQL, you could use the DISTINCT ON syntax along with ORDER BY and LIMIT

More details about the various approaches can be found here:
https://blog.jooq.org/2017/09/22/how-to-write-efficient-top-n-queries-in-sql

The best performing approach would be to use an aggregate function like Oracle’s, but few databases have this function. So, we’ll resort to using the FIRST_VALUE and LAST_VALUE window functions:

SELECT DISTINCT
  first_value(volume) OVER (
    PARTITION BY CAST(tstamp AS DATE) 
    ORDER BY tstamp
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS first,
  last_value(volume) OVER (
    PARTITION BY CAST(tstamp AS DATE) 
    ORDER BY tstamp
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS last,
  first_value(volume) OVER (
    PARTITION BY CAST(tstamp AS DATE) 
    ORDER BY tstamp
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) 
  - last_value(volume) OVER (
    PARTITION BY CAST(tstamp AS DATE) 
    ORDER BY tstamp
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS diff,
  CAST(tstamp AS DATE) AS date
FROM t
ORDER BY CAST(tstamp AS DATE)

Oops 🤔

That doesn’t look too readable. But it will yield the correct result. Granted, we could wrap the definition for the columns FIRST and LAST in a derived table, but that would still leave us with two repetitions of the window definition:

PARTITION BY CAST(tstamp AS DATE) 
ORDER BY tstamp
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

WINDOW clause to the rescue

Luckily, at least 3 databases have implemented the SQL standard WINDOW clause:

  • MySQL
  • PostgreSQL
  • Sybase SQL Anywhere

The above query can be refactored to this one:

SELECT DISTINCT
  first_value(volume) OVER w AS first,
  last_value(volume) OVER w AS last,
  first_value(volume) OVER w 
    - last_value(volume) OVER w AS diff,
  CAST(tstamp AS DATE) AS date
FROM t
WINDOW w AS (
  PARTITION BY CAST(tstamp AS DATE) 
  ORDER BY tstamp
  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)
ORDER BY CAST(tstamp AS DATE)

Notice how I can specify a window name with a window specification in a similar way as I can define a common table expression (WITH clause):

WINDOW 
    <window-name> AS (<window-specification>)
{  ,<window-name> AS (<window-specification>)... }

Not only can I reuse entire specifications, I could also build a specification from a partial specification, and reuse only parts. My previous query could have been rewritten as such:

SELECT DISTINCT
  first_value(volume) OVER w3 AS first,
  last_value(volume) OVER w3 AS last,
  first_value(volume) OVER w3 
    - last_value(volume) OVER w3 AS diff,
  CAST(tstamp AS DATE) AS date
FROM t
WINDOW 
  w1 AS (PARTITION BY CAST(tstamp AS DATE)),
  w2 AS (w1 ORDER BY tstamp),
  w3 AS (w2 ROWS BETWEEN UNBOUNDED PRECEDING 
                     AND UNBOUNDED FOLLOWING)
ORDER BY CAST(tstamp AS DATE)

Each window specification can be created from scratch, or be based on a previously defined window specification. Note this is also true when referencing the window definition. If I wanted to reuse the PARTITION BY clause and the ORDER BY clause, but change the FRAME clause (ROWS ...), then I could have written this:

SELECT DISTINCT
  first_value(volume) OVER (
    w2 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS first,
  last_value(volume) OVER (
    w2 ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
  ) AS last,
  first_value(volume) OVER (
    w2 ROWS UNBOUNDED PRECEDING
  ) - last_value(volume) OVER (
    w2 ROWS BETWEEN 1 PRECEDING AND UNBOUNDED FOLLOWING
  ) AS diff,
  CAST(tstamp AS DATE) AS date
FROM t
WINDOW 
  w1 AS (PARTITION BY CAST(tstamp AS DATE)),
  w2 AS (w1 ORDER BY tstamp)
ORDER BY CAST(tstamp AS DATE)

What if my database doesn’t support the WINDOW clause?

In that case, you have to either manually write the window specification on each window function, or you use a SQL builder like jOOQ, which can emulate the window clause:

You can try this translation online on our website: https://www.jooq.org/translate

Use this Neat Window Function Trick to Calculate Time Differences in a Time Series

Whenever you feel that itch…

Can’t I calculate this with SQL?

The answer is: Yes you can! And you should! Let’s see how…

Calculating time differences between rows

Let’s consider the following database containing timestamps (e.g. in a log database). We’re using PostgreSQL syntax for this:

CREATE TABLE timestamps (
  ts timestamp
);

INSERT INTO timestamps VALUES
  ('2015-05-01 12:15:23.0'),
  ('2015-05-01 12:15:24.0'),
  ('2015-05-01 12:15:27.0'),
  ('2015-05-01 12:15:31.0'),
  ('2015-05-01 12:15:40.0'),
  ('2015-05-01 12:15:55.0'),
  ('2015-05-01 12:16:01.0'),
  ('2015-05-01 12:16:03.0'),
  ('2015-05-01 12:16:04.0'),
  ('2015-05-01 12:16:04.0');

Obviously, you’ll be adding constraints and indexes, etc. Now, let’s assume that each individual timestamp represents an event in your system, and you’d like to keep track of how long ago the previous event has happened. I.e. you’d like the following result:

ts                   delta
-------------------------------
2015-05-01 12:15:23
2015-05-01 12:15:24  00:00:01
2015-05-01 12:15:27  00:00:03
2015-05-01 12:15:31  00:00:04
2015-05-01 12:15:40  00:00:09
2015-05-01 12:15:55  00:00:15
2015-05-01 12:16:01  00:00:06
2015-05-01 12:16:03  00:00:02
2015-05-01 12:16:04  00:00:01
2015-05-01 12:16:04  00:00:00

In other words

  • ts1 (12:15:23) + delta (00:00:01) = ts2 (12:15:24)
  • ts2 (12:15:24) + delta (00:00:03) = ts3 (12:15:27)

This can be achieved very easily with the LAG() window function:

SELECT 
  ts, 
  ts - lag(ts, 1) OVER (ORDER BY ts) delta
FROM timestamps
ORDER BY ts;

The above reads simply:

Give me the difference between the ts value of the current row and the ts value of the row that “lags” behind this row by one, with rows ordered by ts.

Easy, right? With LAG() you can actually access any row from another row within a “sliding window” by simply specifying the lag index.

We’ve already described this wonderful window function in a previous blog post.

Bonus: A running total interval

In addition to the difference between this timestamp and the previous one, we might be interested in the total difference between this timestamp and the first timestamp. This may sound like a running total (see our previous article about running totals using SQL), but it can be calculated much more easily using FIRST_VALUE() – a “cousin” of LAG()

SELECT 
  ts, 
  ts - lag(ts, 1) OVER w delta,
  ts - first_value(ts) OVER w total
FROM timestamps
WINDOW w AS (ORDER BY ts)
ORDER BY ts;

… the above query then yields

ts                   delta     total
---------------------------------------
2015-05-01 12:15:23            00:00:00
2015-05-01 12:15:24  00:00:01  00:00:01
2015-05-01 12:15:27  00:00:03  00:00:04
2015-05-01 12:15:31  00:00:04  00:00:08
2015-05-01 12:15:40  00:00:09  00:00:17
2015-05-01 12:15:55  00:00:15  00:00:32
2015-05-01 12:16:01  00:00:06  00:00:38
2015-05-01 12:16:03  00:00:02  00:00:40
2015-05-01 12:16:04  00:00:01  00:00:41
2015-05-01 12:16:04  00:00:00  00:00:41

Extra bonus: The total since a “reset” event

We can take this as far as we want. Let’s assume that we want to reset the total from time to time:

CREATE TABLE timestamps (
  ts timestamp,
  event varchar(50)
);

INSERT INTO timestamps VALUES
  ('2015-05-01 12:15:23.0', null),
  ('2015-05-01 12:15:24.0', null),
  ('2015-05-01 12:15:27.0', 'reset'),
  ('2015-05-01 12:15:31.0', null),
  ('2015-05-01 12:15:40.0', null),
  ('2015-05-01 12:15:55.0', 'reset'),
  ('2015-05-01 12:16:01.0', null),
  ('2015-05-01 12:16:03.0', null),
  ('2015-05-01 12:16:04.0', null),
  ('2015-05-01 12:16:04.0', null);

We can now run the following query:

SELECT
  ts, 
  ts - lag(ts, 1) 
       OVER (ORDER BY ts) delta,
  ts - first_value(ts) 
       OVER (PARTITION BY c ORDER BY ts) total
FROM (
  SELECT 
    COUNT(*) FILTER (WHERE EVENT = 'reset') 
             OVER (ORDER BY ts) c,
    ts
  FROM timestamps
) timestamps
ORDER BY ts;

… to produce

ts                   delta     total
---------------------------------------
2015-05-01 12:15:23            00:00:00
2015-05-01 12:15:24  00:00:01  00:00:01
2015-05-01 12:15:27  00:00:03  00:00:00 <-- reset
2015-05-01 12:15:31  00:00:04  00:00:04
2015-05-01 12:15:40  00:00:09  00:00:13
2015-05-01 12:15:55  00:00:15  00:00:00 <-- reset
2015-05-01 12:16:01  00:00:06  00:00:06
2015-05-01 12:16:03  00:00:02  00:00:08
2015-05-01 12:16:04  00:00:01  00:00:09
2015-05-01 12:16:04  00:00:00  00:00:09

The beautiful part is in the derived table

  SELECT 
    COUNT(*) FILTER (WHERE EVENT = 'reset') 
             OVER (ORDER BY ts) c,
    ts
  FROM timestamps

This derived table just adds the “partition” to each set of timestamps given the most recent “reset” event. The result of the above subquery is:

c  ts
----------------------
0  2015-05-01 12:15:23
0  2015-05-01 12:15:24
1  2015-05-01 12:15:27 <-- reset
1  2015-05-01 12:15:31
1  2015-05-01 12:15:40
2  2015-05-01 12:15:55 <-- reset
2  2015-05-01 12:16:01
2  2015-05-01 12:16:03
2  2015-05-01 12:16:04
2  2015-05-01 12:16:04

As you can see, the COUNT(*) window function counts all the previous “reset” events, ordered by timestamp. This information can then be used as the PARTITION for the FIRST_VALUE() window function in order to find the first timestamp in each partition, i.e. at the time of the most recent “reset” event:

  ts - first_value(ts) 
       OVER (PARTITION BY c ORDER BY ts) total

Conclusion

It’s almost a running gag on this blog to say that…

There was SQL before window functions and SQL after window functions

Window functions are extremely powerful and they’re a part of the SQL standard, supported in most commercial databases, in PostgreSQL, in Firebird 3.0, and in CUBRID. If you aren’t using them already, start using them today!

If you’ve liked this article, find out more about window functions in any of the following articles:

Don’t Miss out on Awesome SQL Power with FIRST_VALUE(), LAST_VALUE(), LEAD(), and LAG()

If you’re using a commercial database or PostgreSQL / Firebird / CUBRID, you will be able to take advantage of the full power of window functions. We’ve blogged about window functions’ awesomeness a couple of times, in particular about ROW_NUMBER(), RANK(), DENSE_RANK().

Today, we’re going to look into some awesome window functions that produce values of other rows that are positioned before or after the current row.

Setting up the test data

We’re going to do some interesting statistics today using publicly available data from the World Bank. To keep things simple, we’ll only do analyses for the G8 countries:

  • Canada (CA)
  • France (FR)
  • Germany (DE)
  • Italy (IT)
  • Japan (JP)
  • Russian Federation (RU)
  • United Kingdom (GB)
  • United States (US)

And for those countries, let’s consider the following data points for the years 2009-2012:

GDP per capita (current US$)

          2009    2010    2011    2012
CA      40,764  47,465  51,791  52,409	
DE      40,270  40,408  44,355  42,598	
FR      40,488  39,448  42,578  39,759	
GB      35,455  36,573  38,927  38,649	
IT      35,724  34,673  36,988  33,814	
JP      39,473  43,118  46,204  46,548	
RU       8,616  10,710  13,324  14,091	
US      46,999  48,358  49,855  51,755	

Central government debt, total (% of GDP)

          2009    2010    2011    2012
CA        51.3    51.4    52.5    53.5	
DE        47.6    55.5    55.1    56.9	
FR        85.0    89.2    93.2   103.8	
GB        71.7    85.2    99.6   103.2	
IT       121.3   119.9   113.0   131.1	
JP       166.8   174.8   189.5   196.5	
RU         8.7     9.1     9.3     9.4	
US        76.3    85.6    90.1    93.8	

Let’s put all that data into a fact table like so (PostgreSQL syntax):

CREATE TABLE countries (
  code CHAR(2) NOT NULL,
  year INT NOT NULL,
  gdp_per_capita DECIMAL(10, 2) NOT NULL,
  govt_debt DECIMAL(10, 2) NOT NULL
);

INSERT INTO countries
VALUES ('CA', 2009, 40764, 51.3),
       ('CA', 2010, 47465, 51.4),
       ('CA', 2011, 51791, 52.5),
       ('CA', 2012, 52409, 53.5),
       ('DE', 2009, 40270, 47.6),
       ('DE', 2010, 40408, 55.5),
       ('DE', 2011, 44355, 55.1),
       ('DE', 2012, 42598, 56.9),
       ('FR', 2009, 40488, 85.0),
       ('FR', 2010, 39448, 89.2),
       ('FR', 2011, 42578, 93.2),
       ('FR', 2012, 39759,103.8),
       ('GB', 2009, 35455,121.3),
       ('GB', 2010, 36573, 85.2),
       ('GB', 2011, 38927, 99.6),
       ('GB', 2012, 38649,103.2),
       ('IT', 2009, 35724,121.3),
       ('IT', 2010, 34673,119.9),
       ('IT', 2011, 36988,113.0),
       ('IT', 2012, 33814,131.1),
       ('JP', 2009, 39473,166.8),
       ('JP', 2010, 43118,174.8),
       ('JP', 2011, 46204,189.5),
       ('JP', 2012, 46548,196.5),
       ('RU', 2009,  8616,  8.7),
       ('RU', 2010, 10710,  9.1),
       ('RU', 2011, 13324,  9.3),
       ('RU', 2012, 14091,  9.4),
       ('US', 2009, 46999, 76.3),
       ('US', 2010, 48358, 85.6),
       ('US', 2011, 49855, 90.1),
       ('US', 2012, 51755, 93.8);

Start the querying fun

People who are used to SQL-92 syntax will be able to quickly find the highest GDP per capita or the highest debt from the table. It’s an easy query like this one:

SELECT MAX(gdp_per_capita), MAX(govt_debt)
FROM countries;

Which will return:

52409.00    196.50

But that’s not interesting. We don’t even know what countries and what years these values are associated with.

A standard SQL-92 (and also a standard relational) query to return all of these values would look something like this:

SELECT 
  'highest gdp per capita' AS what,
  c1.*
FROM countries c1
WHERE NOT EXISTS (
  SELECT 1
  FROM countries c2
  WHERE c1.gdp_per_capita < c2.gdp_per_capita
)
UNION ALL
SELECT
  'highest government debt' AS what,
  c1.*
FROM countries c1
WHERE NOT EXISTS (
  SELECT 1
  FROM countries c2
  WHERE c1.govt_debt < c2.govt_debt
)

In essence, we select those rows for which there doesn’t exist any other row with a higher value for either gdp_per_capita (first subselect) or govt_debt (second subselect).

Trick! Use quantified comparison predicates!

If your database supports quantified comparison predicates, then you can write this a bit more concisely like this:

SELECT 
  'highest gdp per capita' AS what,
  countries.*
FROM countries
WHERE gdp_per_capita >= ALL (
  SELECT gdp_per_capita FROM countries
)
UNION ALL
SELECT
  'highest government debt' AS what,
  countries.*
FROM countries
WHERE govt_debt >= ALL (
  SELECT govt_debt FROM countries
)

Which is essentially the same as…

SELECT 
  'highest gdp per capita' AS what,
  countries.*
FROM countries
WHERE gdp_per_capita = (
  SELECT MAX(gdp_per_capita) FROM countries
)
UNION ALL
SELECT
  'highest government debt' AS what,
  countries.*
FROM countries
WHERE govt_debt = (
  SELECT MAX(govt_debt) FROM countries
)

The output would be:

what                     code year       gdp    debt
----------------------------------------------------
highest gdp per capita   CA   2012  52409.00   53.50
highest government debt  JP   2012  46548.00  196.50

That’s a lot of SQL for only little analysis capability, and somehow, it just doesn’t feel entirely right to query the same table four times with all these subselects!

FIRST_VALUE() and LAST_VALUE()

This is where window functions come into play, and in this particular case, FIRST_VALUE() or LAST_VALUE(). For now, let’s focus on calculating the maximum GDP per capita from the data set:

SELECT
  countries.*,
  FIRST_VALUE (code)           OVER (w_gdp) AS max_gdp_code,
  FIRST_VALUE (year)           OVER (w_gdp) AS max_gdp_year,
  FIRST_VALUE (gdp_per_capita) OVER (w_gdp) AS max_gdp_gdp,
  FIRST_VALUE (govt_debt)      OVER (w_gdp) AS max_gdp_debt
FROM
  countries
WINDOW
  w_gdp  AS (ORDER BY gdp_per_capita DESC)
ORDER BY
  code, year

Notice how we make use of the SQL standard WINDOW clause, which is only currently supported by PostgreSQL and Sybase SQL Anywhere.

If you’re using Oracle or any other commercial database, you can simply substitute the window reference w_gdp into the various OVER() clauses to achieve equivalent behaviour – or you can use jOOQ’s WINDOW clause support and let jOOQ do the same for you.

jOOQ generates Java code from your database and lets you build type safe SQL queries through its fluent API.

The above query will not produce any aggregates, but it will add the values for the country / year with the highest GDP per capita to every row in the table:

each country             highest per year
-----------------------------------------------
CA 2009 40764.00 51.30   CA 2012 52409.00 53.50
CA 2010 47465.00 51.40   CA 2012 52409.00 53.50
CA 2011 51791.00 52.50   CA 2012 52409.00 53.50
CA 2012 52409.00 53.50   CA 2012 52409.00 53.50

This is extremely interesting because the data is not yet aggregated – the original data set remains unchanged, enriched with new computed columns.

You can then further process things, e.g. compare each country / year with the highest GDP per capita and with the highest debt per GDP of that country / year:

SELECT
  countries.*,
  TO_CHAR(100 * gdp_per_capita / FIRST_VALUE (gdp_per_capita) OVER (w_gdp) , '999.99 %') gdp_rank,
  TO_CHAR(100 * govt_debt      / FIRST_VALUE (govt_debt)      OVER (w_debt), '999.99 %') debt_rank
FROM
  countries
WINDOW
  w_gdp  AS (PARTITION BY year ORDER BY gdp_per_capita DESC),
  w_debt AS (PARTITION BY year ORDER BY govt_debt DESC)
ORDER BY
  code, year

Notice how I’ve added PARTITION BY to the window definitions of the WINDOW clause. I’ve done this because I want to partition the data set by year, in order to find the highest GDP / debt values for each year, not for the whole data set.

The outcome of the above query can then be seen here:

country                   percentages
------------------------------------------
CA   2009  40764   51.3    86.73%   30.76%
CA   2010  47465   51.4    98.15%   29.41%
CA   2011  51791   52.5   100.00%   27.70%
CA   2012  52409   53.5   100.00%   27.23%
DE   2009  40270   47.6    85.68%   28.54%
DE   2010  40408   55.5    83.56%   31.75%
DE   2011  44355   55.1    85.64%   29.08%
DE   2012  42598   56.9    81.28%   28.96%
FR   2009  40488   85.0    86.15%   50.96%
FR   2010  39448   89.2    81.57%   51.03%
FR   2011  42578   93.2    82.21%   49.18%
FR   2012  39759  103.8    75.86%   52.82%
GB   2009  35455  121.3    75.44%   72.72%
GB   2010  36573   85.2    75.63%   48.74%
GB   2011  38927   99.6    75.16%   52.56%
GB   2012  38649  103.2    73.74%   52.52%
IT   2009  35724  121.3    76.01%   72.72%
IT   2010  34673  119.9    71.70%   68.59%
IT   2011  36988  113.0    71.42%   59.63%
IT   2012  33814  131.1    64.52%   66.72%
JP   2009  39473  166.8    83.99%  100.00%
JP   2010  43118  174.8    89.16%  100.00%
JP   2011  46204  189.5    89.21%  100.00%
JP   2012  46548  196.5    88.82%  100.00%
RU   2009  8616     8.7    18.33%    5.22%
RU   2010  10710    9.1    22.15%    5.21%
RU   2011  13324    9.3    25.73%    4.91%
RU   2012  14091    9.4    26.89%    4.78%
US   2009  46999   76.3   100.00%   45.74%
US   2010  48358   85.6   100.00%   48.97%
US   2011  49855   90.1    96.26%   47.55%
US   2012  51755   93.8    98.75%   47.74%

We could say that among the G8 countries, Canada has really improved the most in the last years, decreasing their debt compared to the GDP on a global comparison, while at the same time increasing their GDP per capita on a global comparison.

Instead of partitioning the data set by year, we could also partition it by country, and find the best / worst year for each country over the years:

SELECT
  countries.*,
  TO_CHAR(100 * gdp_per_capita / FIRST_VALUE (gdp_per_capita) OVER (w_gdp), '999.99 %') gdp_rank,
  TO_CHAR(100 * govt_debt / FIRST_VALUE (govt_debt) OVER (w_debt), '999.99 %') debt_rank
FROM
  countries
WINDOW
  w_gdp  AS (PARTITION BY code ORDER BY gdp_per_capita DESC),
  w_debt AS (PARTITION BY code ORDER BY govt_debt DESC)
ORDER BY
  code, year

The result would now look quite different:

country                    percentages
------------------------------------------
CA   2009  40764   51.3    77.78%   95.89%
CA   2010  47465   51.4    90.57%   96.07%
CA   2011  51791   52.5    98.82%   98.13%
CA   2012  52409   53.5   100.00%  100.00%
DE   2009  40270   47.6    90.79%   83.66%
DE   2010  40408   55.5    91.10%   97.54%
DE   2011  44355   55.1   100.00%   96.84%
DE   2012  42598   56.9    96.04%  100.00%
FR   2009  40488   85.0    95.09%   81.89%
FR   2010  39448   89.2    92.65%   85.93%
FR   2011  42578   93.2   100.00%   89.79%
FR   2012  39759  103.8    93.38%  100.00%
GB   2009  35455  121.3    91.08%  100.00%
GB   2010  36573   85.2    93.95%   70.24%
GB   2011  38927   99.6   100.00%   82.11%
GB   2012  38649  103.2    99.29%   85.08%
IT   2009  35724  121.3    96.58%   92.52%
IT   2010  34673  119.9    93.74%   91.46%
IT   2011  36988  113.0   100.00%   86.19%
IT   2012  33814  131.1    91.42%  100.00%
JP   2009  39473  166.8    84.80%   84.89%
JP   2010  43118  174.8    92.63%   88.96%
JP   2011  46204  189.5    99.26%   96.44%
JP   2012  46548  196.5   100.00%  100.00%
RU   2009   8616    8.7    61.15%   92.55%
RU   2010  10710    9.1    76.01%   96.81%
RU   2011  13324    9.3    94.56%   98.94%
RU   2012  14091    9.4   100.00%  100.00%
US   2009  46999   76.3    90.81%   81.34%
US   2010  48358   85.6    93.44%   91.26%
US   2011  49855   90.1    96.33%   96.06%
US   2012  51755   93.8   100.00%  100.00%

As you can see, most countries have now generally performed better in terms of GDP per capita over the years, and also most countries have almost strictly increased their own debt per GDP (except for Germany, France and Italy), except for the (United Kingdom). Russia and Canada have seen the most growth.

In the above examples, we’ve been mainly using FIRST_VALUE(). LAST_VALUE() is almost the opposite function with respect to ordering, much like MAX() is the opposite function of MIN(). I’m saying almost because there is a caveat when using LAST_VALUE() with ORDER BY, because a window definition that uses ORDER BY is implicitly equivalent to a window definition that uses ORDER BY with a so-called “frame clause”:

-- Find the "last" year over the complete data set
-- This may not behave as expected, so always provide
-- an explicit ORDER BY clause
LAST_VALUE (year) OVER()

-- These two are implicitly equivalent. We're not
-- looking for the "last" year in the complete data
-- set, but only in the frame that is "before" the
-- current row. In other words, the current row is
-- always the "last value"!
LAST_VALUE (year) OVER(ORDER BY year)
LAST_VALUE (year) OVER(
  ORDER BY year 
  ROWS BETWEEN UNBOUNDED PRECEDING 
           AND CURRENT ROW
)

-- Find the "last" year in the complete data set with
-- explicit ordering
LAST_VALUE (year) OVER(
  ORDER BY year 
  ROWS BETWEEN UNBOUNDED PRECEDING 
           AND UNBOUNDED FOLLOWING
)

LEAD() and LAG()

The previous functions were about comparing values with the maximum / minimum (FIRST_VALUE() and LAST_VALUE()) within a data set. But using window functions, you can also compare things with the next / previous value. Or with the second next / second previous, etc. The functions used for this are called LEAD() (for the next value) and LAG() (for the previous value).

This is best explained by example:

-- Use this view as a data source containing
-- all the distinct years: 2009-2012
WITH years AS (
  SELECT DISTINCT year
  FROM countries
)
SELECT
  FIRST_VALUE (year)    OVER w_year AS first,
  LEAD        (year, 2) OVER w_year AS lead2,
  LEAD        (year)    OVER w_year AS lead1,
  year,
  LAG         (year)    OVER w_year AS lag1,
  LAG         (year, 2) OVER w_year AS lag2,
  LAST_VALUE  (year)    OVER w_year AS last
FROM
  years
WINDOW
  w_year AS (
    ORDER BY year DESC
    ROWS BETWEEN UNBOUNDED PRECEDING
             AND UNBOUNDED FOLLOWING
  )
ORDER BY year

The result is now simply:

first  lead2  lead1  year   lag1   lag2   last
----------------------------------------------
2012                 2009   2010   2011   2009
2012          2009   2010   2011   2012   2009
2012   2009   2010   2011   2012          2009
2012   2010   2011   2012                 2009

LEAD() and LAG() are really the best window functions to help understand the whole concept of window functions. For each year, you can see immediately how the previous and next year in the same window and frame can be generated using very simple function calls.

This could be used, for instance, to find the “neighboring” countries in terms of GDP per capita for every country / year:

SELECT
  year,
  code,
  gdp_per_capita,
  LEAD (code)           OVER w_gdp AS runner_up_code,
  LEAD (gdp_per_capita) OVER w_gdp AS runner_up_gdp,
  LAG  (code)           OVER w_gdp AS leader_code,
  LAG  (gdp_per_capita) OVER w_gdp AS leader_gdp
FROM
  countries
WINDOW
  w_gdp AS (PARTITION BY year ORDER BY gdp_per_capita DESC)
ORDER BY year DESC, gdp_per_capita DESC

Which returns:

year   country      runner-up    leader
------------------------------------------
2012   CA  52409    US  51755
2012   US  51755    JP  46548    CA  52409
2012   JP  46548    DE  42598    US  51755
2012   DE  42598    FR  39759    JP  46548
2012   FR  39759    GB  38649    DE  42598
2012   GB  38649    IT  33814    FR  39759
2012   IT  33814    RU  14091    GB  38649
2012   RU  14091                 IT  33814

2011   CA  51791    US  49855
2011   US  49855    JP  46204    CA  51791
2011   JP  46204    DE  44355    US  49855
2011   DE  44355    FR  42578    JP  46204
2011   FR  42578    GB  38927    DE  44355
2011   GB  38927    IT  36988    FR  42578
2011   IT  36988    RU  13324    GB  38927
2011   RU  13324                 IT  36988

2010   US  48358    CA  47465
2010   CA  47465    JP  43118    US  48358
2010   JP  43118    DE  40408    CA  47465
2010   DE  40408    FR  39448    JP  43118
2010   FR  39448    GB  36573    DE  40408
2010   GB  36573    IT  34673    FR  39448
2010   IT  34673    RU  10710    GB  36573
2010   RU  10710                 IT  34673

2009   US  46999    CA  40764
2009   CA  40764    FR  40488    US  46999
2009   FR  40488    DE  40270    CA  40764
2009   DE  40270    JP  39473    FR  40488
2009   JP  39473    IT  35724    DE  40270
2009   IT  35724    GB  35455    JP  39473
2009   GB  35455    RU   8616    IT  35724
2009   RU   8616                 GB  35455

If you want to do more fancy analyses, you could now compare percentages between leaders and runner-ups, etc. Another great use-case for LEAD() and LAG() can be seen in this article.

Conclusion

Window functions are an incredibly powerful feature that is available from all major commercial databases, and also from a couple of Open Source databases like PostgreSQL, Firebird, and CUBRID. There has essentially been SQL before window functions, and SQL after window functions.

With jOOQ, you can leverage window functions on a type safe level like anything else related to SQL. The last query we’ve seen can be written simply like this:

// Static import the generated tables and all
// of jOOQ's functions from DSL
import static org.jooq.example.db.postgres.Tables.*;
import static org.jooq.impl.DSL.*;

// Shorten the table reference by aliasing
Countries c = COUNTRIES;

// Specifiy a window definition
WindowDefinition w_gdp = 
  name("w_gdp").as(
    partitionBy(c.YEAR)
   .orderBy(c.GDP_PER_CAPITA.desc()
  )
);

// Write the query as if it were native SQL
System.out.println(
    DSL.using(conn)
       .select(
           c.YEAR,
           c.CODE,
           c.GDP_PER_CAPITA,
           lead(c.CODE)          .over(w_gdp).as("runner_up_code"),
           lead(c.GDP_PER_CAPITA).over(w_gdp).as("runner_up_gdp"),
           lag (c.CODE)          .over(w_gdp).as("leader_code"),
           lag (c.GDP_PER_CAPITA).over(w_gdp).as("leader_gdp")
       )
       .from(c)
       .window(w_gdp)
       .orderBy(c.YEAR.desc(), c.GDP_PER_CAPITA.desc())
       .fetch()
);

The above program will output

+----+----+--------------+--------------+-------------+-----------+----------+
|year|code|gdp_per_capita|runner_up_code|runner_up_gdp|leader_code|leader_gdp|
+----+----+--------------+--------------+-------------+-----------+----------+
|2012|CA  |      52409.00|US            |     51755.00|{null}     |    {null}|
|2012|US  |      51755.00|JP            |     46548.00|CA         |  52409.00|
|2012|JP  |      46548.00|DE            |     42598.00|US         |  51755.00|
|2012|DE  |      42598.00|FR            |     39759.00|JP         |  46548.00|
|2012|FR  |      39759.00|GB            |     38649.00|DE         |  42598.00|
|2012|GB  |      38649.00|IT            |     33814.00|FR         |  39759.00|
|2012|IT  |      33814.00|RU            |     14091.00|GB         |  38649.00|
|2012|RU  |      14091.00|{null}        |       {null}|IT         |  33814.00|
|2011|CA  |      51791.00|US            |     49855.00|{null}     |    {null}|
|2011|US  |      49855.00|JP            |     46204.00|CA         |  51791.00|
|2011|JP  |      46204.00|DE            |     44355.00|US         |  49855.00|
|2011|DE  |      44355.00|FR            |     42578.00|JP         |  46204.00|
|2011|FR  |      42578.00|GB            |     38927.00|DE         |  44355.00|
|2011|GB  |      38927.00|IT            |     36988.00|FR         |  42578.00|
|2011|IT  |      36988.00|RU            |     13324.00|GB         |  38927.00|
|2011|RU  |      13324.00|{null}        |       {null}|IT         |  36988.00|
|2010|US  |      48358.00|CA            |     47465.00|{null}     |    {null}|
|2010|CA  |      47465.00|JP            |     43118.00|US         |  48358.00|
|2010|JP  |      43118.00|DE            |     40408.00|CA         |  47465.00|
|2010|DE  |      40408.00|FR            |     39448.00|JP         |  43118.00|
|2010|FR  |      39448.00|GB            |     36573.00|DE         |  40408.00|
|2010|GB  |      36573.00|IT            |     34673.00|FR         |  39448.00|
|2010|IT  |      34673.00|RU            |     10710.00|GB         |  36573.00|
|2010|RU  |      10710.00|{null}        |       {null}|IT         |  34673.00|
|2009|US  |      46999.00|CA            |     40764.00|{null}     |    {null}|
|2009|CA  |      40764.00|FR            |     40488.00|US         |  46999.00|
|2009|FR  |      40488.00|DE            |     40270.00|CA         |  40764.00|
|2009|DE  |      40270.00|JP            |     39473.00|FR         |  40488.00|
|2009|JP  |      39473.00|IT            |     35724.00|DE         |  40270.00|
|2009|IT  |      35724.00|GB            |     35455.00|JP         |  39473.00|
|2009|GB  |      35455.00|RU            |      8616.00|IT         |  35724.00|
|2009|RU  |       8616.00|{null}        |       {null}|GB         |  35455.00|
+----+----+--------------+--------------+-------------+-----------+----------+

jOOQ generates Java code from your database and lets you build type safe SQL queries through its fluent API.

No matter whether you’re using jOOQ for your database integration, or just plain SQL – start using window functions today.

Liked this article?

Read more about how ROW_NUMBER(), RANK(), and DENSE_RANK() work.