How to Fetch All Current Identity Values in Oracle

Oracle 12c has introduced the useful SQL standard IDENTITY feature, which is essentially just syntax sugar for binding a sequence to a column default. We can use it like this:

create table t1 (col1 number generated always as identity);
create table t2 (col2 number generated always as identity);

insert into t1 values (default);
insert into t1 values (default);
insert into t1 values (default);
insert into t2 values (default);

select * from t1;
select * from t2;

Which produces

COL1
----
  1
  2
  3

COL2
----
  1

For unit testing against our database, we might want to know what “state” our identities are in. For each table, we would like to know the next value such an identity would produce. If we knew all the backing sequence names, we could query their seq.currval, but we don’t know those sequence names as they are generated.

However, we can query the dictionary views to get this information as follows:

select data_default
from user_tab_cols
where data_default is not null
and identity_column = 'YES'
and table_name in ('T1', 'T2');

An alternative is to query user_tab_identity_cols

This would produce:

"TEST"."ISEQ$$_116601".nextval
"TEST"."ISEQ$$_116603".nextval

Now, if we’re lazy, we could just run EXECUTE IMMEDIATE on each of those expressions and we’re done:

set serveroutput on
declare
  v_current number;
begin
  for rec in (
    select table_name, data_default
    from user_tab_cols
    where data_default is not null
    and identity_column = 'YES'
    and table_name in ('T1', 'T2')
  ) loop
    execute immediate replace(
      'select ' || rec.data_default || ' from dual', 
      '.nextval', 
      '.currval'
    ) into v_current;
    dbms_output.put_line(
      'Table : ' || rec.table_name || 
      ', currval : ' || v_current
    );
  end loop;
end;
/

This would produce:

Table : T1, currval : 3
Table : T2, currval : 1

Alternatively, if you want this result to be a SQL result instead of DBMS_OUTPUT content, you could run this:

with
  function current_value(p_table_name varchar2) return number is
    v_current number;
  begin
    for rec in (
      select data_default
      from user_tab_cols
      where table_name = p_table_name
      and data_default is not null
      and identity_column = 'YES'
    )
    loop
      execute immediate replace(
        'select ' || rec.data_default || ' from dual', 
        '.nextval', 
        '.currval'
      ) into v_current;
      return v_current;
    end loop;
    
    return null;
  end;
select *
from (
  select table_name, current_value(table_name) current_value
  from user_tables
  where table_name in ('T1', 'T2')
)
where current_value is not null
order by table_name;
/

The alternative using user_tab_identity_cols would look like this:

with
  function current_value(p_table_name varchar2) return number is
    v_current number;
  begin
    for rec in (
      select sequence_name
      from user_tab_identity_cols
      where table_name = p_table_name
    )
    loop
      execute immediate 
        'select ' || rec.sequence_name || '.currval from dual'
      into v_current;
      return v_current;
    end loop;
     
    return null;
  end;
select *
from (
  select table_name, current_value(table_name) current_value
  from user_tables
)
where current_value is not null
order by table_name;
/

The result is now a nice SQL result set:

TABLE_NAME   CURRENT_VALUE
--------------------------
T1           3
T2           1

Using IGNORE NULLS With SQL Window Functions to Fill Gaps

I found a very interesting SQL question on Twitter recently:

Rephrasing the question: We have a set of sparse data points:

+------------+-------+
| VALUE_DATE | VALUE |
+------------+-------+
| 2019-01-01 |   100 |
| 2019-01-02 |   120 |
| 2019-01-05 |   125 |
| 2019-01-06 |   128 |
| 2019-01-10 |   130 |
+------------+-------+

Since dates can be listed as discrete, continuous data points, why not fill in the gaps between 2019-01-02 and 2019-01-05 or 2019-01-06 and 2019-01-10? The desired output would be:

+------------+-------+
| VALUE_DATE | VALUE |
+------------+-------+
| 2019-01-01 |   100 |
| 2019-01-02 |   120 | <-+
| 2019-01-03 |   120 |   | -- Generated
| 2019-01-04 |   120 |   | -- Generated
| 2019-01-05 |   125 |
| 2019-01-06 |   128 | <-+
| 2019-01-07 |   128 |   | -- Generated
| 2019-01-08 |   128 |   | -- Generated
| 2019-01-09 |   128 |   | -- Generated
| 2019-01-10 |   130 |
+------------+-------+

In the generated columns, we’ll just repeat the most recent value.

How to do this with SQL?

For the sake of this example, I’m using Oracle SQL, as the OP was expecting to do this with Oracle. The idea is to do this in two steps:

  1. Generate all the dates between the first and the last data points
  2. For each date, find either the current data point, or the most recent one

But first, let’s create the data:

create table t (value_date, value) as
  select date '2019-01-01', 100 from dual union all
  select date '2019-01-02', 120 from dual union all
  select date '2019-01-05', 125 from dual union all
  select date '2019-01-06', 128 from dual union all
  select date '2019-01-10', 130 from dual;

1. Generating all the dates

In Oracle, we can use the convenient CONNECT BY syntax for this. We could also use some other tool to generate dates to fill the gaps, including SQL standard recursion using WITH, or some PIPELINED function, but I like CONNECT BY for this purpose.

We’ll write:

select (
  select min(t.value_date) 
  from t
) + level - 1 as value_date
from dual
connect by level <= (
  select max(t.value_date) - min(t.value_date) + 1
  from t
)

This produces:

VALUE_DATE|
----------|
2019-01-01|
2019-01-02|
2019-01-03|
2019-01-04|
2019-01-05|
2019-01-06|
2019-01-07|
2019-01-08|
2019-01-09|
2019-01-10|

Now we wrap the above query in a derived table and left join the actual data set:

select 
  d.value_date,
  t.value
from (
  select (
    select min(t.value_date) 
    from t
  ) + level - 1 as value_date
  from dual
  connect by level <= (
    select max(t.value_date) - min(t.value_date) + 1
    from t
  )
) d
left join t
on d.value_date = t.value_date
order by d.value_date;

The date gaps are now filled, but our values column is still sparse:

VALUE_DATE|VALUE|
----------|-----|
2019-01-01|  100|
2019-01-02|  120|
2019-01-03|     |
2019-01-04|     |
2019-01-05|  125|
2019-01-06|  128|
2019-01-07|     |
2019-01-08|     |
2019-01-09|     |
2019-01-10|  130|

2. Fill the value gaps

On each row, the VALUE column should either contain the actual value, or the “last_value” preceding the current row, ignoring all the nulls. Note that I specifically wrote this requirement using specific English language. We can now translate that sentence directly to SQL:

last_value (t.value) ignore nulls over (order by d.value_date)

Since we have added an ORDER BY clause to the window function, the default frame RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW applies, which colloquially means “all the preceding rows”. (Technically, that’s not accurate. It means all rows with values less than or equal to the value of the current row – see Kim Berg Hansen’s comment)

Convenient! We’re trying to find the last value in the window of all the preceding rows, ignoring the nulls.

This is standard SQL, but unfortunately not all RDBMS support IGNORE NULLS. Among the ones supported by jOOQ, currently these ones support the syntax:

  • DB2
  • H2
  • Informix
  • Oracle
  • Redshift
  • Sybase SQL Anywhere
  • Teradata

Sometimes, not the exact standard syntax is supported, but the standard feature. Use https://www.jooq.org/translate to see different syntax variants.

The full query now reads:

select 
  d.value_date,
  last_value (t.value) ignore nulls over (order by d.value_date)
from (
  select (
    select min(t.value_date) 
    from t
  ) + level - 1 as value_date
  from dual
  connect by level <= (
    select max(t.value_date) - min(t.value_date) + 1
    from t
  )
) d
left join t
on d.value_date = t.value_date
order by d.value_date;

… and it yields the desired result:

VALUE_DATE         |VALUE|
-------------------|-----|
2019-01-01 00:00:00|  100|
2019-01-02 00:00:00|  120|
2019-01-03 00:00:00|  120|
2019-01-04 00:00:00|  120|
2019-01-05 00:00:00|  125|
2019-01-06 00:00:00|  128|
2019-01-07 00:00:00|  128|
2019-01-08 00:00:00|  128|
2019-01-09 00:00:00|  128|
2019-01-10 00:00:00|  130|

Other RDBMS

This solution made use of some Oracle specific features such as CONNECT BY. In other RDBMS, the same idea can be implemented by using a different way of generating data. This article focuses only on using IGNORE NULLS. If you’re interested, feel free to post an alternative solution in the comments for your RDBMS.

The Difference Between SQL’s JOIN .. ON Clause and the Where Clause

A question that is frequently occurring among my SQL training‘s participants is:

What’s the difference between putting a predicate in the JOIN .. ON clause and the WHERE clause?

I can definitely see how that’s confusing some people, as there seems to be no difference at first sight, when running queries like these, e.g. in Oracle. I’m using the Sakila database, as always:

-- First query
SELECT a.actor_id, a.first_name, a.last_name, count(fa.film_id)
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
WHERE fa.film_id < 10
GROUP BY a.actor_id, a.first_name, a.last_name
ORDER BY count(fa.film_id) DESC;

This will yield something like:

ACTOR_ID  FIRST_NAME  LAST_NAME  COUNT
--------------------------------------
108       WARREN      NOLTE      3
162       OPRAH       KILMER     3
19        BOB         FAWCETT    2
10        CHRISTIAN   GABLE      2
53        MENA        TEMPLE     2
137       MORGAN      WILLIAMS   1
2         NICK        WAHLBERG   1

Of course, we could have written this instead, and received the same result:

-- Second query
SELECT a.actor_id, a.first_name, a.last_name, count(fa.film_id)
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
  AND fa.film_id < 10
GROUP BY a.actor_id, a.first_name, a.last_name
ORDER BY count(fa.film_id) DESC;

Now, I’ve moved the FILM_ID < 10 filter from the WHERE clause to the ON clause. But the execution plan is the same for both queries:

---------------------------------------------------------
| Id  | Operation               | Name          | Rows  |
---------------------------------------------------------
|   0 | SELECT STATEMENT        |               |    49 |
|   1 |  SORT ORDER BY          |               |    49 |
|   2 |   HASH GROUP BY         |               |    49 |
|*  3 |    HASH JOIN            |               |    49 |
|*  4 |     INDEX FAST FULL SCAN| PK_FILM_ACTOR |    49 |
|   5 |     VIEW                | VW_GBF_7      |   200 |
|   6 |      TABLE ACCESS FULL  | ACTOR         |   200 |
---------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   3 - access("ITEM_1"="FA"."ACTOR_ID")
   4 - filter("FA"."FILM_ID"<10)

It does not seem matter at all. Both queries yield the same result as well as the same plan. So…

Are ON and WHERE really the same thing?

They are when you run an inner join. But they are not when you run an outer join.

And now, let’s compare these two queries here:

-- First query
SELECT a.actor_id, a.first_name, a.last_name, count(fa.film_id)
FROM actor a
LEFT JOIN film_actor fa ON a.actor_id = fa.actor_id
WHERE fa.film_id < 10
GROUP BY a.actor_id, a.first_name, a.last_name
ORDER BY count(fa.film_id) ASC;

Yielding

ACTOR_ID  FIRST_NAME  LAST_NAME  COUNT
--------------------------------------
194       MERYL       ALLEN      1
198       MARY        KEITEL     1
30        SANDRA      PECK       1
85        MINNIE      ZELLWEGER  1
123       JULIANNE    DENCH      1

Notice that with this syntax, we’re not getting any actors that have no films with FILM_ID < 10. We should get dozens! How about this:

-- Second query
SELECT a.actor_id, a.first_name, a.last_name, count(fa.film_id)
FROM actor a
LEFT JOIN film_actor fa ON a.actor_id = fa.actor_id
  AND fa.film_id < 10
GROUP BY a.actor_id, a.first_name, a.last_name
ORDER BY count(fa.film_id) ASC;

This used to produce the same result for an (INNER) JOIN, but given the LEFT JOIN, we’re now not getting Susan Davis in the result:

ACTOR_ID  FIRST_NAME  LAST_NAME     COUNT
-----------------------------------------
3         ED          CHASE         0
4         JENNIFER    DAVIS         0
5         JOHNNY      LOLLOBRIGIDA  0
6         BETTE       NICHOLSON	    0
...
1         PENELOPE    GUINESS       1
200       THORA       TEMPLE        1
2         NICK        WAHLBERG      1
198       MARY        KEITEL        1

The plans are also different:

---------------------------------------------------------
| Id  | Operation               | Name          | Rows  |
---------------------------------------------------------
|   0 | SELECT STATEMENT        |               |    49 |
|   1 |  SORT ORDER BY          |               |    49 |
|   2 |   HASH GROUP BY         |               |    49 |
|*  3 |    HASH JOIN            |               |    49 |
|*  4 |     INDEX FAST FULL SCAN| PK_FILM_ACTOR |    49 |
|   5 |     VIEW                | VW_GBF_7      |   200 |
|   6 |      TABLE ACCESS FULL  | ACTOR         |   200 |
---------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   3 - access("ITEM_1"="FA"."ACTOR_ID")
   4 - filter("FA"."FILM_ID"<10)

No outer join here! Versus

---------------------------------------------------------------
| Id  | Operation                     | Name          | Rows  |
---------------------------------------------------------------
|   0 | SELECT STATEMENT              |               |   200 |
|   1 |  SORT ORDER BY                |               |   200 |
|   2 |   MERGE JOIN OUTER            |               |   200 |
|   3 |    TABLE ACCESS BY INDEX ROWID| ACTOR         |   200 |
|   4 |     INDEX FULL SCAN           | PK_ACTOR      |   200 |
|*  5 |    SORT JOIN                  |               |    44 |
|   6 |     VIEW                      | VW_GBC_5      |    44 |
|   7 |      HASH GROUP BY            |               |    44 |
|*  8 |       INDEX FAST FULL SCAN    | PK_FILM_ACTOR |    49 |
---------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   5 - access("A"."ACTOR_ID"="ITEM_1"(+))
       filter("A"."ACTOR_ID"="ITEM_1"(+))
   8 - filter("FILM_ID"(+)<10)

The first query did not produce an outer join operation, the second one did!

What’s the difference?

The difference is:

  • An INNER JOIN produces all the actors who played in at least one film, filtering out the actors who did not play in a film. That’s the very definition of an inner join. If we filter the films with FILM_ID < 10, that simply means we don’t want any actors without such films either.
  • A LEFT JOIN will produce all the rows from the left side of the join, regardless if there is a matching row on the right side of the join.

In both cases, the matching rows are determined by the ON clause. If two rows don’t match, then:

  • The INNER JOIN removes them both from the result
  • The LEFT JOIN retains the left row in the result

But regardless what the JOIN produces, the WHERE clause will again remove rows that do not satisfy the filter. So,

  • In the INNER JOIN case, it does not matter if we remove actors with no films, and then actors without films with FILM_ID < 10, OR if we remove actors with no films with FILM_ID < 10 directly. They’re going to be removed anyway.
  • In the LEFT JOIN case, it does matter if we retain actors with no films, and then remove actors without films with FILM_ID < 10 (in case of which actors without films will be removed again), OR if we retain actors without films with FILM_ID < 10, and then not apply any further filters.

Conclusion

For INNER JOIN, WHERE predicates and ON predicates have the same effect.

For OUTER JOIN, WHERE predicates and ON predicates have a different effect.

In general, it is always best to put a predicate where it belongs, logically. If the predicate is related to a JOIN operation, it belongs in the ON clause. If a predicate is related to a filter applied to the entire FROM clause, it belongs in the WHERE clause.

The Cost of Useless Surrogate Keys in Relationship Tables

What’s a good natural key?

This is a very difficult question for most entities when you design your schema. In some rare cases, there seems to be an “obvious” candidate, such as a variety of ISO standards, including:

But even in those cases, there might be exceptions and the worst thing that can happen is a key change. Most database designs play it safe and use surrogate keys instead. Nothing wrong with that. But…

Relationship tables

There is one exception where a surrogate key is never really required. Those are relationship tables. For example, in the Sakila database, all relationship tables lack a surrogate key and use their respective foreign keys as a compound “natural” primary key instead:

So, the FILM_ACTOR table, for example, is defined as such:

CREATE TABLE film_actor (
  actor_id int NOT NULL REFERENCES actor,
  film_id int NOT NULL REFERENCES film,

  CONSTRAINT film_actor_pkey PRIMARY KEY (actor_id, film_id)
);

There is really no point in adding another column FILM_ACTOR_ID or ID for an individual row in this table, even if a lot of ORMs and non-ORM-defined schemas will do this, simply for “consistency” reasons (and in a few cases, because they cannot handle compound keys).

Now, the presence or absence of such a surrogate key is usually not too relevant in every day work with this table. If you’re using an ORM, it will likely make no difference to client code. If you’re using SQL, it definitely doesn’t. You just never use that additional column.

But in terms of performance, it might make a huge difference!

Clustered indexes

In many RDBMS, when creating a table, you get to choose whether to use a “clustered index” or a “non clustered index” table layout. The main difference is:

Clustered index

… is a primary key index that “clusters” data together, which belongs together. In other words:

  • All the index column values are contained in the index tree structure
  • All the other column values are contained in the index leaf nodes

The benefit of this table layout is that primary key lookups can be much faster because your entire row is located in the index, which requires less disk I/O than the non clustered index for primary key lookups. The price for this is slower secondary index searches (e.g. searching for last names). The algorithmic complexities are:

  • O(log N) for primary key lookups
  • O(log N) for secondary key lookups plus O(M log N) for projections of non-secondary-key columns (quite a high price to pay)

… where

  • N is the size of the table
  • M is the number of rows that are searched in secondary keys

OLTP usage often profits from clustered indexes.

Non clustered index

… is a primary key index that resides “outside” of the table structure, which is a heap table. In other words:

  • All the index column values are contained in the index tree structure
  • All the index column values and other column values are contained in the heap table

The benefit of this table layout is that all lookups are equally fast, regardless if you’re using a primary key lookup or a secondary key search. There’s always an additional, constant time heap table lookup. The algorithmic complexities are:

  • O(log N) for primary key lookups plus O(1) for projections of non-primary-key columns (a moderate price to pay)
  • O(log N) for secondary key lookups plus O(M) for projections of non-secondary-key columns (a moderate price to pay)

OLAP usage definitely profits from heap tables.

Defaults

  • MySQL’s InnoDB offers clustered indexes only.
  • MySQL’s MyISAM offers heap tables only.
  • Oracle offers both and defaults to heap tables
  • PostgreSQL offers both and defaults to heap tables
  • SQL Server offers both and defaults to clustered indexes

Note that Oracle calls clustered indexes “index organised tables”

Performance

In this article, I’m checking MySQL’s performance as MySQL’s InnoDB doesn’t offer to switch the table layout. Curiously, the problems shown below could not be reproduced on PostgreSQL as shown by reddit user /u/ForeverAlot. Details here.

With the algorithmic complexities above, we can easily guess what I’m trying to hint at here. In the presence of a clustered index, we should avoid expensive secondary key searches when possible. Of course, these searches cannot always be avoided, but if we review the alternative design of these two tables:

CREATE TABLE film_actor_surrogate (
  id int NOT NULL,
  actor_id int NOT NULL REFERENCES actor,
  film_id int NOT NULL REFERENCES film,

  CONSTRAINT film_actor_surrogate_pkey PRIMARY KEY (id)
);

CREATE TABLE film_actor_natural (
  actor_id int NOT NULL REFERENCES actor,
  film_id int NOT NULL REFERENCES film,

  CONSTRAINT film_actor_pkey PRIMARY KEY (actor_id, film_id)
);

… we can see that if we’re using a clustered index here, the clustering will be made based on either:

  • FILM_ACTOR_SURROGATE.ID, which is a very useless clustering
  • (FILM_ACTOR_NATURAL.ACTOR_ID, FILM_ACTOR_NATURAL.FILM_ID), which is a very useful clustering

In the latter case, whenever we look up an actor’s films, we can use the clustering index as a covering index, regardless if we project anything additional from that table or not.

In the former case, we have to rely on an additional secondary key index that contains (ACTOR_ID, FILM_ID), and chances are that secondary index is not covering if we have additional projections.

The surrogate key clustering is really useless, because we never use the table this way.

Does it matter?

We can easily design a benchmark for this case. You can find the complete benchmark code here on GitHub, to validate the results on your environment. The benchmark uses this database design:

create table parent_1 (id int not null primary key);
create table parent_2 (id int not null primary key);

create table child_surrogate (
  id int auto_increment, 
  parent_1_id int not null references parent_1, 
  parent_2_id int not null references parent_2, 
  payload_1 int, 
  payload_2 int, 
  primary key (id), 
  unique (parent_1_id, parent_2_id)
) -- ENGINE = MyISAM /* uncomment to use MyISAM (heap tables) */
;

create table child_natural (
  parent_1_id int not null references parent_1, 
  parent_2_id int not null references parent_2, 
  payload_1 int, 
  payload_2 int, 
  primary key (parent_1_id, parent_2_id)
) -- ENGINE = MyISAM /* uncomment to use MyISAM (heap tables) */
;

Unlike in the Sakila database, we’re now adding some “payload” to the relationship table, which is not unlikely. Recent versions of MySQL will default to InnoDB, which only supports a clustered index layout. You can uncomment the ENGINE storage clause to see how this would perform with MyISAM, which only supports heap tables.

The benchmark adds:

  • 10 000 rows in PARENT_1
  • 100 rows in PARENT_2
  • 1 000 000 rows in both CHILD tables (just a cross join of the above)

And then, it runs 5 iterations of 10000 repetitions of the following two queries, following our standard SQL benchmark technique:

-- Query 1
SELECT c.payload_1 + c.payload_2 AS a 
FROM parent_1 AS p1 
JOIN child_surrogate AS c ON p1.id = c.parent_1_id 
WHERE p1.id = 4;

-- Query 2
SELECT c.payload_1 + c.payload_2 AS a 
FROM parent_1 AS p1 
JOIN child_natural AS c ON p1.id = c.parent_1_id 
WHERE p1.id = 4;

Notice that MySQL does not implement join elimination, otherwise, the useless join to PARENT_1 would be eliminated. The benchmark results are very clear:

Using InnoDB (clustered indexes)

Run 0, Statement 1 : 3104
Run 0, Statement 2 : 1910
Run 1, Statement 1 : 3097
Run 1, Statement 2 : 1905
Run 2, Statement 1 : 3045
Run 2, Statement 2 : 2276
Run 3, Statement 1 : 3589
Run 3, Statement 2 : 1910
Run 4, Statement 1 : 2961
Run 4, Statement 2 : 1897

Using MyISAM (heap tables)

Run 0, Statement 1 : 3473
Run 0, Statement 2 : 3288
Run 1, Statement 1 : 3328
Run 1, Statement 2 : 3341
Run 2, Statement 1 : 3674
Run 2, Statement 2 : 3307
Run 3, Statement 1 : 3373
Run 3, Statement 2 : 3275
Run 4, Statement 1 : 3298
Run 4, Statement 2 : 3322

You shouldn’t read this as a comparison between InnoDB and MyISAM in general, but as a comparison of the different table structures within the boundaries of the same engine. Very obviously, the additional search complexity of the badly clustered index in CHILD_SURROGATE causes a 50% slower query execution on this type of query, without gaining anything.

In the case of the heap table, the additional surrogate key column did not have any significant effect.

Again, the full benchmark can be found here on GitHub, if you want to repeat it.

Conclusion

Not everyone agrees what is generally better: clustered or non clustered indexes. Not everyone agrees on the utility of surrogate keys on every table. These are both quite opinionated discussions.

But this article clearly showed that on relationship tables, which have a very clear candidate key, namely the set of outgoing foreign keys that defines the many-to-many relationship, the surrogate key not only doesn’t add value, but it actively hurts your performance on a set of queries when your table is using a clustered index.

MySQL’s InnoDB and SQL Server use clustered indexes by default, so if you’re using any of those RDBMS, do check if you have room for significant improvement by dropping your surrogate keys.

Calculating Weighted Averages When Joining Tables in SQL

I stumbled upon a very interesting jOOQ question on Stack Overflow that required the calculation of a weighted average. Why is that.

Problem description

Assuming you have this database (using PostgreSQL syntax):

create table transactions (
  id     bigint         not null primary key,
  lines  bigint         not null,
  price  numeric(18, 2) not null,
  profit numeric(18, 2) not null
);

create table lines (
  id             bigint         not null primary key,
  transaction_id bigint         not null references transactions,
  total          bigint         not null,
  quantity       bigint         not null,
  profit         numeric(18, 2) not null
);

As can be seen, this schema is slightly denormalised as the number of lines per transaction are precalculated in the transactions.lines column. This will turn out to be quite useful for this calculation, but it isn’t strictly necessary.

Now, in the previously linked Stack Overflow question, a report was desired that would calculate:

  • An aggregation of sums as provided by the line items
  • An aggregation of averages as provided by the transactions

This would be straightforward with two separate queries:

Sums provided by the line items

SELECT
  sum(profit)   AS total_profit,
  sum(total)    AS total_sales_amount,
  sum(quantity) AS total_items_sold
FROM lines

Averages provided by the transactions

SELECT
  avg(lines)  AS avg_items_p_trx,
  avg(price)  AS avg_price_p_trx,
  avg(profit) AS avg_profit_p_trx
FROM transactions

So far so good.

Doing it in one query

Now, these queries are simplified from the original, which needed to join the two tables in order to add additional predicates. Also, let’s assume that these tables are quite large, so running two queries might lead to the report being too slow. A single query would be much better.

We might be attempted to simply combined the two:

-- Wrong query
SELECT
  sum(l.profit)   AS total_profit,
  sum(l.total)    AS total_sales_amount,
  sum(l.quantity) AS total_items_sold,
  avg(t.lines)    AS avg_items_p_trx,
  avg(t.price)    AS avg_price_p_trx,
  avg(t.profit)   AS avg_profit_p_trx
FROM lines AS l
JOIN transactions AS t ON t.id = l.transaction_id

But this query is wrong. While the sums are still correct, the averages are not, simply because the join produces duplicate transaction rows per lines. Imagine a transaction having 3 or 5 lines:

SELECT
  l.id    AS line_id,
  t.id    AS transaction_id,
  t.lines,
  t.price
FROM lines AS l
JOIN transactions AS t ON t.id = l.transaction_id

The output would be:

LINE_ID    TRANSACTION_ID    LINES    PRICE
-------------------------------------------
1          1                 3        20.00
2          1                 3        20.00
3          1                 3        20.00
4          2                 5       100.00
4          2                 5       100.00
4          2                 5       100.00
4          2                 5       100.00
4          2                 5       100.00
  • The average number of lines “avg_items_p_trx” should be 4 = (3 lines + 5 lines) / 2 transactions. But if we calculate avg(t.lines) over the entire data set, we get 4.25 (3×3 lines + 5×5 lines) / 8 items.
  • The average price “avg_price_p_trx” should be 60.00 = (20.00 + 100.00) / 2 transactions. But if we calculate avg(t.price) over the entire data set, we get 80.00 (3×20.00 + 5×100.00) / 8 items.

How can this be fixed?

Given that each transaction is duplicated because of the join with lines, we have to calculate a weighted average, not an ordinary average. The idea is that instead of using the AVG() aggregate function, we now have to divide the value we want to get an average of by the number of items (i.e. the number of times the value is repeated because of the join), and then divide the sum of that division by the number of transactions.

Prose never describes logic well, so let’s use code. The correct query is:

SELECT
  sum(l.profit)   AS total_profit,
  sum(l.total)    AS total_sales_amount,
  sum(l.quantity) AS total_items_sold,
  sum(t.lines  / t.lines) / count(DISTINCT t.id) avg_items_p_trx,
  sum(t.price  / t.lines) / count(DISTINCT t.id) avg_price_p_trx,
  sum(t.profit / t.lines) / count(DISTINCT t.id) avg_profit_p_trx
FROM lines AS l
JOIN transactions AS t ON t.id = l.transaction_id

With the above data set:

LINE_ID  TRANSACTION_ID  LINES  LINES/LINES   PRICE  PRICE/LINES
----------------------------------------------------------------
1        1               3      1             20.00         6.66
2        1               3      1             20.00         6.66
3        1               3      1             20.00         6.66
4        2               5      1            100.00        20.00
4        2               5      1            100.00        20.00
4        2               5      1            100.00        20.00
4        2               5      1            100.00        20.00
4        2               5      1            100.00        20.00

We now get the correct weighted averages:

  • The average number of lines “avg_items_p_trx” is now 4 =
    (3/3 + 3/3 + 3/3 + 5/5 + 5/5 + 5/5 + 5/5 + 5/5) / distinct transactions
  • The average price “avg_price_p_trx” is now 60.00 =
    (20.00/3 + 20.00/3 + 20.00/3 + 100.00/5 + 100.00/5 + 100.00/5 + 100.00/5 + 100.00/5) / 2 distinct transactions

Note that “avg_items_p_trx” can be simplified:

SELECT
  sum(l.profit)   AS total_profit,
  sum(l.total)    AS total_sales_amount,
  sum(l.quantity) AS total_items_sold,
  count(*)                / count(DISTINCT t.id) avg_items_p_trx,
  sum(t.price  / t.lines) / count(DISTINCT t.id) avg_price_p_trx,
  sum(t.profit / t.lines) / count(DISTINCT t.id) avg_profit_p_trx
FROM lines AS l
JOIN transactions AS t ON t.id = l.transaction_id

Done!

Normalised version

Notice that this solution profited from the fact that the number of lines per transaction was pre-calculated. We can of course also calculate it on the fly, e.g. using window functions. If it weren’t available, we could do it like this:

SELECT
  sum(l.profit)   AS total_profit,
  sum(l.total)    AS total_sales_amount,
  sum(l.quantity) AS total_items_sold,
  count(*)                / count(DISTINCT t.id) avg_items_p_trx,
  sum(t.price  / l.lines) / count(DISTINCT t.id) avg_price_p_trx,
  sum(t.profit / l.lines) / count(DISTINCT t.id) avg_profit_p_trx
FROM (
  SELECT 
    l.*,
    count(*) OVER (PARTITION BY l.transaction_id) lines
  FROM lines AS l
) AS l
JOIN transactions AS t ON t.id = l.transaction_id

Or, we turn the entire join into a 1:1 relationship by pre-aggregating all the data from lines into one row per transaction. This works because we only calculate sums from the lines table:

SELECT
  sum(l.profit_per_transaction)   AS total_profit,
  sum(l.total_per_transaction)    AS total_sales_amount,
  sum(l.quantity_per_transaction) AS total_items_sold,
  avg(l.lines_per_transaction)    AS avg_items_p_trx,
  avg(t.price)                    AS avg_price_p_trx,
  avg(t.profit)                   AS avg_profit_p_trx
FROM (
  SELECT 
    l.transaction_id
    sum(l.profit)   AS profit_per_transaction,
    sum(l.total)    AS total_per_transaction,
    sum(l.quantity) AS quantity_per_transaction,
    count(*)        AS lines_per_transaction
  FROM lines AS l
  GROUP BY l.transaction_id
) AS l
JOIN transactions AS t ON t.id = l.transaction_id

How to Calculate a Cumulative Percentage in SQL

A fun report to write is to calculate a cumulative percentage. For example, when querying the Sakila database, we might want to calculate the percentage of our total revenue at any given date.

The result might look like this:

Notice the beautifully generated data. Or as raw data:

payment_date |amount  |percentage
-------------|--------|----------
2005-05-24   |29.92   |0.04      
2005-05-25   |573.63  |0.90      
2005-05-26   |754.26  |2.01      
2005-05-27   |685.33  |3.03      
2005-05-28   |804.04  |4.22      
2005-05-29   |648.46  |5.19      
2005-05-30   |628.42  |6.12      
2005-05-31   |700.37  |7.16      
...
2005-08-18   |2710.79 |79.59     
2005-08-19   |2615.72 |83.47     
2005-08-20   |2723.76 |87.51     
2005-08-21   |2809.41 |91.67     
2005-08-22   |2576.74 |95.49     
2005-08-23   |2523.01 |99.24     
2005-08-24   |514.18  |100.00    

In other words, at the beginning of our timeline, we’ve made 0% revenue, and then that percentage increases over time, until we reach 100% of our revenue at the end of our timeline.

How to do it?

We’re going to do it in two steps. Our PAYMENT table has a PAYMENT_DATE column, which is really a timestamp, i.e. the exact amount in time when we received a payment. We can query the table to see its data (I will be using PostgreSQL syntax in this post):

SELECT
  payment_date,
  amount
FROM payment
ORDER BY payment_date;

This yields:

payment_date        |amount
--------------------|------
2005-05-24 22:53:30 |2.99  
2005-05-24 22:54:33 |2.99  
2005-05-24 23:03:39 |3.99  
2005-05-24 23:04:41 |4.99  
2005-05-24 23:05:21 |6.99  
2005-05-24 23:08:07 |0.99  
2005-05-24 23:11:53 |1.99  
2005-05-24 23:31:46 |4.99  
2005-05-25 00:00:40 |4.99  
2005-05-25 00:02:21 |5.99  
2005-05-25 00:09:02 |8.99  
2005-05-25 00:19:27 |4.99  
2005-05-25 00:22:55 |6.99  
...

Now we could calculate that percentage on this timeline, but that wouldn’t be terribly interesting. We’re interested in the cumulative revenue per date, so let’s run a classic GROUP BY:

SELECT 
  CAST(payment_date AS DATE),
  sum(amount) AS amount
FROM payment
GROUP BY CAST(payment_date AS DATE)
ORDER BY CAST(payment_date AS DATE);

This yields the first two columns of our desired result:

payment_date |amount 
-------------|-------
2005-05-24   |29.92  
2005-05-25   |573.63 
2005-05-26   |754.26 
2005-05-27   |685.33 
2005-05-28   |804.04 
2005-05-29   |648.46 
2005-05-30   |628.42 
2005-05-31   |700.37 
...
2005-08-18   |2710.79
2005-08-19   |2615.72
2005-08-20   |2723.76
2005-08-21   |2809.41
2005-08-22   |2576.74
2005-08-23   |2523.01
2005-08-24   |514.18 

Now about that percentage. The formula in pseudo SQL is this:

cumulative_percentage[N] = SUM(amount[M <= N]) / SUM(amount[any])

In other words, the percentage of the revenue we’ve made up until a given day is equal to the SUM of all amounts until that day divided by the SUM of all amounts. We could do that relatively easily in Microsoft Excel. But we can also do it with SQL, using window functions. The syntax is:

-- Sum of all amounts until that day:
SUM(amount) OVER (ORDER BY payment_date)

-- Sum of all amounts
SUM(amount) OVER ()

So, let’s just plug that into our SQL. For simplicity, we’ll first nest our previous GROUP BY statement in a derived table:

SELECT 
  payment_date,
  amount,
  CAST(100 * sum(amount) OVER (ORDER BY payment_date) 
           / sum(amount) OVER () AS numeric(10, 2)) percentage
FROM (
  SELECT 
    CAST(payment_date AS DATE),
    sum(amount) AS amount
  FROM payment
  GROUP BY CAST(payment_date AS DATE)
) p
ORDER BY payment_date;

Running this yields the desired result:

payment_date |amount  |percentage
-------------|--------|----------
2005-05-24   |29.92   |0.04      
2005-05-25   |573.63  |0.90      
2005-05-26   |754.26  |2.01      
2005-05-27   |685.33  |3.03      
2005-05-28   |804.04  |4.22      
2005-05-29   |648.46  |5.19      
2005-05-30   |628.42  |6.12      
2005-05-31   |700.37  |7.16      
...
2005-08-18   |2710.79 |79.59     
2005-08-19   |2615.72 |83.47     
2005-08-20   |2723.76 |87.51     
2005-08-21   |2809.41 |91.67     
2005-08-22   |2576.74 |95.49     
2005-08-23   |2523.01 |99.24     
2005-08-24   |514.18  |100.00    

Bonus: Nest aggregate functions in window functions

Because of the nature of SQL syntax, and the fact that both GROUP BY and aggregate functions “happen before” window functions, i.e. they are calculated logically before window functions, we can nest aggregate functions in window functions.

This definitely doesn’t drastically improve readability, especially if you are not used to writing window functions every day. But in some more complex cases, it might help to shorten your SQL syntax. The above query is equivalent to this one:

SELECT 
  CAST(payment_date AS DATE) AS payment_date,
  sum(amount) AS amount,
  CAST(100 * sum(sum(amount)) OVER (
               ORDER BY CAST(payment_date AS DATE)) 
           / sum(sum(amount)) OVER () AS numeric(10, 2)) percentage
FROM payment
GROUP BY CAST(payment_date AS DATE)
ORDER BY CAST(payment_date AS DATE);

Beauty is in the eye of the beholder. My eye definitely likes this sum(sum(amount)) OVER () syntax. If you cannot decipher this, don’t worry. You’re not alone. I invite you to review the following post on the order of SQL operations, first.

How to Emulate PERCENTILE_DISC in MySQL and Other RDBMS

In my previous article, I showed what the very useful percentile functions (also known as inverse distribution functions) can be used for.

Unfortunately, these functions are not ubiquitously available in SQL dialects. As of jOOQ 3.11, they are known to work in these dialects:

Dialect As aggregate function As window function
MariaDB 10.3.3 No Yes
Oracle 18c Yes Yes
PostgreSQL 11 Yes No
SQL Server 2017 No Yes
Teradata 16 Yes No

Oracle has the most sophisticated implementation, which supports both the ordered set aggregate function, and the window function version:

  • Aggregate function: PERCENTILE_DISC (0.5) WITHIN GROUP (ORDER BY x)
  • Window function: PERCENTILE_DISC (0.5) WITHIN GROUP (ORDER BY x) OVER (PARTITION BY y)

Workarounds if the feature is unavailable

Luckily, as soon as an RDBMS supports window functions, we can easily emulate PERCENTILE_DISC using PERCENT_RANK and FIRST_VALUE as follows. We’re using the Sakila database in this example.

Emulating window functions

Let’s emulate these first, as it requires a bit less SQL transformations. This query works out of the box in Oracle:

SELECT DISTINCT
  rating,
  percentile_disc(0.5) 
    WITHIN GROUP (ORDER BY length) 
    OVER() x1,
  percentile_disc(0.5) 
    WITHIN GROUP (ORDER BY length) 
    OVER (PARTITION BY rating) x2
FROM film
ORDER BY rating;

Yielding

RATING  X1      X2
-------------------
G       114     107
NC-17   114     112
PG      114     113
PG-13   114     125
R       114     115

What we can read from this is that the median length of all films is 114 minutes, and the median lengths of films per rating range from 107 minutes to 125 minutes. I’ve used DISTINCT because we don’t care about visualising these values on a per-row basis in this case. This also works in SQL Server.

Now, let’s assume we’re using PostgreSQL, which doesn’t support inverse distribution window functions, or MySQL, which doesn’t support inverse distribution functions at all, but both support PERCENT_RANK and FIRST_VALUE. Here’s the complete query:

SELECT DISTINCT
  rating,
  first_value(length) OVER (
    ORDER BY CASE WHEN p1 <= 0.5 THEN p1 END DESC NULLS LAST) x1,
  first_value(length) OVER (
    PARTITION BY rating 
    ORDER BY CASE WHEN p2 <= 0.5 THEN p2 END DESC NULLS LAST) x2
FROM (
  SELECT
    rating,
    length,
    percent_rank() OVER (ORDER BY length) p1,
    percent_rank() OVER (PARTITION BY rating ORDER BY length) p2
  FROM film
) t
ORDER BY rating;

So, we’re doing this in two steps (visual example further down):

  1. PERCENT_RANK: In a derived table, we’re calculating the PERCENT_RANK value, which attributes a rank to each row ordered by length, going from 0 to 1. This makes sense. When looking for the median value, we’re really looking for the value whose PERCENT_RANK is 0.5 or less. When looking for the 90% percentile, we’re looking for the value whose PERCENT_RANK is 0.9 or less
  2. FIRST_VALUE: Once we’ve found the PERCENT_RANK, we’re not quite done yet. We need to find the last row whose PERCENT_RANK is less or equal to the percentile we’re interested in. I could have used LAST_VALUE, but then I would have needed to resort to using the quite verbose range clause of window functions. Instead, I when ordering the rows by PERCENT_RANK (p1 or p2), I translated all ranks higher than the percentile I’m looking for into NULL using a CASE expression, and then I made sure using NULLS LAST that the percentile I’m looking for will be the first row in the FIRST_VALUE function’s window specification. Easy!

To visualise this, let’s run these queries, which also project the p1 and p2 values respectively:

SELECT
  length,
  CASE WHEN p1 <= 0.5 THEN p1 END::numeric(3,2) p1,
  first_value(length) OVER (
    ORDER BY CASE WHEN p1 <= 0.5 THEN p1 END DESC NULLS LAST) x1
FROM (
  SELECT
    length,
    percent_rank() OVER (ORDER BY length) p1
  FROM film
) t
ORDER BY length;

The result is

length |p1   |x1  |
-------|-----|----|
46     |0.00 |114 |
46     |0.00 |114 |
46     |0.00 |114 |
46     |0.00 |114 |
46     |0.00 |114 |
47     |0.01 |114 |
...
113    |0.49 |114 |
114    |0.49 |114 |
114    |0.49 |114 |
114    |0.49 |114 |
114    |0.49 |114 |
114    |0.49 |114 |
114    |0.49 |114 |
114    |0.49 |114 |
114    |0.49 |114 |
114    |0.49 |114 |
114    |0.49 |114 | <-- Last row whose PERCENT_RANK is <= 0.5
115    |     |114 |
115    |     |114 |
115    |     |114 |
115    |     |114 |
115    |     |114 |
115    |     |114 |
...
185    |     |114 |
185    |     |114 |
185    |     |114 |

So the FIRST_VALUE function just searches for that first row (descendingly, i.e. bottom up) whose p1 value is non-null.

The same for p2:

SELECT 
  length,
  rating,
  CASE WHEN p2 <= 0.5 THEN p2 END::numeric(3,2) p2,
  first_value(length) OVER (
    PARTITION BY rating 
    ORDER BY CASE WHEN p2 <= 0.5 THEN p2 END DESC NULLS LAST) x2
FROM (
  SELECT
    rating,
    length,
    percent_rank() OVER (PARTITION BY rating ORDER BY length) p2
  FROM film
) t
ORDER BY rating, length;

Yielding:

length |rating |p2   |x2  |
-------|-------|-----|----|
47     |G      |0.00 |107 |
47     |G      |0.00 |107 |
48     |G      |0.01 |107 |
48     |G      |0.01 |107 |
...
105    |G      |0.47 |107 |
106    |G      |0.49 |107 |
107    |G      |0.49 |107 |
107    |G      |0.49 |107 | <-- Last row in G partition whose
108    |G      |     |107 |     PERCENT_RANK is <= 0.5
108    |G      |     |107 |
109    |G      |     |107 |
...
185    |G      |     |107 |
185    |G      |     |107 |
46     |PG     |0.00 |113 |
47     |PG     |0.01 |113 |
47     |PG     |0.01 |113 |
...
111    |PG     |0.49 |113 |
113    |PG     |0.49 |113 |
113    |PG     |0.49 |113 | <-- Last row in PG partition whose
114    |PG     |     |113 |     PERCENT_RANK is <= 0.5
114    |PG     |     |113 |
...

Perfect! Notice if your RDBMS doesn’t support the NULLS LAST clause in your ORDER BY clause (e.g. MySQL), you might either hope that it defaults to sorting NULLS LAST (MySQL does), or you can emulate it as such:

-- This
ORDER BY x NULLS LAST

-- Is the same as this
ORDER BY
  CASE WHEN x IS NULL THEN 1 ELSE 0 END,
  x

Emulating aggregate functions

If you’re using SQL Server and want aggregate function behaviour, I recommend using the window function instead and emulate aggregation using DISTINCT. It will probably be easier than the emulation below. Do check for performance though!

When you’re using e.g. MySQL, which doesn’t have inverse distribution function support at all, then this chapter is for you.

Here’s how to use the aggregate function version in Oracle:

-- Without GROUP BY
SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY length) x1
FROM film;

-- With GROUP BY
SELECT
  rating,
  percentile_disc(0.5) WITHIN GROUP (ORDER BY length) x2
FROM film
GROUP BY rating
ORDER BY rating;

Trivial! The result is the same as before:

X1
---
114


RATING  X2
-----------
G       107
NC-17   112
PG      113
PG-13   125
R       115

Now, let’s emulate these on e.g. MySQL, using window functions.

-- Without GROUP BY
SELECT
  MAX(x1) x1
FROM (
  SELECT first_value(length) OVER (
    ORDER BY CASE WHEN p1 <= 0.5 THEN p1 END DESC NULLS LAST) x1
  FROM (
    SELECT
      length,
      percent_rank() OVER (ORDER BY length) p1
    FROM film
  ) t
) t;

It’s exactly the same technique as before, except we now have to turn the window function behaviour (don’t group, preserve rows, repeat aggregation value on each row) back into aggregate function behaviour (group, collapse rows) by using an aggregate function, such as MAX(). This is the same as what I did before with DISTINCT, for illustration purposes.

-- With GROUP BY
SELECT
  rating,
  MAX(x2) x2
FROM (
  SELECT
    rating,
    first_value(length) OVER (
      PARTITION BY rating 
      ORDER BY CASE WHEN p2 <= 0.5 THEN p2 END DESC NULLS LAST) x2
  FROM (
    SELECT
      rating,
      length,
      percent_rank() OVER (
        PARTITION BY rating 
        ORDER BY length) p2
    FROM film
  ) t
) t
GROUP BY rating
ORDER BY rating;

All we’re really doing (again) is translate the GROUP BY expression to a PARTITION BY expression in the window function, and then redo the previous exercise.

Conclusion

Window functions are extremely powerful. They can be used and combined to calculate a variety of other aggregations. With the above approach, we can calculate the PERCENTILE_DISC inverse distribution function, which is not readily available in most RDBMS using a more verbose but equally powerful approach that uses PERCENT_RANK and FIRST_VALUE in all RDBMS that support window functions. A similar exercise could be made with PERCENTILE_CONT with a slightly more tricky approach to finding that FIRST_VALUE, which I’ll leave as an exercise to the reader.

A future jOOQ version might emulate this for you, automatically.

Liked this article? You may also like 10 SQL Tricks That You Didn’t Think Were Possible.