Calculating Weighted Averages When Joining Tables in SQL

I stumbled upon a very interesting jOOQ question on Stack Overflow that required the calculation of a weighted average. Why is that.

Problem description

Assuming you have this database (using PostgreSQL syntax):

create table transactions (
  id     bigint         not null primary key,
  lines  bigint         not null,
  price  numeric(18, 2) not null,
  profit numeric(18, 2) not null
);

create table lines (
  id             bigint         not null primary key,
  transaction_id bigint         not null references transactions,
  total          bigint         not null,
  quantity       bigint         not null,
  profit         numeric(18, 2) not null
);

As can be seen, this schema is slightly denormalised as the number of lines per transaction are precalculated in the transactions.lines column. This will turn out to be quite useful for this calculation, but it isn’t strictly necessary.

Now, in the previously linked Stack Overflow question, a report was desired that would calculate:

  • An aggregation of sums as provided by the line items
  • An aggregation of averages as provided by the transactions

This would be straightforward with two separate queries:

Sums provided by the line items

SELECT
  sum(profit)   AS total_profit,
  sum(total)    AS total_sales_amount,
  sum(quantity) AS total_items_sold
FROM lines

Averages provided by the transactions

SELECT
  avg(lines)  AS avg_items_p_trx,
  avg(price)  AS avg_price_p_trx,
  avg(profit) AS avg_profit_p_trx
FROM transactions

So far so good.

Doing it in one query

Now, these queries are simplified from the original, which needed to join the two tables in order to add additional predicates. Also, let’s assume that these tables are quite large, so running two queries might lead to the report being too slow. A single query would be much better.

We might be attempted to simply combined the two:

-- Wrong query
SELECT
  sum(l.profit)   AS total_profit,
  sum(l.total)    AS total_sales_amount,
  sum(l.quantity) AS total_items_sold,
  avg(t.lines)    AS avg_items_p_trx,
  avg(t.price)    AS avg_price_p_trx,
  avg(t.profit)   AS avg_profit_p_trx
FROM lines AS l
JOIN transactions AS t ON t.id = l.transaction_id

But this query is wrong. While the sums are still correct, the averages are not, simply because the join produces duplicate transaction rows per lines. Imagine a transaction having 3 or 5 lines:

SELECT
  l.id    AS line_id,
  t.id    AS transaction_id,
  t.lines,
  t.price
FROM lines AS l
JOIN transactions AS t ON t.id = l.transaction_id

The output would be:

LINE_ID    TRANSACTION_ID    LINES    PRICE
-------------------------------------------
1          1                 3        20.00
2          1                 3        20.00
3          1                 3        20.00
4          2                 5       100.00
4          2                 5       100.00
4          2                 5       100.00
4          2                 5       100.00
4          2                 5       100.00
  • The average number of lines “avg_items_p_trx” should be 4 = (3 lines + 5 lines) / 2 transactions. But if we calculate avg(t.lines) over the entire data set, we get 4.25 (3×3 lines + 5×5 lines) / 8 items.
  • The average price “avg_price_p_trx” should be 60.00 = (20.00 + 100.00) / 2 transactions. But if we calculate avg(t.price) over the entire data set, we get 80.00 (3×20.00 + 5×100.00) / 8 items.

How can this be fixed?

Given that each transaction is duplicated because of the join with lines, we have to calculate a weighted average, not an ordinary average. The idea is that instead of using the AVG() aggregate function, we now have to divide the value we want to get an average of by the number of items (i.e. the number of times the value is repeated because of the join), and then divide the sum of that division by the number of transactions.

Prose never describes logic well, so let’s use code. The correct query is:

SELECT
  sum(l.profit)   AS total_profit,
  sum(l.total)    AS total_sales_amount,
  sum(l.quantity) AS total_items_sold,
  sum(t.lines  / t.lines) / count(DISTINCT t.id) avg_items_p_trx,
  sum(t.price  / t.lines) / count(DISTINCT t.id) avg_price_p_trx,
  sum(t.profit / t.lines) / count(DISTINCT t.id) avg_profit_p_trx
FROM lines AS l
JOIN transactions AS t ON t.id = l.transaction_id

With the above data set:

LINE_ID  TRANSACTION_ID  LINES  LINES/LINES   PRICE  PRICE/LINES
----------------------------------------------------------------
1        1               3      1             20.00         6.66
2        1               3      1             20.00         6.66
3        1               3      1             20.00         6.66
4        2               5      1            100.00        20.00
4        2               5      1            100.00        20.00
4        2               5      1            100.00        20.00
4        2               5      1            100.00        20.00
4        2               5      1            100.00        20.00

We now get the correct weighted averages:

  • The average number of lines “avg_items_p_trx” is now 4 =
    (3/3 + 3/3 + 3/3 + 5/5 + 5/5 + 5/5 + 5/5 + 5/5) / distinct transactions
  • The average price “avg_price_p_trx” is now 60.00 =
    (20.00/3 + 20.00/3 + 20.00/3 + 100.00/5 + 100.00/5 + 100.00/5 + 100.00/5 + 100.00/5) / 2 distinct transactions

Note that “avg_items_p_trx” can be simplified:

SELECT
  sum(l.profit)   AS total_profit,
  sum(l.total)    AS total_sales_amount,
  sum(l.quantity) AS total_items_sold,
  count(*)                / count(DISTINCT t.id) avg_items_p_trx,
  sum(t.price  / t.lines) / count(DISTINCT t.id) avg_price_p_trx,
  sum(t.profit / t.lines) / count(DISTINCT t.id) avg_profit_p_trx
FROM lines AS l
JOIN transactions AS t ON t.id = l.transaction_id

Done!

Normalised version

Notice that this solution profited from the fact that the number of lines per transaction was pre-calculated. We can of course also calculate it on the fly, e.g. using window functions. If it weren’t available, we could do it like this:

SELECT
  sum(l.profit)   AS total_profit,
  sum(l.total)    AS total_sales_amount,
  sum(l.quantity) AS total_items_sold,
  count(*)                / count(DISTINCT t.id) avg_items_p_trx,
  sum(t.price  / l.lines) / count(DISTINCT t.id) avg_price_p_trx,
  sum(t.profit / l.lines) / count(DISTINCT t.id) avg_profit_p_trx
FROM (
  SELECT 
    l.*,
    count(*) OVER (PARTITION BY l.transaction_id) lines
  FROM lines AS l
) AS l
JOIN transactions AS t ON t.id = l.transaction_id

Or, we turn the entire join into a 1:1 relationship by pre-aggregating all the data from lines into one row per transaction. This works because we only calculate sums from the lines table:

SELECT
  sum(l.profit_per_transaction)   AS total_profit,
  sum(l.total_per_transaction)    AS total_sales_amount,
  sum(l.quantity_per_transaction) AS total_items_sold,
  avg(l.lines_per_transaction)    AS avg_items_p_trx,
  avg(t.price)                    AS avg_price_p_trx,
  avg(t.profit)                   AS avg_profit_p_trx
FROM (
  SELECT 
    l.transaction_id
    sum(l.profit)   AS profit_per_transaction,
    sum(l.total)    AS total_per_transaction,
    sum(l.quantity) AS quantity_per_transaction,
    count(*)        AS lines_per_transaction
  FROM lines AS l
  GROUP BY l.transaction_id
) AS l
JOIN transactions AS t ON t.id = l.transaction_id

How to Calculate a Cumulative Percentage in SQL

A fun report to write is to calculate a cumulative percentage. For example, when querying the Sakila database, we might want to calculate the percentage of our total revenue at any given date.

The result might look like this:

Notice the beautifully generated data. Or as raw data:

payment_date |amount  |percentage
-------------|--------|----------
2005-05-24   |29.92   |0.04      
2005-05-25   |573.63  |0.90      
2005-05-26   |754.26  |2.01      
2005-05-27   |685.33  |3.03      
2005-05-28   |804.04  |4.22      
2005-05-29   |648.46  |5.19      
2005-05-30   |628.42  |6.12      
2005-05-31   |700.37  |7.16      
...
2005-08-18   |2710.79 |79.59     
2005-08-19   |2615.72 |83.47     
2005-08-20   |2723.76 |87.51     
2005-08-21   |2809.41 |91.67     
2005-08-22   |2576.74 |95.49     
2005-08-23   |2523.01 |99.24     
2005-08-24   |514.18  |100.00    

In other words, at the beginning of our timeline, we’ve made 0% revenue, and then that percentage increases over time, until we reach 100% of our revenue at the end of our timeline.

How to do it?

We’re going to do it in two steps. Our PAYMENT table has a PAYMENT_DATE column, which is really a timestamp, i.e. the exact amount in time when we received a payment. We can query the table to see its data (I will be using PostgreSQL syntax in this post):

SELECT
  payment_date,
  amount
FROM payment
ORDER BY payment_date;

This yields:

payment_date        |amount
--------------------|------
2005-05-24 22:53:30 |2.99  
2005-05-24 22:54:33 |2.99  
2005-05-24 23:03:39 |3.99  
2005-05-24 23:04:41 |4.99  
2005-05-24 23:05:21 |6.99  
2005-05-24 23:08:07 |0.99  
2005-05-24 23:11:53 |1.99  
2005-05-24 23:31:46 |4.99  
2005-05-25 00:00:40 |4.99  
2005-05-25 00:02:21 |5.99  
2005-05-25 00:09:02 |8.99  
2005-05-25 00:19:27 |4.99  
2005-05-25 00:22:55 |6.99  
...

Now we could calculate that percentage on this timeline, but that wouldn’t be terribly interesting. We’re interested in the cumulative revenue per date, so let’s run a classic GROUP BY:

SELECT 
  CAST(payment_date AS DATE),
  sum(amount) AS amount
FROM payment
GROUP BY CAST(payment_date AS DATE)
ORDER BY CAST(payment_date AS DATE);

This yields the first two columns of our desired result:

payment_date |amount 
-------------|-------
2005-05-24   |29.92  
2005-05-25   |573.63 
2005-05-26   |754.26 
2005-05-27   |685.33 
2005-05-28   |804.04 
2005-05-29   |648.46 
2005-05-30   |628.42 
2005-05-31   |700.37 
...
2005-08-18   |2710.79
2005-08-19   |2615.72
2005-08-20   |2723.76
2005-08-21   |2809.41
2005-08-22   |2576.74
2005-08-23   |2523.01
2005-08-24   |514.18 

Now about that percentage. The formula in pseudo SQL is this:

cumulative_percentage[N] = SUM(amount[M <= N]) / SUM(amount[any])

In other words, the percentage of the revenue we’ve made up until a given day is equal to the SUM of all amounts until that day divided by the SUM of all amounts. We could do that relatively easily in Microsoft Excel. But we can also do it with SQL, using window functions. The syntax is:

-- Sum of all amounts until that day:
SUM(amount) OVER (ORDER BY payment_date)

-- Sum of all amounts
SUM(amount) OVER ()

So, let’s just plug that into our SQL. For simplicity, we’ll first nest our previous GROUP BY statement in a derived table:

SELECT 
  payment_date,
  amount,
  CAST(100 * sum(amount) OVER (ORDER BY payment_date) 
           / sum(amount) OVER () AS numeric(10, 2)) percentage
FROM (
  SELECT 
    CAST(payment_date AS DATE),
    sum(amount) AS amount
  FROM payment
  GROUP BY CAST(payment_date AS DATE)
) p
ORDER BY payment_date;

Running this yields the desired result:

payment_date |amount  |percentage
-------------|--------|----------
2005-05-24   |29.92   |0.04      
2005-05-25   |573.63  |0.90      
2005-05-26   |754.26  |2.01      
2005-05-27   |685.33  |3.03      
2005-05-28   |804.04  |4.22      
2005-05-29   |648.46  |5.19      
2005-05-30   |628.42  |6.12      
2005-05-31   |700.37  |7.16      
...
2005-08-18   |2710.79 |79.59     
2005-08-19   |2615.72 |83.47     
2005-08-20   |2723.76 |87.51     
2005-08-21   |2809.41 |91.67     
2005-08-22   |2576.74 |95.49     
2005-08-23   |2523.01 |99.24     
2005-08-24   |514.18  |100.00    

Bonus: Nest aggregate functions in window functions

Because of the nature of SQL syntax, and the fact that both GROUP BY and aggregate functions “happen before” window functions, i.e. they are calculated logically before window functions, we can nest aggregate functions in window functions.

This definitely doesn’t drastically improve readability, especially if you are not used to writing window functions every day. But in some more complex cases, it might help to shorten your SQL syntax. The above query is equivalent to this one:

SELECT 
  CAST(payment_date AS DATE) AS payment_date,
  sum(amount) AS amount,
  CAST(100 * sum(sum(amount)) OVER (
               ORDER BY CAST(payment_date AS DATE)) 
           / sum(sum(amount)) OVER () AS numeric(10, 2)) percentage
FROM payment
GROUP BY CAST(payment_date AS DATE)
ORDER BY CAST(payment_date AS DATE);

Beauty is in the eye of the beholder. My eye definitely likes this sum(sum(amount)) OVER () syntax. If you cannot decipher this, don’t worry. You’re not alone. I invite you to review the following post on the order of SQL operations, first.

How to Emulate PERCENTILE_DISC in MySQL and Other RDBMS

In my previous article, I showed what the very useful percentile functions (also known as inverse distribution functions) can be used for.

Unfortunately, these functions are not ubiquitously available in SQL dialects. As of jOOQ 3.11, they are known to work in these dialects:

Dialect As aggregate function As window function
MariaDB 10.3.3 No Yes
Oracle 18c Yes Yes
PostgreSQL 11 Yes No
SQL Server 2017 No Yes
Teradata 16 Yes No

Oracle has the most sophisticated implementation, which supports both the ordered set aggregate function, and the window function version:

  • Aggregate function: PERCENTILE_DISC (0.5) WITHIN GROUP (ORDER BY x)
  • Window function: PERCENTILE_DISC (0.5) WITHIN GROUP (ORDER BY x) OVER (PARTITION BY y)

Workarounds if the feature is unavailable

Luckily, as soon as an RDBMS supports window functions, we can easily emulate PERCENTILE_DISC using PERCENT_RANK and FIRST_VALUE as follows. We’re using the Sakila database in this example.

Emulating window functions

Let’s emulate these first, as it requires a bit less SQL transformations. This query works out of the box in Oracle:

SELECT DISTINCT
  rating,
  percentile_disc(0.5) 
    WITHIN GROUP (ORDER BY length) 
    OVER() x1,
  percentile_disc(0.5) 
    WITHIN GROUP (ORDER BY length) 
    OVER (PARTITION BY rating) x2
FROM film
ORDER BY rating;

Yielding

RATING  X1      X2
-------------------
G       114     107
NC-17   114     112
PG      114     113
PG-13   114     125
R       114     115

What we can read from this is that the median length of all films is 114 minutes, and the median lengths of films per rating range from 107 minutes to 125 minutes. I’ve used DISTINCT because we don’t care about visualising these values on a per-row basis in this case. This also works in SQL Server.

Now, let’s assume we’re using PostgreSQL, which doesn’t support inverse distribution window functions, or MySQL, which doesn’t support inverse distribution functions at all, but both support PERCENT_RANK and FIRST_VALUE. Here’s the complete query:

SELECT DISTINCT
  rating,
  first_value(length) OVER (
    ORDER BY CASE WHEN p1 <= 0.5 THEN p1 END DESC NULLS LAST) x1,
  first_value(length) OVER (
    PARTITION BY rating 
    ORDER BY CASE WHEN p2 <= 0.5 THEN p2 END DESC NULLS LAST) x2
FROM (
  SELECT
    rating,
    length,
    percent_rank() OVER (ORDER BY length) p1,
    percent_rank() OVER (PARTITION BY rating ORDER BY length) p2
  FROM film
) t
ORDER BY rating;

So, we’re doing this in two steps (visual example further down):

  1. PERCENT_RANK: In a derived table, we’re calculating the PERCENT_RANK value, which attributes a rank to each row ordered by length, going from 0 to 1. This makes sense. When looking for the median value, we’re really looking for the value whose PERCENT_RANK is 0.5 or less. When looking for the 90% percentile, we’re looking for the value whose PERCENT_RANK is 0.9 or less
  2. FIRST_VALUE: Once we’ve found the PERCENT_RANK, we’re not quite done yet. We need to find the last row whose PERCENT_RANK is less or equal to the percentile we’re interested in. I could have used LAST_VALUE, but then I would have needed to resort to using the quite verbose range clause of window functions. Instead, I when ordering the rows by PERCENT_RANK (p1 or p2), I translated all ranks higher than the percentile I’m looking for into NULL using a CASE expression, and then I made sure using NULLS LAST that the percentile I’m looking for will be the first row in the FIRST_VALUE function’s window specification. Easy!

To visualise this, let’s run these queries, which also project the p1 and p2 values respectively:

SELECT
  length,
  CASE WHEN p1 <= 0.5 THEN p1 END::numeric(3,2) p1,
  first_value(length) OVER (
    ORDER BY CASE WHEN p1 <= 0.5 THEN p1 END DESC NULLS LAST) x1
FROM (
  SELECT
    length,
    percent_rank() OVER (ORDER BY length) p1
  FROM film
) t
ORDER BY length;

The result is

length |p1   |x1  |
-------|-----|----|
46     |0.00 |114 |
46     |0.00 |114 |
46     |0.00 |114 |
46     |0.00 |114 |
46     |0.00 |114 |
47     |0.01 |114 |
...
113    |0.49 |114 |
114    |0.49 |114 |
114    |0.49 |114 |
114    |0.49 |114 |
114    |0.49 |114 |
114    |0.49 |114 |
114    |0.49 |114 |
114    |0.49 |114 |
114    |0.49 |114 |
114    |0.49 |114 |
114    |0.49 |114 | <-- Last row whose PERCENT_RANK is <= 0.5
115    |     |114 |
115    |     |114 |
115    |     |114 |
115    |     |114 |
115    |     |114 |
115    |     |114 |
...
185    |     |114 |
185    |     |114 |
185    |     |114 |

So the FIRST_VALUE function just searches for that first row (descendingly, i.e. bottom up) whose p1 value is non-null.

The same for p2:

SELECT 
  length,
  rating,
  CASE WHEN p2 <= 0.5 THEN p2 END::numeric(3,2) p2,
  first_value(length) OVER (
    PARTITION BY rating 
    ORDER BY CASE WHEN p2 <= 0.5 THEN p2 END DESC NULLS LAST) x2
FROM (
  SELECT
    rating,
    length,
    percent_rank() OVER (PARTITION BY rating ORDER BY length) p2
  FROM film
) t
ORDER BY rating, length;

Yielding:

length |rating |p2   |x2  |
-------|-------|-----|----|
47     |G      |0.00 |107 |
47     |G      |0.00 |107 |
48     |G      |0.01 |107 |
48     |G      |0.01 |107 |
...
105    |G      |0.47 |107 |
106    |G      |0.49 |107 |
107    |G      |0.49 |107 |
107    |G      |0.49 |107 | <-- Last row in G partition whose
108    |G      |     |107 |     PERCENT_RANK is <= 0.5
108    |G      |     |107 |
109    |G      |     |107 |
...
185    |G      |     |107 |
185    |G      |     |107 |
46     |PG     |0.00 |113 |
47     |PG     |0.01 |113 |
47     |PG     |0.01 |113 |
...
111    |PG     |0.49 |113 |
113    |PG     |0.49 |113 |
113    |PG     |0.49 |113 | <-- Last row in PG partition whose
114    |PG     |     |113 |     PERCENT_RANK is <= 0.5
114    |PG     |     |113 |
...

Perfect! Notice if your RDBMS doesn’t support the NULLS LAST clause in your ORDER BY clause (e.g. MySQL), you might either hope that it defaults to sorting NULLS LAST (MySQL does), or you can emulate it as such:

-- This
ORDER BY x NULLS LAST

-- Is the same as this
ORDER BY
  CASE WHEN x IS NULL THEN 1 ELSE 0 END,
  x

Emulating aggregate functions

If you’re using SQL Server and want aggregate function behaviour, I recommend using the window function instead and emulate aggregation using DISTINCT. It will probably be easier than the emulation below. Do check for performance though!

When you’re using e.g. MySQL, which doesn’t have inverse distribution function support at all, then this chapter is for you.

Here’s how to use the aggregate function version in Oracle:

-- Without GROUP BY
SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY length) x1
FROM film;

-- With GROUP BY
SELECT
  rating,
  percentile_disc(0.5) WITHIN GROUP (ORDER BY length) x2
FROM film
GROUP BY rating
ORDER BY rating;

Trivial! The result is the same as before:

X1
---
114


RATING  X2
-----------
G       107
NC-17   112
PG      113
PG-13   125
R       115

Now, let’s emulate these on e.g. MySQL, using window functions.

-- Without GROUP BY
SELECT
  MAX(x1) x1
FROM (
  SELECT first_value(length) OVER (
    ORDER BY CASE WHEN p1 <= 0.5 THEN p1 END DESC NULLS LAST) x1
  FROM (
    SELECT
      length,
      percent_rank() OVER (ORDER BY length) p1
    FROM film
  ) t
) t;

It’s exactly the same technique as before, except we now have to turn the window function behaviour (don’t group, preserve rows, repeat aggregation value on each row) back into aggregate function behaviour (group, collapse rows) by using an aggregate function, such as MAX(). This is the same as what I did before with DISTINCT, for illustration purposes.

-- With GROUP BY
SELECT
  rating,
  MAX(x2) x2
FROM (
  SELECT
    rating,
    first_value(length) OVER (
      PARTITION BY rating 
      ORDER BY CASE WHEN p2 <= 0.5 THEN p2 END DESC NULLS LAST) x2
  FROM (
    SELECT
      rating,
      length,
      percent_rank() OVER (
        PARTITION BY rating 
        ORDER BY length) p2
    FROM film
  ) t
) t
GROUP BY rating
ORDER BY rating;

All we’re really doing (again) is translate the GROUP BY expression to a PARTITION BY expression in the window function, and then redo the previous exercise.

Conclusion

Window functions are extremely powerful. They can be used and combined to calculate a variety of other aggregations. With the above approach, we can calculate the PERCENTILE_DISC inverse distribution function, which is not readily available in most RDBMS using a more verbose but equally powerful approach that uses PERCENT_RANK and FIRST_VALUE in all RDBMS that support window functions. A similar exercise could be made with PERCENTILE_CONT with a slightly more tricky approach to finding that FIRST_VALUE, which I’ll leave as an exercise to the reader.

A future jOOQ version might emulate this for you, automatically.

Liked this article? You may also like 10 SQL Tricks That You Didn’t Think Were Possible.

Calculate Percentiles to Learn About Data Set Skew in SQL

B-Tree indexes are perfect when your data is uniformly distributed. They are not really useful, when you have skewed data. I’ll explain later why this is the case, but let’s first learn how to detect “skew”

What is skew?

Skew is a term from statistics when a normal distribution is not symmetric. The example given on Wikipedia shows a distribution like this:

In RDBMS, we sometimes use the term skew colloquially to mean the same thing as non-uniform distribution, i.e. a normal distribution would also be skewed. We simply mean that some values appear more often than others. Thus, I will put the term “skew” in double quotes in this article. While your RDBMS’s statistics contain this information once they are calculated, we can also detect such “skew” manually in ad-hoc queries using percentiles, which are defined in the SQL standard and supported in a variety of databases, as ordinary aggregate functions, including:

  • Oracle
  • PostgreSQL
  • SQL Server (regrettably, only as window functions)

Uniform distribution

Let’s look at the FILM_ID values in the Sakila database:

SELECT
  percentile_disc(0.0) WITHIN GROUP (ORDER BY film_id) AS "0%",
  percentile_disc(0.1) WITHIN GROUP (ORDER BY film_id) AS "10%",
  percentile_disc(0.2) WITHIN GROUP (ORDER BY film_id) AS "20%",
  percentile_disc(0.3) WITHIN GROUP (ORDER BY film_id) AS "30%",
  percentile_disc(0.4) WITHIN GROUP (ORDER BY film_id) AS "40%",
  percentile_disc(0.5) WITHIN GROUP (ORDER BY film_id) AS "50%",
  percentile_disc(0.6) WITHIN GROUP (ORDER BY film_id) AS "60%",
  percentile_disc(0.7) WITHIN GROUP (ORDER BY film_id) AS "70%",
  percentile_disc(0.8) WITHIN GROUP (ORDER BY film_id) AS "80%",
  percentile_disc(0.9) WITHIN GROUP (ORDER BY film_id) AS "90%",
  percentile_disc(1.0) WITHIN GROUP (ORDER BY film_id) AS "100%"
FROM film;

What are we calculating here? We’re trying to find 11 different values for which we can say that:

  • 0% of the film_ids are lower than the “0%” value
  • 10% of the film_ids are lower than the “10%” value

Or in other words:

  • 0% is the MIN(film_id) value
  • 50% is the MEDIAN(film_id) value
  • 100% is the MAX(film_id) value

The result shows an unsurprisingly uniform distribution:

0% |10% |20% |30% |40% |50% |60% |70% |80% |90% |100% |
---|----|----|----|----|----|----|----|----|----|-----|
1  |100 |200 |300 |400 |500 |600 |700 |800 |900 |1000 |

We can plot this in Microsoft Excel or some other tool to get this nice curve:

This is not surprising, as the IDs are just consecutive values, which is a desired property of surrogate keys.

“Skewed” distribution

It’s a different story when we look at the distribution of amounts in the payment table:

SELECT
  percentile_disc(0.0) WITHIN GROUP (ORDER BY amount) AS "0%",
  percentile_disc(0.1) WITHIN GROUP (ORDER BY amount) AS "10%",
  percentile_disc(0.2) WITHIN GROUP (ORDER BY amount) AS "20%",
  percentile_disc(0.3) WITHIN GROUP (ORDER BY amount) AS "30%",
  percentile_disc(0.4) WITHIN GROUP (ORDER BY amount) AS "40%",
  percentile_disc(0.5) WITHIN GROUP (ORDER BY amount) AS "50%",
  percentile_disc(0.6) WITHIN GROUP (ORDER BY amount) AS "60%",
  percentile_disc(0.7) WITHIN GROUP (ORDER BY amount) AS "70%",
  percentile_disc(0.8) WITHIN GROUP (ORDER BY amount) AS "80%",
  percentile_disc(0.9) WITHIN GROUP (ORDER BY amount) AS "90%",
  percentile_disc(1.0) WITHIN GROUP (ORDER BY amount) AS "100%"
FROM payment;

We’re now getting:

0%   |10%  |20%  |30%  |40%  |50%  |60%  |70%  |80%  |90%  |100% 
-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----
0.00 |0.99 |1.99 |2.99 |2.99 |3.99 |4.99 |4.99 |5.99 |6.99 |11.99

This looks … “skewed”, although clearly the bias is mainly caused by the fact that this data is generated. When we plot the above, we’re getting:

The slope is less steep at the beginning of this curve, which essentially means that more values exist at the lower end of the range than at the upper end. We can validate this with another query:

SELECT amount, count(*)
FROM (
  SELECT trunc(amount) AS amount
  FROM payment
) t 
GROUP BY amount
ORDER BY amount;

… which yields:

amount |count |
-------|------|
0      |3003  |
1      |641   |
2      |3542  |
3      |1117  |
4      |3789  |
5      |1306  |
6      |1119  |
7      |675   |
8      |486   |
9      |257   |
10     |104   |
11     |10    |

Plotted:

When plotting this, we can see that there are more amounts in the lower half of the range than in the upper half, which leads to percentiles growing slower.

Correlations

This technique can also be applied to detect correlations in data. We can, for instance, try to find the percentiles of the length of films, and group data sets by rating. I’m using a GROUPING SETS function here, the ROLLUP() function, to calculate the grand total as well. Just check out the query and its results, and you’ll see:

SELECT
  rating,
  count(*),
  percentile_disc(0.0) WITHIN GROUP (ORDER BY length) AS "0%",
  percentile_disc(0.1) WITHIN GROUP (ORDER BY length) AS "10%",
  percentile_disc(0.2) WITHIN GROUP (ORDER BY length) AS "20%",
  percentile_disc(0.3) WITHIN GROUP (ORDER BY length) AS "30%",
  percentile_disc(0.4) WITHIN GROUP (ORDER BY length) AS "40%",
  percentile_disc(0.5) WITHIN GROUP (ORDER BY length) AS "50%",
  percentile_disc(0.6) WITHIN GROUP (ORDER BY length) AS "60%",
  percentile_disc(0.7) WITHIN GROUP (ORDER BY length) AS "70%",
  percentile_disc(0.8) WITHIN GROUP (ORDER BY length) AS "80%",
  percentile_disc(0.9) WITHIN GROUP (ORDER BY length) AS "90%",
  percentile_disc(1.0) WITHIN GROUP (ORDER BY length) AS "100%"
FROM film
GROUP BY ROLLUP(rating);

This yields:

rating |count |0% |10% |20% |30% |40% |50% |60% |70% |80% |90% |100% |
-------|------|---|----|----|----|----|----|----|----|----|----|-----|
G      |178   |47 |57  |67  |80  |93  |107 |121 |138 |156 |176 |185  |
PG     |194   |46 |58  |72  |85  |99  |113 |122 |137 |151 |168 |185  |
PG-13  |223   |46 |61  |76  |92  |110 |125 |138 |150 |162 |176 |185  |
R      |195   |49 |68  |82  |90  |104 |115 |129 |145 |160 |173 |185  |
NC-17  |210   |46 |58  |74  |84  |97  |112 |125 |138 |153 |174 |184  |
       |1000  |46 |60  |74  |86  |102 |114 |128 |142 |156 |173 |185  |

So, the GROUP BY clause produced one row per rating, and an additional grand total column at the bottom. For illustration purposes, I’ve added the COUNT(*) column, to show how many films are in each group. The 5 first rows sum up to 1000, which is again the grand total at the bottom.

Let’s plot the percentiles now as line and bar charts:

We can “see” that there is no strong correlation between the two data points. Both data sets are close to uniformly distributed, quite independently of the rating, with the exception of PG-13, which is just slightly skewed towards longer film lengths.

Again, this isn’t terribly interesting as the data set was generated, probably using some randomness to avoid perfectly uniform distribution. In real world scenarios, the above data would have been more “skewed”.

How does this help with performance?

A balanced tree index is very useful when data is quite uniformly distributed, because in that case, it can help access data points or ranges of data in O(log(N)) time. This is quite a useful property for queries that look for film_id values, e.g.

SELECT *
FROM film
WHERE film_id = 1

When accessing “skewed” data, some values are more equal than others. This means that for example if we’re looking for amounts in the payment table, these two queries are not the same:

-- A lot of rows returned (3644)
SELECT * FROM payment WHERE amount BETWEEN 0 AND 2;

-- Few rows returned (361)
SELECT * FROM payment WHERE amount BETWEEN 9 AND 11;

An index on the amount column could have been useful for the second query, but maybe not for the first one.

There are several things we can do to make sure optimal index usage is being applied for all sorts of queries. In case of uniformly distributed data, we usually don’t have to do anything as SQL developers. In case of “skewed” data sets, it may be worth thinking about:

  • Using histogram statistics
  • Hinting the optimiser (in Oracle or SQL Server)
  • Avoiding bind variables (only in extreme cases)

Conclusion

Not all data sets are equal. They are often “skewed”. By “skewed”, in SQL, we don’t mean the statistical meaning of a normal distribution being skewed asymmetrically. We mean that a distribution is not uniform, so even a normal distribution is “skewed”. When it is, then some values appear way more often than others. Some examples are:

Uniform distribution

  • Surrogate keys generated from sequences (consecutive)
  • Surrogate keys generated from UUIDs (random)
  • Foreign keys on one-to-one relationships

Slight “skew”

Possibly significant “skew”

This really depends on the actual data set, but do expect significant “skew” in these data types

  • Foreign keys on to-many relationships (e.g. some customers have more assets than others)
  • Numeric values (e.g. amount)
  • Codes and other discrete values (e.g. film rating, payment settlement codes, etc.)

This article has shown how we can use simple SQL aggregate functions, including the percentiles, to calculate and visualise such “skew”.

How to Work Around ORA-38104: Columns referenced in the ON Clause cannot be updated

Standard SQL is a beautiful language. Vendor specific implementations, however, have their warts. In Oracle, for example, it’s not possible to update any columns in a MERGE statement, which have been referenced by the ON clause. For example:

CREATE TABLE person (
  id NUMBER(18) NOT NULL PRIMARY KEY,
  user_name VARCHAR2(50) NOT NULL UNIQUE,
  score NUMBER(18)
);

Now, in MySQL, we can run a non-standard INSERT .. ON DUPLICATE KEY UPDATE statement like this:

INSERT INTO person (id, user_name, score)
VALUES (1, 'foo', 100)
ON DUPLICATE KEY UPDATE
  SET user_name = 'foo', score = 100

Behind the scenes, MySQL will check all unique constraints for duplicates and reject the insert, replacing it by the update statement instead. It’s debatable whether this is really useful (ideally, we want to check only a single unique constraint for duplicates), but that’s what MySQL offers.

In case we want to run the same behaviour by Oracle, we could use the MERGE statement:

MERGE INTO person t
USING (
  SELECT 1 id, 'foo' user_name, 100 score
  FROM dual
) s
ON (t.id = s.id OR t.user_name = s.user_name)
WHEN MATCHED THEN UPDATE
  SET t.user_name = s.user_name, t.score = 100
WHEN NOT MATCHED THEN INSERT (id, user_name, score)
  VALUES (s.id, s.user_name, s.score)

That looks reasonable, but it doesn’t work. We’ll get:

SQL-Fehler: ORA-38104: Columns referenced in the ON Clause cannot be updated: “T”.”USER_NAME”

Obviously, this is some protection against the situation where such an update would suddenly move a row from the matched to the not matched group. In this particular example, it might not look like something that could cause problems, but if vendor specific extensions such as the WHERE or DELETE clause would be used, things might look different.

However, the parser is not very smart, in fact, it is almost not smart at all. While it detects extremely silly attempts at circumventing this limitation, such as this:

MERGE INTO person t
USING (
  SELECT 1 id, 'foo' user_name, 100 score
  FROM dual
) s
-- Circumvention attempt here: NVL()
ON (t.id = s.id OR nvl(t.user_name, null) = s.user_name)
WHEN MATCHED THEN UPDATE
  SET t.user_name = s.user_name, t.score = 100
WHEN NOT MATCHED THEN INSERT (id, user_name, score)
  VALUES (s.id, s.user_name, s.score)

It does not detect any of these attempts:

Using row value expressions

MERGE INTO person t
USING (
  SELECT 1 id, 'foo' user_name, 100 score
  FROM dual
) s
ON (t.id = s.id OR 
-- Circumvention attempt here: row value expressions
  (t.user_name, 'dummy') = ((s.user_name, 'dummy')))
WHEN MATCHED THEN UPDATE
  SET t.user_name = s.user_name, t.score = 100
WHEN NOT MATCHED THEN INSERT (id, user_name, score)
  VALUES (s.id, s.user_name, s.score)

Seemingly without any penalty on the execution plan. Both indexes are being used:

---------------------------------------------------------------------------
| Id  | Operation                               | Name            | Rows  |
---------------------------------------------------------------------------
|   0 | MERGE STATEMENT                         |                 |     1 |
|   1 |  MERGE                                  | PERSON          |       |
|   2 |   VIEW                                  |                 |       |
|   3 |    NESTED LOOPS OUTER                   |                 |     1 |
|   4 |     FAST DUAL                           |                 |     1 |
|   5 |     VIEW                                | VW_LAT_8626BD41 |     1 |
|   6 |      TABLE ACCESS BY INDEX ROWID BATCHED| PERSON          |     1 |
|   7 |       BITMAP CONVERSION TO ROWIDS       |                 |       |
|   8 |        BITMAP OR                        |                 |       |
|   9 |         BITMAP CONVERSION FROM ROWIDS   |                 |       |
|* 10 |          INDEX RANGE SCAN               | SYS_C00106110   |       |
|  11 |         BITMAP CONVERSION FROM ROWIDS   |                 |       |
|* 12 |          INDEX RANGE SCAN               | SYS_C00106111   |       |
---------------------------------------------------------------------------

Correlated subquery

MERGE INTO person t
USING (
  SELECT 1 id, 'foo' user_name, 100 score
  FROM dual
) s
ON (t.id = s.id OR 
-- Circumvention attempt here: correlated subquery
  (SELECT t.user_name FROM dual) = s.user_name)
WHEN MATCHED THEN UPDATE
  SET t.user_name = s.user_name, t.score = 100
WHEN NOT MATCHED THEN INSERT (id, user_name, score)
  VALUES (s.id, s.user_name, s.score)

This seems to prevent any index usage, and should thus be avoided:

----------------------------------------------------------
| Id  | Operation              | Name            | Rows  |
----------------------------------------------------------
|   0 | MERGE STATEMENT        |                 |     1 |
|   1 |  MERGE                 | PERSON          |       |
|   2 |   VIEW                 |                 |       |
|   3 |    NESTED LOOPS OUTER  |                 |     1 |
|   4 |     FAST DUAL          |                 |     1 |
|   5 |     VIEW               | VW_LAT_1846A928 |     1 |
|*  6 |      FILTER            |                 |       |
|   7 |       TABLE ACCESS FULL| PERSON          |     1 |
|   8 |       FAST DUAL        |                 |     1 |
----------------------------------------------------------

Using NVL() and updating a view instead

Just plain simple usage of NVL() inside of the ON clause didn’t work before. The parser was smart enough to detect that. But it isn’t smart enough to detect NVL() inside of a view / derived table.

MERGE INTO (
  SELECT id, user_name, nvl(user_name, null) n, score
  FROM person
) t
USING (
  SELECT 1 id, 'foo' user_name, 100 score
  FROM dual
) s
-- Circumvention attempt here: renamed column
ON (t.id = s.id OR t.n = s.user_name)
WHEN MATCHED THEN UPDATE
  SET t.user_name = s.user_name, t.score = 100
WHEN NOT MATCHED THEN INSERT (id, user_name, score)
  VALUES (s.id, s.user_name, s.score)

Notice that both USER_NAME and N columns are the same thing, but the parser doesn’t notice this and thinks we’re fine.

The execution plan is still optimal, as Oracle seems to have a way to optimise NVL() expressions (but not coalesce and others!):

---------------------------------------------------------------------------
| Id  | Operation                               | Name            | Rows  |
---------------------------------------------------------------------------
|   0 | MERGE STATEMENT                         |                 |     1 |
|   1 |  MERGE                                  | PERSON          |       |
|   2 |   VIEW                                  |                 |       |
|   3 |    NESTED LOOPS OUTER                   |                 |     1 |
|   4 |     FAST DUAL                           |                 |     1 |
|   5 |     VIEW                                | VW_LAT_46651921 |     1 |
|   6 |      TABLE ACCESS BY INDEX ROWID BATCHED| PERSON          |     1 |
|   7 |       BITMAP CONVERSION TO ROWIDS       |                 |       |
|   8 |        BITMAP OR                        |                 |       |
|   9 |         BITMAP CONVERSION FROM ROWIDS   |                 |       |
|* 10 |          INDEX RANGE SCAN               | SYS_C00106110   |       |
|  11 |         BITMAP CONVERSION FROM ROWIDS   |                 |       |
|* 12 |          INDEX RANGE SCAN               | SYS_C00106111   |       |
---------------------------------------------------------------------------

Using the WHERE clause

If we hadn’t had an OR predicate in our ON clause, but a AND predicate, then we could have used the WHERE clause in Oracle. This works:

-- NOT the same query as the original one!
MERGE INTO person t
USING (
  SELECT 1 id, 'foo' user_name, 100 score
  FROM dual
) s
ON (t.id = s.id)
WHEN MATCHED THEN UPDATE
  SET t.user_name = s.user_name, t.score = 100
  WHERE t.user_name = s.user_name
WHEN NOT MATCHED THEN INSERT (id, user_name, score)
  VALUES (s.id, s.user_name, s.score);

This is not the same query as the original one. I just listed it here for completeness’ sake. Also to remind readers of the fact that this approach as well doesn’t seem to use indexes optimally. Only the primary key index (from the ON clause) seems to be used. The unique key is not being used:

----------------------------------------------------------------
| Id  | Operation                      | Name          | Rows  |
----------------------------------------------------------------
|   0 | MERGE STATEMENT                |               |     1 |
|   1 |  MERGE                         | PERSON        |       |
|   2 |   VIEW                         |               |       |
|   3 |    NESTED LOOPS OUTER          |               |     1 |
|   4 |     VIEW                       |               |     1 |
|   5 |      FAST DUAL                 |               |     1 |
|   6 |     TABLE ACCESS BY INDEX ROWID| PERSON        |     1 |
|*  7 |      INDEX UNIQUE SCAN         | SYS_C00106110 |     1 |
----------------------------------------------------------------

Careful

Be careful when applying the above workarounds. Assuming that ORA-38104 is a good thing (i.e. that Oracle still thinks it should be enforced), then the above workarounds simply expose bugs in the parser, which should detect such cases. The above behaviour has been observed in Oracle 12c and 18c.

I personally believe that ORA-38104 should be abandoned entirely, and the root cause for this restriction should be removed. But it is certainly worth exploring alternative options rather than relying on the above workarounds in production code, apart from the occasional one-shot migration query, where such loop holes are always nice tools to exploit.

How to Aggregate an Archive Log’s Deltas into a Snapshot with SQL

A customer of my popular SQL training (which you should book!) has recently challenged me to optimise a hierarchical query that merges an archive log’s deltas in order to obtain a snapshot of some record at a given point in time. In this article, I will reproduce their problem statement in a simplified version and show how this can be done with SQL Server, using a few cool SQL features:

All of these are topics covered in the training, which were immediately applicable to this problem statement.

The problem statement

This was their archive design. They designed for uncertainty, meaning that for some entities in their system, they did not know what kinds of attributes will be part of the entity in the future. Given their application design, users could even add their own custom attributes to an entity.

This kind of thing is typically solved with the EAV (Entity Attribute Value) model, a “workaround” to denormalise data sets in SQL databases in the event of such schema uncertainty.

EAV can be implemented in several ways:

Through classic SQL tables only

An example implementation is this:

CREATE TABLE eav_classic (
  entity_type     VARCHAR (100) NOT NULL,
  entity_id       BIGINT        NOT NULL,
  attribute_name  VARCHAR (100) NOT NULL,
  attribute_type  VARCHAR (100) NOT NULL,
  attribute_value VARCHAR (100)     NULL,

  CONSTRAINT eav_classic_pk 
    PRIMARY KEY (entity_type, entity_id, attribute_name)
);

The drawbacks of this non-normalised design are immediately obvious. Most specifically, there is no simple way to establish referential integrity. But this may be totally OK, especially for archive logs, and for smaller databases (datomic does something similar)

Through tables containing JSON or XML data

Whenever you have schema-on-read data, JSON or XML data types may be appropriate, so this is a perfectly valid alternative:

CREATE TABLE eav_json (
  entity_type     VARCHAR (100)   NOT NULL,
  entity_id       BIGINT          NOT NULL,
  attributes      VARCHAR (10000) NOT NULL 
    CHECK (ISJSON(attributes) = 1),

  CONSTRAINT eav_json_pk 
    PRIMARY KEY (entity_type, entity_id)
);

If your database supports a JSON data type, obviously, you will prefer that over the above emulation

For the rest of this article, I will use the JSON

Versioning the EAV table

Versioning data in an EAV model is quite easier than in a normalised schema. We can just add a version number and/or timestamp to the record. In their case, something like this may make sense:

CREATE TABLE history (
  id          BIGINT IDENTITY (1, 1) NOT NULL PRIMARY KEY,
  ts          DATETIME               NOT NULL,
  entity_type VARCHAR(100)           NOT NULL,
  entity_id   BIGINT                 NOT NULL,
  delta       VARCHAR(8000)          NOT NULL 
    CHECK (ISJSON(delta) = 1)
);

INSERT INTO history (entity_type, entity_id, ts, delta)
VALUES ('Person', 1, '2000-01-01 00:00:00', '{"first_name": "John", "last_name": "Doe"}'),
       ('Person', 1, '2000-01-01 01:00:00', '{"age": 37}'),
       ('Person', 1, '2000-01-01 02:00:00', '{"age": 38}'),
       ('Person', 1, '2000-01-01 03:00:00', '{"city": "New York"}'),
       ('Person', 1, '2000-01-01 04:00:00', '{"city": "Zurich", "age": null}')
;

This table now contains a set of deltas applied to the Person entity with ID = 1. It corresponds to the following sequence of SQL statements on an ordinary entity:

INSERT INTO person (id, first_name, last_name) 
  VALUES ('John', 'Doe');
UPDATE person SET age = 37 WHERE id = 1;
UPDATE person SET age = 38 WHERE id = 1;
UPDATE person SET city = 'New York' WHERE id = 1;
UPDATE person SET city = 'Zurich', age = null WHERE id = 1;

You could even see their hand-written log like a transaction log of the database system, kinda like what you can extract using products like Golden Gate or Debezium. If you think of the transaction log as an event stream, the RDBMS’s current data representation is like a snapshot that you can get when applying any number of deltas to your tables.

Sometimes, you don’t want to completely change your architecture and go full “event sourcing”, but just need this kind of log for a specific set of auditable entities. And e.g. for reasons like still supporting very old SQL Server versions, as well as supporting other databases, you may choose also not to use the SQL:2011 temporal table feature, which has also been implemented in SQL Server 2016 and more recent versions.

With that out of our way…

How to access any arbitrary snapshot version?

When we visually process our HISTORY table, we can see that Person ID = 1 had the following values at any given time:

TIME        FIRST_NAME    LAST_NAME    AGE    CITY
------------------------------------------------------
00:00:00    John          Doe
01:00:00    John          Doe          37
02:00:00    John          Doe          38
03:00:00    John          Doe          38     New York
04:00:00    John          Doe                 Zurich

Remember, this is always the same record of Person ID = 1, its snapshots represented at different times in the time axis. The goal here is to be able to find the record of John Doe at any given time.

Again, if we had been using the SQL:2011 temporal table feature, we could write

-- SQL Server
SELECT * 
FROM Person
FOR SYSTEM_TIME AS OF '2000-01-01 02:00:00.0000000'; 

-- Oracle (flashback query)
SELECT *
FROM Person
AS OF TIMESTAMP TIMESTAMP '2000-01-01 02:00:00'

Side note: Do note that Oracle’s flashback query needs to be properly configured:

  • Not all data is “flashbackable”
  • DDL tends to destroy the archive
  • Proper grants are needed to access the flashback archive

Similar limitations may apply in SQL Server.

What if the RDBMS can’t help us?

If again for some reason, we cannot use the RDBMS’s temporal table features, we’ll roll our own as we’ve seen. So, our query in SQL Server to access the snapshot at any given time may be this:

SELECT 
  '{' 
+ string_agg(
    CASE type WHEN 0 THEN NULL ELSE 
      '"' + [key] + '": ' + 
      CASE type WHEN 1 THEN '"' + value + '"' ELSE value END
    END, ', ') 
+ '}'
FROM (
  SELECT *, row_number() OVER (
    PARTITION BY [key] ORDER BY ts DESC) rn
  FROM history
  OUTER APPLY openjson(delta)
  
  -- Apply all deltas prior to any given snapshot
  WHERE ts <= '2000-01-01 02:00:00'
) t
WHERE rn = 1;

What does this query do? Consider again our deltas at 04:00:00:

TIME        FIRST_NAME    LAST_NAME    AGE    CITY
------------------------------------------------------
00:00:00    John          Doe
01:00:00    John          Doe          37
02:00:00    John          Doe          38
03:00:00    John          Doe          38     New York
04:00:00    John          Doe          -      Zurich

Observe how each value has some color encoding:

  • Strong, red: The current snapshot’s attribute value, when the last delta was applied to any given attribute
  • Strong, black: A previous snapshot’s attribute value, when a previous, superseded delta was applied to any given attribute
  • Light grey: A previous snapshot’s attribute value that was inherited from another previous delta

For any given snapshot, we want to find the Strong, red values. E.g. at a previous snapshot time, the color encoding would have been:

At 03:00:00

TIME        FIRST_NAME    LAST_NAME    AGE    CITY
------------------------------------------------------
00:00:00    John          Doe
01:00:00    John          Doe          37
02:00:00    John          Doe          38
03:00:00    John          Doe          38     New York

04:00:00    John          Doe          -      Zurich

At 02:00:00

TIME        FIRST_NAME    LAST_NAME    AGE    CITY
------------------------------------------------------
00:00:00    John          Doe
01:00:00    John          Doe          37
02:00:00    John          Doe          38

03:00:00    John          Doe          38     New York
04:00:00    John          Doe          -      Zurich

So, our query needs to find the delta that was applied last for any given attribute.

With SQL, we can find that easily. We can assign a row number to each delta per attribute in reverse order, something like this:

TIME        FIRST_NAME    LAST_NAME    AGE    CITY
------------------------------------------------------
00:00:00    John (1)      Doe (1)
01:00:00    John          Doe          37 (3)
02:00:00    John          Doe          38 (2)
03:00:00    John          Doe          38     New York (2)
04:00:00    John          Doe          - (1)  Zurich (1)

Once we have that row number, we just filter out only those deltas whose row number is 1. Something like:

SELECT [key], value, row_number() OVER (
  PARTITION BY [key] ORDER BY ts DESC) rn
FROM history OUTER APPLY openjson(delta)
ORDER BY [key], ts;

Notice the OUTER APPLY openjson(delta) syntax. This just expands the JSON structure into key/value/type columns, which we can use more easily in a SQL query. Other database systems may have similar syntax for similar purposes. The result of the above query is:

key        |value    |rn 
-----------|---------|---
age        |37       |3  
age        |38       |2  
age        |         |1  
city       |New York |2  
city       |Zurich   |1  
first_name |John     |1  
last_name  |Doe      |1  

Filtering the ones whose row number is 1:

SELECT [key], value
FROM (
  SELECT ts, [key], value, row_number() OVER (
    PARTITION BY [key] ORDER BY ts DESC) rn
  FROM history OUTER APPLY openjson(delta)
) t
WHERE rn = 1
ORDER BY ts, [key]

This yields:

key        |value  
-----------|-------
first_name |John   
last_name  |Doe    
age        |       
city       |Zurich 

Exactly the data we wanted, in key/value form. Notice that this filtering step could have been done with DISTINCT ON in PostgreSQL, or with KEEP (DENSE_RANK FIRST ORDER BY ..) in Oracle – an exercise which I shall leave to the reader (feel free to leave the solution in the comments!)

And now, finally, just re-assemble the JSON using SQL Server 2017 STRING_AGG. PostgreSQL would offer us JSON_AGG here, Oracle has JSON_OBJECTAGG. With STRING_AGG, you have to take care of manually escaping all values according to JSON syntax rules, which is bad. In my example, I just replaced ” by \”. Other characters need escaping too, so if there is a built-in feature, use that instead of string processing.

The STRING_AGG function aggregates a CASE expression which translates different JSON data types into different formats, where:

  • 0 is NULL (and nulls are not aggregated)
  • 1 is string
  • everything else can be taken at its value for simplicity, e.g. numbers or booleans

Every value (except nulls) are prefixed by the JSON object’s attribute name (“key”).

SELECT 
  '{' 
+ string_agg(
    CASE type WHEN 0 THEN NULL ELSE 
      '"' + replace([key], '"', '\"') + '": ' + 
      CASE type WHEN 1 THEN '"' + replace(value, '"', '\"') + '"' ELSE value END
    END, ', ') 
+ '}'
FROM (
  SELECT *, row_number() OVER (
    PARTITION BY [key] ORDER BY ts DESC) rn
  FROM history
  OUTER APPLY openjson(delta)
  
  -- Apply all deltas prior to any given snapshot
  WHERE ts <= '2000-01-01 04:00:00'
) t
WHERE rn = 1;

This produces

{"city": "Zurich", "first_name": "John", "last_name": "Doe"}

A final query, that gets us the entire history of snapshots (watch the performance on this one, could definitely be optimised):

SELECT ts, (
  SELECT 
    '{' 
  + string_agg(
      CASE type WHEN 0 THEN NULL ELSE 
        '"' + replace([key], '"', '\"') + '": ' + 
        CASE type WHEN 1 THEN '"' + replace(value, '"', '\"') + '"' ELSE value END
      END, ', ') 
  + '}'
  FROM (
    SELECT *, row_number() OVER (
      PARTITION BY [key] ORDER BY ts DESC) rn
    FROM history
    OUTER APPLY openjson(delta)
    
    -- Apply all deltas prior to any given snapshot
    WHERE ts <= x.ts
  ) t
  WHERE rn = 1
)
FROM history x
GROUP BY ts;

It yields:

ts       |                                                                          
---------|--------------------------------------------------------------------------
00:00:00 |{"first_name": "John", "last_name": "Doe"}                                
01:00:00 |{"age": 37, "first_name": "John", "last_name": "Doe"}                     
02:00:00 |{"age": 38, "first_name": "John", "last_name": "Doe"}                     
03:00:00 |{"age": 38, "city": "New York", "first_name": "John", "last_name": "Doe"} 
04:00:00 |{"city": "Zurich", "first_name": "John", "last_name": "Doe"}              

So, the complete history of all the snapshot versions of the Person with ID = 1.

Very cool, and definitely good enough for their archive / audit query requirements.

Writing Custom Aggregate Functions in SQL Just Like a Java 8 Stream Collector

All SQL databases support the standard aggregate functions COUNT(), SUM(), AVG(), MIN(), MAX().

Some databases support other aggregate functions, like:

  • EVERY()
  • STDDEV_POP()
  • STDDEV_SAMP()
  • VAR_POP()
  • VAR_SAMP()
  • ARRAY_AGG()
  • STRING_AGG()

But what if you want to roll your own?

Java 8 Stream Collector

When using Java 8 streams, we can easily roll our own aggregate function (i.e. a Collector). Let’s assume we want to find the second highest value in a stream. The highest value can be obtained like this:

System.out.println(
    Stream.of(1, 2, 3, 4)
          .collect(Collectors.maxBy(Integer::compareTo))
) ;

Yielding:

Optional[4]

Now, what about the second highest value? We can write the following collector:

System.out.println(
    Stream.of(1, 6, 2, 3, 4, 4, 5).parallel()
          .collect(Collector.of(
              () -> new int[] { 
                  Integer.MIN_VALUE, 
                  Integer.MIN_VALUE 
              },
              (a, i) -> {
                  if (a[0] < i) {
                      a[1] = a[0];
                      a[0] = i;
                  }
                  else if (a[1] < i)
                      a[1] = i;
              },
              (a1, a2) -> {
                  if (a2[0] > a1[0]) {
                      a1[1] = a1[0];
                      a1[0] = a2[0];

                      if (a2[1] > a1[1])
                          a1[1] = a2[1];
                  }
                  else if (a2[0] > a1[1])
                      a1[1] = a2[0];

                  return a1;
              },
              a -> a[1]
          ))
) ;

It doesn’t do anything fancy. It has these 4 functions:

  • Supplier<int[]>: A supplier that provides an intermediary int[] of length 2, initialised with Integer.MIN_VALUE, each. This array will remember the MAX() value in the stream at position 0 and the SECOND_MAX() value in the stream at position 1
  • BiConsumer<int[], Integer>: A accumulator that accumulates new values from the stream into our intermediary data structure.
  • BinaryOperator<int[]>: A combiner that combines two intermediary data structures. This is used for parallel streams only.
  • Function<int[], Integer>: The finisher function that extracts the SECOND_MAX() function from the second position in our intermediary array.

The output is now:

5

How to do the same thing with SQL?

Many SQL databases offer a very similar way of calculating custom aggregate functions. Here’s how to do the exact same thing with…

Oracle:

With the usual syntactic ceremony…

CREATE TYPE u_second_max AS OBJECT (

  -- Intermediary data structure
  MAX NUMBER,
  SECMAX NUMBER,

  -- Corresponds to the Collector.supplier() function
  STATIC FUNCTION ODCIAggregateInitialize(sctx IN OUT u_second_max) RETURN NUMBER,

  -- Corresponds to the Collector.accumulate() function
  MEMBER FUNCTION ODCIAggregateIterate(self IN OUT u_second_max, value IN NUMBER) RETURN NUMBER,

  -- Corresponds to the Collector.combineer() function
  MEMBER FUNCTION ODCIAggregateMerge(self IN OUT u_second_max, ctx2 IN u_second_max) RETURN NUMBER,

  -- Correspodns to the Collector.finisher() function
  MEMBER FUNCTION ODCIAggregateTerminate(self IN u_second_max, returnValue OUT NUMBER, flags IN NUMBER) RETURN NUMBER
)
/

-- This is our "colletor" implementation
CREATE OR REPLACE TYPE BODY u_second_max IS
  STATIC FUNCTION ODCIAggregateInitialize(sctx IN OUT u_second_max)
  RETURN NUMBER IS
  BEGIN
    SCTX := U_SECOND_MAX(0, 0);
    RETURN ODCIConst.Success;
  END;

  MEMBER FUNCTION ODCIAggregateIterate(self IN OUT u_second_max, value IN NUMBER) RETURN NUMBER IS
  BEGIN
    IF VALUE > SELF.MAX THEN
      SELF.SECMAX := SELF.MAX;
      SELF.MAX := VALUE;
    ELSIF VALUE > SELF.SECMAX THEN
      SELF.SECMAX := VALUE;
    END IF;
    RETURN ODCIConst.Success;
  END;

  MEMBER FUNCTION ODCIAggregateTerminate(self IN u_second_max, returnValue OUT NUMBER, flags IN NUMBER) RETURN NUMBER IS
  BEGIN
    RETURNVALUE := SELF.SECMAX;
    RETURN ODCIConst.Success;
  END;

  MEMBER FUNCTION ODCIAggregateMerge(self IN OUT u_second_max, ctx2 IN u_second_max) RETURN NUMBER IS
  BEGIN
    IF CTX2.MAX > SELF.MAX THEN
      SELF.SECMAX := SELF.MAX;
      SELF.MAX := CTX2.MAX;
    
      IF CTX2.SECMAX > SELF.SECMAX THEN
        SELF.SECMAX := CTX2.SECMAX;
      END IF;
    ELSIF CTX2.MAX > SELF.SECMAX THEN
      SELF.SECMAX := CTX2.MAX;
    END IF;
  
    RETURN ODCIConst.Success;
  END;
END;
/

-- Finally, we have to give this aggregate function a name
CREATE FUNCTION SECOND_MAX (input NUMBER) RETURN NUMBER
PARALLEL_ENABLE AGGREGATE USING u_second_max;
/

We can now run the above on the Sakila database:

SELECT 
  max(film_id), 
  second_max(film_id) 
FROM film;

To get:

MAX     SECOND_MAX
------------------
1000    999

And what’s even better, we can use the aggregate function as a window function for free!

SELECT 
  film_id,
  length,
  max(film_id) OVER (PARTITION BY length), 
  second_max(film_id) OVER (PARTITION BY length)
FROM film
ORDER BY length, film_id;

The above yields:

FILM_ID  LENGTH  MAX   SECOND_MAX
---------------------------------
15       46      730   505
469      46      730   505
504      46      730   505
505      46      730   505
730      46      730   505
237      47      869   784
247      47      869   784
393      47      869   784
398      47      869   784
407      47      869   784
784      47      869   784
869      47      869   784
2        48      931   866
410      48      931   866
575      48      931   866
630      48      931   866
634      48      931   866
657      48      931   866
670      48      931   866
753      48      931   866
845      48      931   866
866      48      931   866
931      48      931   866

Beautiful, right?

PostgreSQL

PostgreSQL supports a slightly more concise syntax in the CREATE AGGREGATE statement. If we don’t allow for parallelism, we can write this minimal implementation:

CREATE FUNCTION second_max_sfunc (
  state INTEGER[], data INTEGER
) RETURNS INTEGER[] AS
$$
BEGIN
  IF state IS NULL THEN
    RETURN ARRAY[data, NULL];
  ELSE
    RETURN CASE 
      WHEN state[1] > data
      THEN CASE 
        WHEN state[2] > data
        THEN state
        ELSE ARRAY[state[1], data]
      END
      ELSE ARRAY[data, state[1]]
    END;
  END IF;
END;
$$ LANGUAGE plpgsql;
/

CREATE FUNCTION second_max_ffunc (
  state INTEGER[]
) RETURNS INTEGER AS
$$
BEGIN
  RETURN state[2];
END;
$$ LANGUAGE plpgsql;

CREATE AGGREGATE second_max (INTEGER) (
  SFUNC     = second_max_sfunc,
  STYPE     = INTEGER[],
  FINALFUNC = second_max_ffunc
);

Here, we use the STYPE (Collector.supplier()), the SFUNC (Collector.accumulator()), and the FINALFUNC (Collector.finisher()) specifications.

Other databases

Many other databases allow for specifying user defined aggregate functions. Look up your database manual’s details to learn more. They always work in the same way as a Java 8 Collector.