Calculating Weighted Averages When Joining Tables in SQL

I stumbled upon a very interesting jOOQ question on Stack Overflow that required the calculation of a weighted average. Let’s look at why.

Problem description

Assuming you have this database (using PostgreSQL syntax):

create table transactions (
  id     bigint         not null primary key,
  lines  bigint         not null,
  price  numeric(18, 2) not null,
  profit numeric(18, 2) not null
);

create table lines (
  id             bigint         not null primary key,
  transaction_id bigint         not null references transactions,
  total          bigint         not null,
  quantity       bigint         not null,
  profit         numeric(18, 2) not null
);

As can be seen, this schema is slightly denormalised, as the number of lines per transaction is precalculated in the transactions.lines column. This will turn out to be quite useful for this calculation, but it isn’t strictly necessary.

Now, in the previously linked Stack Overflow question, a report was desired that would calculate:

  • An aggregation of sums as provided by the line items
  • An aggregation of averages as provided by the transactions

This would be straightforward with two separate queries:

Sums provided by the line items

SELECT
  sum(profit)   AS total_profit,
  sum(total)    AS total_sales_amount,
  sum(quantity) AS total_items_sold
FROM lines

Averages provided by the transactions

SELECT
  avg(lines)  AS avg_items_p_trx,
  avg(price)  AS avg_price_p_trx,
  avg(profit) AS avg_profit_p_trx
FROM transactions

So far so good.

Doing it in one query

Now, these queries are simplified from the original, which needed to join the two tables in order to add additional predicates. Also, let’s assume that these tables are quite large, so running two queries might lead to the report being too slow. A single query would be much better.

We might be tempted to simply combine the two:

-- Wrong query
SELECT
  sum(l.profit)   AS total_profit,
  sum(l.total)    AS total_sales_amount,
  sum(l.quantity) AS total_items_sold,
  avg(t.lines)    AS avg_items_p_trx,
  avg(t.price)    AS avg_price_p_trx,
  avg(t.profit)   AS avg_profit_p_trx
FROM lines AS l
JOIN transactions AS t ON t.id = l.transaction_id

But this query is wrong. While the sums are still correct, the averages are not, simply because the join duplicates each transaction row once per line. Imagine two transactions, one having 3 lines and the other 5 lines:

SELECT
  l.id    AS line_id,
  t.id    AS transaction_id,
  t.lines,
  t.price
FROM lines AS l
JOIN transactions AS t ON t.id = l.transaction_id

The output would be:

LINE_ID    TRANSACTION_ID    LINES    PRICE
-------------------------------------------
1          1                 3        20.00
2          1                 3        20.00
3          1                 3        20.00
4          2                 5       100.00
5          2                 5       100.00
6          2                 5       100.00
7          2                 5       100.00
8          2                 5       100.00

  • The average number of lines “avg_items_p_trx” should be 4 = (3 lines + 5 lines) / 2 transactions. But if we calculate avg(t.lines) over the entire data set, we get 4.25 = (3×3 lines + 5×5 lines) / 8 items.
  • The average price “avg_price_p_trx” should be 60.00 = (20.00 + 100.00) / 2 transactions. But if we calculate avg(t.price) over the entire data set, we get 70.00 = (3×20.00 + 5×100.00) / 8 items.
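
To make the duplication effect reproducible, here is a minimal, self-contained sketch of the wrong query in PostgreSQL syntax. The two common table expressions are stand-ins for the real tables and contain only the sample rows discussed above (2 transactions with 3 and 5 lines); they are an assumption for illustration, not part of the original schema.

```sql
WITH transactions (id, lines, price) AS (
  VALUES (1, 3, 20.00), (2, 5, 100.00)
),
lines (id, transaction_id) AS (
  VALUES (1, 1), (2, 1), (3, 1),
         (4, 2), (5, 2), (6, 2), (7, 2), (8, 2)
)
SELECT
  -- Each transaction is counted once per line, so these are
  -- averages over 8 joined rows, not over 2 transactions
  avg(t.lines) AS wrong_avg_items_p_trx,  -- (3*3 + 5*5) / 8
  avg(t.price) AS wrong_avg_price_p_trx   -- (3*20.00 + 5*100.00) / 8
FROM lines AS l
JOIN transactions AS t ON t.id = l.transaction_id;
```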

How can this be fixed?

Given that each transaction is duplicated because of the join with lines, we have to calculate a weighted average, not an ordinary average. The idea is that instead of using the AVG() aggregate function, we now have to divide the value we want to get an average of by the number of items (i.e. the number of times the value is repeated because of the join), and then divide the sum of that division by the number of transactions.
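
The cancellation behind this idea can be written as a short derivation. The notation is an assumption made for illustration only: T is the set of transactions in the result, L_t is the set of lines joined to transaction t, n_t = |L_t| is the number of those lines, and v_t is the per-transaction value (e.g. the price) that we want to average.

```latex
\frac{1}{|T|} \sum_{t \in T} \sum_{l \in L_t} \frac{v_t}{n_t}
  = \frac{1}{|T|} \sum_{t \in T} n_t \cdot \frac{v_t}{n_t}
  = \frac{1}{|T|} \sum_{t \in T} v_t
```

The inner sum runs over the n_t duplicates of transaction t, so dividing each duplicate by n_t restores exactly one v_t per transaction. This only holds as long as n_t matches the number of rows that the join actually produces for that transaction.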

Prose never describes logic well, so let’s use code. The correct query is:

SELECT
  sum(l.profit)   AS total_profit,
  sum(l.total)    AS total_sales_amount,
  sum(l.quantity) AS total_items_sold,
  sum(t.lines  / t.lines) / count(DISTINCT t.id) avg_items_p_trx,
  sum(t.price  / t.lines) / count(DISTINCT t.id) avg_price_p_trx,
  sum(t.profit / t.lines) / count(DISTINCT t.id) avg_profit_p_trx
FROM lines AS l
JOIN transactions AS t ON t.id = l.transaction_id

With the above data set:

LINE_ID  TRANSACTION_ID  LINES  LINES/LINES   PRICE  PRICE/LINES
----------------------------------------------------------------
1        1               3      1             20.00         6.66
2        1               3      1             20.00         6.66
3        1               3      1             20.00         6.66
4        2               5      1            100.00        20.00
5        2               5      1            100.00        20.00
6        2               5      1            100.00        20.00
7        2               5      1            100.00        20.00
8        2               5      1            100.00        20.00

We now get the correct weighted averages:

  • The average number of lines “avg_items_p_trx” is now 4 =
    (3/3 + 3/3 + 3/3 + 5/5 + 5/5 + 5/5 + 5/5 + 5/5) / 2 distinct transactions
  • The average price “avg_price_p_trx” is now 60.00 =
    (20.00/3 + 20.00/3 + 20.00/3 + 100.00/5 + 100.00/5 + 100.00/5 + 100.00/5 + 100.00/5) / 2 distinct transactions
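
The weighted calculation can be tried out on the same sample data. This is a minimal sketch in PostgreSQL syntax; the common table expressions are hypothetical stand-ins for the real tables and only carry the columns needed here.

```sql
WITH transactions (id, lines, price) AS (
  VALUES (1, 3, 20.00), (2, 5, 100.00)
),
lines (id, transaction_id) AS (
  VALUES (1, 1), (2, 1), (3, 1),
         (4, 2), (5, 2), (6, 2), (7, 2), (8, 2)
)
SELECT
  -- Each joined row contributes 1/lines of its transaction's value,
  -- so every transaction adds up to exactly one full contribution
  sum(t.lines / t.lines) / count(DISTINCT t.id) AS avg_items_p_trx,
  sum(t.price / t.lines) / count(DISTINCT t.id) AS avg_price_p_trx
FROM lines AS l
JOIN transactions AS t ON t.id = l.transaction_id;
```

Note that avg_price_p_trx comes out at 60.00 only up to numeric rounding, because 20.00 / 3 has no exact decimal representation. Rounding the final result to the desired scale may be appropriate in a real report.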

Note that “avg_items_p_trx” can be simplified:

SELECT
  sum(l.profit)   AS total_profit,
  sum(l.total)    AS total_sales_amount,
  sum(l.quantity) AS total_items_sold,
  -- 1.0 * forces numeric division; dividing two bigint counts would truncate
  1.0 * count(*)          / count(DISTINCT t.id) avg_items_p_trx,
  sum(t.price  / t.lines) / count(DISTINCT t.id) avg_price_p_trx,
  sum(t.profit / t.lines) / count(DISTINCT t.id) avg_profit_p_trx
FROM lines AS l
JOIN transactions AS t ON t.id = l.transaction_id
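
A note on integer division, assuming PostgreSQL: count(*) and count(DISTINCT ...) both return bigint, and dividing one bigint by another truncates the result, so a ratio of two counts needs a numeric operand to keep its fraction. A minimal sketch with made-up literals:

```sql
SELECT
  7 / 2                  AS integer_division,  -- 3, the fraction is truncated
  CAST(7 AS numeric) / 2 AS numeric_division;  -- 3.5, as a numeric value
```

With the sample data, 8 lines / 2 transactions happens to divide evenly, which is why the problem does not show up there.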

Done!

Normalised version

Notice that this solution profited from the fact that the number of lines per transaction was pre-calculated. If it weren’t available, we could of course calculate it on the fly, e.g. using window functions:

SELECT
  sum(l.profit)   AS total_profit,
  sum(l.total)    AS total_sales_amount,
  sum(l.quantity) AS total_items_sold,
  -- 1.0 * forces numeric division; dividing two bigint counts would truncate
  1.0 * count(*)          / count(DISTINCT t.id) avg_items_p_trx,
  sum(t.price  / l.lines) / count(DISTINCT t.id) avg_price_p_trx,
  sum(t.profit / l.lines) / count(DISTINCT t.id) avg_profit_p_trx
FROM (
  SELECT 
    l.*,
    count(*) OVER (PARTITION BY l.transaction_id) lines
  FROM lines AS l
) AS l
JOIN transactions AS t ON t.id = l.transaction_id
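
To see what the derived table contributes, the window function can be run on its own. The following sketch uses a hypothetical inline data set in place of the real lines table (PostgreSQL syntax):

```sql
WITH lines (id, transaction_id) AS (
  VALUES (1, 1), (2, 1), (3, 1),
         (4, 2), (5, 2), (6, 2), (7, 2), (8, 2)
)
SELECT
  l.*,
  -- Number of lines sharing this row's transaction_id:
  -- 3 for the rows of transaction 1, 5 for the rows of transaction 2
  count(*) OVER (PARTITION BY l.transaction_id) AS lines
FROM lines AS l
ORDER BY l.id;
```

Unlike GROUP BY, the window function keeps all 8 rows and merely repeats the count on each of them, which is exactly the divisor the weighted average needs.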

Or, we turn the entire join into a 1:1 relationship by pre-aggregating all the data from lines into one row per transaction. This works because we only calculate sums from the lines table:

SELECT
  sum(l.profit_per_transaction)   AS total_profit,
  sum(l.total_per_transaction)    AS total_sales_amount,
  sum(l.quantity_per_transaction) AS total_items_sold,
  avg(l.lines_per_transaction)    AS avg_items_p_trx,
  avg(t.price)                    AS avg_price_p_trx,
  avg(t.profit)                   AS avg_profit_p_trx
FROM (
  SELECT 
    l.transaction_id,
    sum(l.profit)   AS profit_per_transaction,
    sum(l.total)    AS total_per_transaction,
    sum(l.quantity) AS quantity_per_transaction,
    count(*)        AS lines_per_transaction
  FROM lines AS l
  GROUP BY l.transaction_id
) AS l
JOIN transactions AS t ON t.id = l.transaction_id
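
The pre-aggregation approach can be checked on a small data set as well. This is a sketch in PostgreSQL syntax under the assumption of hypothetical inline data (every line has a quantity of 1); only a subset of the columns is shown.

```sql
WITH transactions (id, price) AS (
  VALUES (1, 20.00), (2, 100.00)
),
lines (id, transaction_id, quantity) AS (
  VALUES (1, 1, 1), (2, 1, 1), (3, 1, 1),
         (4, 2, 1), (5, 2, 1), (6, 2, 1), (7, 2, 1), (8, 2, 1)
)
SELECT
  sum(l.quantity_per_transaction) AS total_items_sold,  -- 8
  avg(l.lines_per_transaction)    AS avg_items_p_trx,   -- 4
  avg(t.price)                    AS avg_price_p_trx    -- 60
FROM (
  -- One row per transaction, so the join below is 1:1
  SELECT
    l.transaction_id,
    sum(l.quantity) AS quantity_per_transaction,
    count(*)        AS lines_per_transaction
  FROM lines AS l
  GROUP BY l.transaction_id
) AS l
JOIN transactions AS t ON t.id = l.transaction_id;
```

Because no transaction row is duplicated any more, the ordinary AVG() aggregate function is correct again.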

How to Calculate a Cumulative Percentage in SQL

A fun report to write is to calculate a cumulative percentage. For example, when querying the Sakila database, we might want to calculate the percentage of our total revenue at any given date.

The result can be plotted as a chart of the cumulative percentage per date. Notice the beautifully generated data. As raw data, it looks like this:

payment_date |amount  |percentage
-------------|--------|----------
2005-05-24   |29.92   |0.04      
2005-05-25   |573.63  |0.90      
2005-05-26   |754.26  |2.01      
2005-05-27   |685.33  |3.03      
2005-05-28   |804.04  |4.22      
2005-05-29   |648.46  |5.19      
2005-05-30   |628.42  |6.12      
2005-05-31   |700.37  |7.16      
...
2005-08-18   |2710.79 |79.59     
2005-08-19   |2615.72 |83.47     
2005-08-20   |2723.76 |87.51     
2005-08-21   |2809.41 |91.67     
2005-08-22   |2576.74 |95.49     
2005-08-23   |2523.01 |99.24     
2005-08-24   |514.18  |100.00    

In other words, at the beginning of our timeline, we’ve made 0% revenue, and then that percentage increases over time, until we reach 100% of our revenue at the end of our timeline.

How to do it?

We’re going to do it in two steps. Our PAYMENT table has a PAYMENT_DATE column, which is really a timestamp, i.e. the exact moment in time when we received a payment. We can query the table to see its data (I will be using PostgreSQL syntax in this post):

SELECT
  payment_date,
  amount
FROM payment
ORDER BY payment_date;

This yields:

payment_date        |amount
--------------------|------
2005-05-24 22:53:30 |2.99  
2005-05-24 22:54:33 |2.99  
2005-05-24 23:03:39 |3.99  
2005-05-24 23:04:41 |4.99  
2005-05-24 23:05:21 |6.99  
2005-05-24 23:08:07 |0.99  
2005-05-24 23:11:53 |1.99  
2005-05-24 23:31:46 |4.99  
2005-05-25 00:00:40 |4.99  
2005-05-25 00:02:21 |5.99  
2005-05-25 00:09:02 |8.99  
2005-05-25 00:19:27 |4.99  
2005-05-25 00:22:55 |6.99  
...

Now we could calculate that percentage on this timeline, but that wouldn’t be terribly interesting. We’re interested in the cumulative revenue per date, so let’s run a classic GROUP BY:

SELECT 
  CAST(payment_date AS DATE),
  sum(amount) AS amount
FROM payment
GROUP BY CAST(payment_date AS DATE)
ORDER BY CAST(payment_date AS DATE);

This yields the first two columns of our desired result:

payment_date |amount 
-------------|-------
2005-05-24   |29.92  
2005-05-25   |573.63 
2005-05-26   |754.26 
2005-05-27   |685.33 
2005-05-28   |804.04 
2005-05-29   |648.46 
2005-05-30   |628.42 
2005-05-31   |700.37 
...
2005-08-18   |2710.79
2005-08-19   |2615.72
2005-08-20   |2723.76
2005-08-21   |2809.41
2005-08-22   |2576.74
2005-08-23   |2523.01
2005-08-24   |514.18 

Now about that percentage. The formula in pseudo SQL is this:

cumulative_percentage[N] = SUM(amount[M <= N]) / SUM(amount[any])

In other words, the percentage of the revenue we’ve made up until a given day is equal to the SUM of all amounts until that day divided by the SUM of all amounts. We could do that relatively easily in Microsoft Excel. But we can also do it with SQL, using window functions. The syntax is:

-- Sum of all amounts until that day:
SUM(amount) OVER (ORDER BY payment_date)

-- Sum of all amounts
SUM(amount) OVER ()
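
Here is how the two window functions behave on a tiny data set. This is a minimal sketch in PostgreSQL syntax; the daily table expression and its three rows are made up for illustration.

```sql
WITH daily (payment_date, amount) AS (
  VALUES (DATE '2005-05-24', 10.00),
         (DATE '2005-05-25', 30.00),
         (DATE '2005-05-26', 60.00)
)
SELECT
  payment_date,
  amount,
  -- Running total up to and including the current date: 10, 40, 100
  sum(amount) OVER (ORDER BY payment_date) AS running_total,
  -- Total over all rows, repeated on every row: 100, 100, 100
  sum(amount) OVER ()                      AS grand_total
FROM daily
ORDER BY payment_date;
```

Dividing running_total by grand_total and multiplying by 100 gives cumulative percentages of 10, 40, and 100 for these three rows.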

So, let’s just plug that into our SQL. For simplicity, we’ll first nest our previous GROUP BY statement in a derived table:

SELECT 
  payment_date,
  amount,
  CAST(100 * sum(amount) OVER (ORDER BY payment_date) 
           / sum(amount) OVER () AS numeric(10, 2)) percentage
FROM (
  SELECT 
    CAST(payment_date AS DATE),
    sum(amount) AS amount
  FROM payment
  GROUP BY CAST(payment_date AS DATE)
) p
ORDER BY payment_date;

Running this yields the desired result:

payment_date |amount  |percentage
-------------|--------|----------
2005-05-24   |29.92   |0.04      
2005-05-25   |573.63  |0.90      
2005-05-26   |754.26  |2.01      
2005-05-27   |685.33  |3.03      
2005-05-28   |804.04  |4.22      
2005-05-29   |648.46  |5.19      
2005-05-30   |628.42  |6.12      
2005-05-31   |700.37  |7.16      
...
2005-08-18   |2710.79 |79.59     
2005-08-19   |2615.72 |83.47     
2005-08-20   |2723.76 |87.51     
2005-08-21   |2809.41 |91.67     
2005-08-22   |2576.74 |95.49     
2005-08-23   |2523.01 |99.24     
2005-08-24   |514.18  |100.00    

Bonus: Nest aggregate functions in window functions

Because of the nature of SQL syntax, and the fact that both GROUP BY and aggregate functions “happen before” window functions, i.e. they are calculated logically before window functions, we can nest aggregate functions in window functions.

This definitely doesn’t drastically improve readability, especially if you are not used to writing window functions every day. But in some more complex cases, it might help to shorten your SQL syntax. The above query is equivalent to this one:

SELECT 
  CAST(payment_date AS DATE) AS payment_date,
  sum(amount) AS amount,
  CAST(100 * sum(sum(amount)) OVER (
               ORDER BY CAST(payment_date AS DATE)) 
           / sum(sum(amount)) OVER () AS numeric(10, 2)) percentage
FROM payment
GROUP BY CAST(payment_date AS DATE)
ORDER BY CAST(payment_date AS DATE);

Beauty is in the eye of the beholder. My eye definitely likes this sum(sum(amount)) OVER () syntax. If you cannot decipher this, don’t worry. You’re not alone. I invite you to review the following post on the order of SQL operations, first.
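
For readers who want to experiment with the nesting, here is a reduced sketch in PostgreSQL syntax. The payment table expression is a hypothetical stand-in with four made-up rows, and payment_day is an alias chosen for this example.

```sql
WITH payment (payment_date, amount) AS (
  VALUES (TIMESTAMP '2005-05-24 22:53:30',  2.00),
         (TIMESTAMP '2005-05-24 23:03:39',  8.00),
         (TIMESTAMP '2005-05-25 00:00:40', 30.00),
         (TIMESTAMP '2005-05-26 10:00:00', 60.00)
)
SELECT
  CAST(payment_date AS DATE) AS payment_day,
  -- Inner sum(amount): ordinary aggregate, one value per day (10, 30, 60)
  sum(amount)                AS amount,
  -- Outer sum(...) OVER (): window function over the 3 grouped rows (100)
  sum(sum(amount)) OVER ()   AS grand_total
FROM payment
GROUP BY CAST(payment_date AS DATE)
ORDER BY payment_day;
```

The inner sum collapses the four payments into three daily rows, and only then does the outer window function see those three rows.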

How to Emulate PERCENTILE_DISC in MySQL and Other RDBMS

In my previous article, I showed what the very useful percentile functions (also known as inverse distribution functions) can be used for.

Unfortunately, these functions are not ubiquitously available in SQL dialects. As of jOOQ 3.11, they are known to work in these dialects:

Dialect         |As aggregate function |As window function
----------------|----------------------|------------------
MariaDB 10.3.3  |No                    |Yes
Oracle 18c      |Yes                   |Yes
PostgreSQL 11   |Yes                   |No
SQL Server 2017 |No                    |Yes
Teradata 16     |Yes                   |No

Oracle has the most sophisticated implementation, which supports both the ordered set aggregate function, and the window function version:

  • Aggregate function: PERCENTILE_DISC (0.5) WITHIN GROUP (ORDER BY x)
  • Window function: PERCENTILE_DISC (0.5) WITHIN GROUP (ORDER BY x) OVER (PARTITION BY y)

Workarounds if the feature is unavailable

Luckily, as soon as an RDBMS supports window functions, we can easily emulate PERCENTILE_DISC using PERCENT_RANK and FIRST_VALUE as follows. We’re using the Sakila database in this example.

Emulating window functions

Let’s emulate these first, as this requires slightly fewer SQL transformations. This query works out of the box in Oracle:

SELECT DISTINCT
  rating,
  percentile_disc(0.5) 
    WITHIN GROUP (ORDER BY length) 
    OVER() x1,
  percentile_disc(0.5) 
    WITHIN GROUP (ORDER BY length) 
    OVER (PARTITION BY rating) x2
FROM film
ORDER BY rating;

Yielding

RATING  X1      X2
-------------------
G       114     107
NC-17   114     112
PG      114     113
PG-13   114     125
R       114     115

What we can read from this is that the median length of all films is 114 minutes, and the median lengths of films per rating range from 107 minutes to 125 minutes. I’ve used DISTINCT because we don’t care about visualising these values on a per-row basis in this case. This also works in SQL Server.

Now, let’s assume we’re using PostgreSQL, which doesn’t support inverse distribution window functions, or MySQL, which doesn’t support inverse distribution functions at all, but both support PERCENT_RANK and FIRST_VALUE. Here’s the complete query:

SELECT DISTINCT
  rating,
  first_value(length) OVER (
    ORDER BY CASE WHEN p1 <= 0.5 THEN p1 END DESC NULLS LAST) x1,
  first_value(length) OVER (
    PARTITION BY rating 
    ORDER BY CASE WHEN p2 <= 0.5 THEN p2 END DESC NULLS LAST) x2
FROM (
  SELECT
    rating,
    length,
    percent_rank() OVER (ORDER BY length) p1,
    percent_rank() OVER (PARTITION BY rating ORDER BY length) p2
  FROM film
) t
ORDER BY rating;

So, we’re doing this in two steps (visual example further down):

  1. PERCENT_RANK: In a derived table, we’re calculating the PERCENT_RANK value, which attributes a rank to each row ordered by length, going from 0 to 1. This makes sense. When looking for the median value, we’re really looking for the value whose PERCENT_RANK is 0.5 or less. When looking for the 90th percentile, we’re looking for the value whose PERCENT_RANK is 0.9 or less.
  2. FIRST_VALUE: Once we’ve found the PERCENT_RANK, we’re not quite done yet. We need to find the last row whose PERCENT_RANK is less than or equal to the percentile we’re interested in. I could have used LAST_VALUE, but then I would have needed to resort to using the quite verbose range clause of window functions. Instead, when ordering the rows by PERCENT_RANK (p1 or p2), I translated all ranks higher than the percentile I’m looking for into NULL using a CASE expression, and then I made sure using NULLS LAST that the percentile I’m looking for will be the first row in the FIRST_VALUE function’s window specification. Easy!
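
For comparison, the LAST_VALUE alternative mentioned in step 2 might look like the following. This is only a sketch of that idea in PostgreSQL syntax, run on five made-up values instead of the film table; the names t, v, p, and median_disc are assumptions for this example.

```sql
WITH t (v) AS (VALUES (10), (20), (30), (40), (50))
SELECT
  v,
  p,
  last_value(v) OVER (
    -- Ranks above 0.5 become NULL and are sorted first, so the
    -- largest rank <= 0.5 ends up on the last row of the window
    ORDER BY CASE WHEN p <= 0.5 THEN p END NULLS FIRST
    -- Without this explicit frame, the window would end at the
    -- current row and last_value() would not see the last row
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS median_disc
FROM (
  SELECT v, percent_rank() OVER (ORDER BY v) AS p
  FROM t
) x
ORDER BY v;
```

With these values, PERCENT_RANK yields 0, 0.25, 0.5, 0.75, and 1, so every row reports 30 as median_disc. The extra frame clause is the verbosity that the FIRST_VALUE variant avoids.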

To visualise this, let’s run these queries, which also project the p1 and p2 values respectively:

SELECT
  length,
  CASE WHEN p1 <= 0.5 THEN p1 END::numeric(3,2) p1,
  first_value(length) OVER (
    ORDER BY CASE WHEN p1 <= 0.5 THEN p1 END DESC NULLS LAST) x1
FROM (
  SELECT
    length,
    percent_rank() OVER (ORDER BY length) p1
  FROM film
) t
ORDER BY length;

The result is

length |p1   |x1  |
-------|-----|----|
46     |0.00 |114 |
46     |0.00 |114 |
46     |0.00 |114 |
46     |0.00 |114 |
46     |0.00 |114 |
47     |0.01 |114 |
...
113    |0.49 |114 |
114    |0.49 |114 |
114    |0.49 |114 |
114    |0.49 |114 |
114    |0.49 |114 |
114    |0.49 |114 |
114    |0.49 |114 |
114    |0.49 |114 |
114    |0.49 |114 |
114    |0.49 |114 |
114    |0.49 |114 | <-- Last row whose PERCENT_RANK is <= 0.5
115    |     |114 |
115    |     |114 |
115    |     |114 |
115    |     |114 |
115    |     |114 |
115    |     |114 |
...
185    |     |114 |
185    |     |114 |
185    |     |114 |

So the FIRST_VALUE function just searches for that first row (descendingly, i.e. bottom up) whose p1 value is non-null.

The same for p2:

SELECT 
  length,
  rating,
  CASE WHEN p2 <= 0.5 THEN p2 END::numeric(3,2) p2,
  first_value(length) OVER (
    PARTITION BY rating 
    ORDER BY CASE WHEN p2 <= 0.5 THEN p2 END DESC NULLS LAST) x2
FROM (
  SELECT
    rating,
    length,
    percent_rank() OVER (PARTITION BY rating ORDER BY length) p2
  FROM film
) t
ORDER BY rating, length;

Yielding:

length |rating |p2   |x2  |
-------|-------|-----|----|
47     |G      |0.00 |107 |
47     |G      |0.00 |107 |
48     |G      |0.01 |107 |
48     |G      |0.01 |107 |
...
105    |G      |0.47 |107 |
106    |G      |0.49 |107 |
107    |G      |0.49 |107 |
107    |G      |0.49 |107 | <-- Last row in G partition whose
108    |G      |     |107 |     PERCENT_RANK is <= 0.5
108    |G      |     |107 |
109    |G      |     |107 |
...
185    |G      |     |107 |
185    |G      |     |107 |
46     |PG     |0.00 |113 |
47     |PG     |0.01 |113 |
47     |PG     |0.01 |113 |
...
111    |PG     |0.49 |113 |
113    |PG     |0.49 |113 |
113    |PG     |0.49 |113 | <-- Last row in PG partition whose
114    |PG     |     |113 |     PERCENT_RANK is <= 0.5
114    |PG     |     |113 |
...

Perfect! Notice that if your RDBMS doesn’t support the NULLS LAST clause in ORDER BY (e.g. MySQL), you might either rely on it sorting NULLs last by default (MySQL does so for descending order, which is what we’re using here), or you can emulate the clause as such:

-- This
ORDER BY x NULLS LAST

-- Is the same as this
ORDER BY
  CASE WHEN x IS NULL THEN 1 ELSE 0 END,
  x
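
Since the queries in this article sort in descending order, here is the same emulation applied to a DESC sort. This is a small sketch with made-up values that avoids the NULLS LAST keyword altogether:

```sql
WITH t (x) AS (VALUES (2), (NULL), (1), (3))
SELECT x
FROM t
ORDER BY
  CASE WHEN x IS NULL THEN 1 ELSE 0 END,  -- non-NULL rows (0) before NULL rows (1)
  x DESC;                                 -- then the actual descending sort
-- Returns 3, 2, 1, NULL
```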

Emulating aggregate functions

If you’re using SQL Server and want aggregate function behaviour, I recommend using the window function instead and emulating aggregation using DISTINCT. It will probably be easier than the emulation below. Do check for performance though!

When you’re using e.g. MySQL, which doesn’t have inverse distribution function support at all, then this chapter is for you.

Here’s how to use the aggregate function version in Oracle:

-- Without GROUP BY
SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY length) x1
FROM film;

-- With GROUP BY
SELECT
  rating,
  percentile_disc(0.5) WITHIN GROUP (ORDER BY length) x2
FROM film
GROUP BY rating
ORDER BY rating;

Trivial! The result is the same as before:

X1
---
114


RATING  X2
-----------
G       107
NC-17   112
PG      113
PG-13   125
R       115

Now, let’s emulate these on e.g. MySQL, using window functions.

-- Without GROUP BY
SELECT
  MAX(x1) x1
FROM (
  SELECT first_value(length) OVER (
    ORDER BY CASE WHEN p1 <= 0.5 THEN p1 END DESC NULLS LAST) x1
  FROM (
    SELECT
      length,
      percent_rank() OVER (ORDER BY length) p1
    FROM film
  ) t
) t;

It’s exactly the same technique as before, except we now have to turn the window function behaviour (don’t group, preserve rows, repeat aggregation value on each row) back into aggregate function behaviour (group, collapse rows) by using an aggregate function, such as MAX(). This is the same as what I did before with DISTINCT, for illustration purposes.

-- With GROUP BY
SELECT
  rating,
  MAX(x2) x2
FROM (
  SELECT
    rating,
    first_value(length) OVER (
      PARTITION BY rating 
      ORDER BY CASE WHEN p2 <= 0.5 THEN p2 END DESC NULLS LAST) x2
  FROM (
    SELECT
      rating,
      length,
      percent_rank() OVER (
        PARTITION BY rating 
        ORDER BY length) p2
    FROM film
  ) t
) t
GROUP BY rating
ORDER BY rating;

All we’re really doing (again) is translating the GROUP BY expression to a PARTITION BY expression in the window function, and then redoing the previous exercise.
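
On PostgreSQL, where the aggregate function is available natively, the emulation can be compared against it side by side. The following sketch does this on a hypothetical five-row film table expression; the names emulated and native are chosen for this example only.

```sql
WITH film (rating, length) AS (
  VALUES ('G', 50), ('G', 60), ('G', 70),
         ('PG', 80), ('PG', 90)
),
emulated AS (
  SELECT rating, max(x2) AS x2
  FROM (
    SELECT
      rating,
      first_value(length) OVER (
        PARTITION BY rating
        ORDER BY CASE WHEN p2 <= 0.5 THEN p2 END DESC NULLS LAST) AS x2
    FROM (
      SELECT
        rating,
        length,
        percent_rank() OVER (PARTITION BY rating ORDER BY length) AS p2
      FROM film
    ) t
  ) t
  GROUP BY rating
),
native AS (
  SELECT
    rating,
    percentile_disc(0.5) WITHIN GROUP (ORDER BY length) AS x2
  FROM film
  GROUP BY rating
)
-- Both columns should agree: 60 for G and 80 for PG
SELECT e.rating, e.x2 AS emulated_x2, n.x2 AS native_x2
FROM emulated AS e
JOIN native AS n ON n.rating = e.rating
ORDER BY e.rating;
```

A comparison like this on your own data is a cheap way to gain confidence in the emulation before relying on it.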

Conclusion

Window functions are extremely powerful. They can be used and combined to calculate a variety of other aggregations. With the above approach, we can calculate the PERCENTILE_DISC inverse distribution function, which is not readily available in most RDBMS, using a more verbose but equally powerful approach based on PERCENT_RANK and FIRST_VALUE, in all RDBMS that support window functions. A similar exercise could be done with PERCENTILE_CONT, with a slightly trickier approach to finding that FIRST_VALUE, which I’ll leave as an exercise to the reader.

A future jOOQ version might emulate this for you, automatically.

Calculate Percentiles to Learn About Data Set Skew in SQL

B-Tree indexes are perfect when your data is uniformly distributed. They are not really useful when you have skewed data. I’ll explain later why this is the case, but let’s first learn how to detect “skew”.

What is skew?

Skew is a term from statistics that describes a distribution that is not symmetric, unlike a normal distribution. The example given on Wikipedia shows a distribution like this:

In RDBMS, we sometimes use the term skew colloquially to mean the same thing as non-uniform distribution, i.e. a normal distribution would also be skewed. We simply mean that some values appear more often than others. Thus, I will put the term “skew” in double quotes in this article. While your RDBMS’s statistics contain this information once they are calculated, we can also detect such “skew” manually in ad-hoc queries using percentiles, which are defined in the SQL standard and supported in a variety of databases, as ordinary aggregate functions, including:

  • Oracle
  • PostgreSQL
  • SQL Server (regrettably, only as window functions)

Uniform distribution

Let’s look at the FILM_ID values in the Sakila database:

SELECT
  percentile_disc(0.0) WITHIN GROUP (ORDER BY film_id) AS "0%",
  percentile_disc(0.1) WITHIN GROUP (ORDER BY film_id) AS "10%",
  percentile_disc(0.2) WITHIN GROUP (ORDER BY film_id) AS "20%",
  percentile_disc(0.3) WITHIN GROUP (ORDER BY film_id) AS "30%",
  percentile_disc(0.4) WITHIN GROUP (ORDER BY film_id) AS "40%",
  percentile_disc(0.5) WITHIN GROUP (ORDER BY film_id) AS "50%",
  percentile_disc(0.6) WITHIN GROUP (ORDER BY film_id) AS "60%",
  percentile_disc(0.7) WITHIN GROUP (ORDER BY film_id) AS "70%",
  percentile_disc(0.8) WITHIN GROUP (ORDER BY film_id) AS "80%",
  percentile_disc(0.9) WITHIN GROUP (ORDER BY film_id) AS "90%",
  percentile_disc(1.0) WITHIN GROUP (ORDER BY film_id) AS "100%"
FROM film;

What are we calculating here? We’re trying to find 11 different values for which we can say that:

  • 0% of the film_ids are lower than the “0%” value
  • 10% of the film_ids are lower than the “10%” value

Or in other words:

  • 0% is the MIN(film_id) value
  • 50% is the MEDIAN(film_id) value
  • 100% is the MAX(film_id) value

The result shows an unsurprisingly uniform distribution:

0% |10% |20% |30% |40% |50% |60% |70% |80% |90% |100% |
---|----|----|----|----|----|----|----|----|----|-----|
1  |100 |200 |300 |400 |500 |600 |700 |800 |900 |1000 |

We can plot this in Microsoft Excel or some other tool to get this nice curve:

This is not surprising, as the IDs are just consecutive values, which is a desired property of surrogate keys.
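
In PostgreSQL, the eleven expressions can also be written more compactly, because percentile_disc accepts an array of fractions and then returns an array of values. The sketch below assumes that generate_series(1, 1000) is an acceptable stand-in for the consecutive film_id values, so that it runs without the Sakila database:

```sql
SELECT
  percentile_disc(
    ARRAY[0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
  ) WITHIN GROUP (ORDER BY film_id) AS percentiles  -- one array with 11 values
FROM generate_series(1, 1000) AS t (film_id);
```

The array form is specific to PostgreSQL, so the one-column-per-percentile version remains the more portable choice.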

“Skewed” distribution

It’s a different story when we look at the distribution of amounts in the payment table:

SELECT
  percentile_disc(0.0) WITHIN GROUP (ORDER BY amount) AS "0%",
  percentile_disc(0.1) WITHIN GROUP (ORDER BY amount) AS "10%",
  percentile_disc(0.2) WITHIN GROUP (ORDER BY amount) AS "20%",
  percentile_disc(0.3) WITHIN GROUP (ORDER BY amount) AS "30%",
  percentile_disc(0.4) WITHIN GROUP (ORDER BY amount) AS "40%",
  percentile_disc(0.5) WITHIN GROUP (ORDER BY amount) AS "50%",
  percentile_disc(0.6) WITHIN GROUP (ORDER BY amount) AS "60%",
  percentile_disc(0.7) WITHIN GROUP (ORDER BY amount) AS "70%",
  percentile_disc(0.8) WITHIN GROUP (ORDER BY amount) AS "80%",
  percentile_disc(0.9) WITHIN GROUP (ORDER BY amount) AS "90%",
  percentile_disc(1.0) WITHIN GROUP (ORDER BY amount) AS "100%"
FROM payment;

We’re now getting:

0%   |10%  |20%  |30%  |40%  |50%  |60%  |70%  |80%  |90%  |100% 
-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----
0.00 |0.99 |1.99 |2.99 |2.99 |3.99 |4.99 |4.99 |5.99 |6.99 |11.99

This looks … “skewed”, although clearly the bias is mainly caused by the fact that this data is generated. When we plot the above, we’re getting:

The slope is less steep at the beginning of this curve, which essentially means that more values exist at the lower end of the range than at the upper end. We can validate this with another query:

SELECT amount, count(*)
FROM (
  SELECT trunc(amount) AS amount
  FROM payment
) t 
GROUP BY amount
ORDER BY amount;

… which yields:

amount |count |
-------|------|
0      |3003  |
1      |641   |
2      |3542  |
3      |1117  |
4      |3789  |
5      |1306  |
6      |1119  |
7      |675   |
8      |486   |
9      |257   |
10     |104   |
11     |10    |

Plotted:

When plotting this, we can see that there are more amounts in the lower half of the range than in the upper half, which leads to percentiles growing slower.

Correlations

This technique can also be applied to detect correlations in data. We can, for instance, try to find the percentiles of the length of films, and group data sets by rating. I’m using a GROUPING SETS function here, the ROLLUP() function, to calculate the grand total as well. Just check out the query and its results, and you’ll see:

SELECT
  rating,
  count(*),
  percentile_disc(0.0) WITHIN GROUP (ORDER BY length) AS "0%",
  percentile_disc(0.1) WITHIN GROUP (ORDER BY length) AS "10%",
  percentile_disc(0.2) WITHIN GROUP (ORDER BY length) AS "20%",
  percentile_disc(0.3) WITHIN GROUP (ORDER BY length) AS "30%",
  percentile_disc(0.4) WITHIN GROUP (ORDER BY length) AS "40%",
  percentile_disc(0.5) WITHIN GROUP (ORDER BY length) AS "50%",
  percentile_disc(0.6) WITHIN GROUP (ORDER BY length) AS "60%",
  percentile_disc(0.7) WITHIN GROUP (ORDER BY length) AS "70%",
  percentile_disc(0.8) WITHIN GROUP (ORDER BY length) AS "80%",
  percentile_disc(0.9) WITHIN GROUP (ORDER BY length) AS "90%",
  percentile_disc(1.0) WITHIN GROUP (ORDER BY length) AS "100%"
FROM film
GROUP BY ROLLUP(rating);

This yields:

rating |count |0% |10% |20% |30% |40% |50% |60% |70% |80% |90% |100% |
-------|------|---|----|----|----|----|----|----|----|----|----|-----|
G      |178   |47 |57  |67  |80  |93  |107 |121 |138 |156 |176 |185  |
PG     |194   |46 |58  |72  |85  |99  |113 |122 |137 |151 |168 |185  |
PG-13  |223   |46 |61  |76  |92  |110 |125 |138 |150 |162 |176 |185  |
R      |195   |49 |68  |82  |90  |104 |115 |129 |145 |160 |173 |185  |
NC-17  |210   |46 |58  |74  |84  |97  |112 |125 |138 |153 |174 |184  |
       |1000  |46 |60  |74  |86  |102 |114 |128 |142 |156 |173 |185  |

So, the GROUP BY clause produced one row per rating, and an additional grand total row at the bottom. For illustration purposes, I’ve added the COUNT(*) column, to show how many films are in each group. The first 5 rows sum up to 1000, which is again the grand total at the bottom.
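
The grand total row has an empty rating, which can be hard to tell apart from a genuinely NULL rating. One possible way to label it, sketched here in PostgreSQL syntax on a hypothetical three-row film table expression, is the GROUPING() function. The rating_label alias and the 'ALL' label are assumptions for this example.

```sql
WITH film (rating, length) AS (
  VALUES ('G', 50), ('G', 70), ('PG', 90)
)
SELECT
  -- grouping(rating) is 1 only on the row where ROLLUP aggregated rating away
  CASE WHEN grouping(rating) = 1 THEN 'ALL' ELSE rating END AS rating_label,
  count(*),
  percentile_disc(0.5) WITHIN GROUP (ORDER BY length) AS "50%"
FROM film
GROUP BY ROLLUP(rating)
ORDER BY grouping(rating), rating;  -- per-rating rows first, grand total last
```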

Let’s plot the percentiles now as line and bar charts:

We can “see” that there is no strong correlation between the two data points. Both data sets are close to uniformly distributed, quite independently of the rating, with the exception of PG-13, which is just slightly skewed towards longer film lengths.

Again, this isn’t terribly interesting as the data set was generated, probably using some randomness to avoid perfectly uniform distribution. In real world scenarios, the above data would have been more “skewed”.

How does this help with performance?

A balanced tree index is very useful when data is quite uniformly distributed, because in that case, it can help access data points or ranges of data in O(log(N)) time. This is quite a useful property for queries that look for film_id values, e.g.

SELECT *
FROM film
WHERE film_id = 1

When accessing “skewed” data, some values are more equal than others. This means that for example if we’re looking for amounts in the payment table, these two queries are not the same:

-- A lot of rows returned (3644)
SELECT * FROM payment WHERE amount BETWEEN 0 AND 2;

-- Few rows returned (361)
SELECT * FROM payment WHERE amount BETWEEN 9 AND 11;

An index on the amount column could have been useful for the second query, but maybe not for the first one.
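
Whether the optimiser agrees can be checked by comparing execution plans. The following is a sketch for PostgreSQL that assumes a Sakila payment table is installed; the index name i_payment_amount is hypothetical, and the plans will depend on your version, data, and statistics, so no expected output is shown.

```sql
-- Hypothetical index name; any name works
CREATE INDEX i_payment_amount ON payment (amount);

-- Refresh the statistics so the optimiser knows about the distribution
ANALYZE payment;

-- Compare the plans chosen for the frequent and the rare value range
EXPLAIN SELECT * FROM payment WHERE amount BETWEEN 0 AND 2;
EXPLAIN SELECT * FROM payment WHERE amount BETWEEN 9 AND 11;
```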

There are several things we can do to make sure optimal index usage is being applied for all sorts of queries. In case of uniformly distributed data, we usually don’t have to do anything as SQL developers. In case of “skewed” data sets, it may be worth thinking about:

  • Using histogram statistics
  • Hinting the optimiser (in Oracle or SQL Server)
  • Avoiding bind variables (only in extreme cases)
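
As an illustration of the first item, PostgreSQL exposes the statistics it has collected in the pg_stats view. The sketch below assumes a Sakila payment table in the current database. Other RDBMS keep this information in their own dictionary views.

```sql
-- Collect (or refresh) statistics first
ANALYZE payment;

-- Most common values, their frequencies, and the histogram bucket bounds
-- that the optimiser uses to estimate how many rows a predicate matches
SELECT most_common_vals, most_common_freqs, histogram_bounds
FROM pg_stats
WHERE tablename = 'payment'
  AND attname   = 'amount';
```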

Conclusion

Not all data sets are equal. They are often “skewed”. By “skewed”, in SQL, we don’t mean the statistical meaning of a normal distribution being skewed asymmetrically. We mean that a distribution is not uniform, so even a normal distribution is “skewed”. When it is, then some values appear way more often than others. Some examples are:

Uniform distribution

  • Surrogate keys generated from sequences (consecutive)
  • Surrogate keys generated from UUIDs (random)
  • Foreign keys on one-to-one relationships

Slight “skew”

Possibly significant “skew”

This really depends on the actual data set, but do expect significant “skew” in these data types

  • Foreign keys on to-many relationships (e.g. some customers have more assets than others)
  • Numeric values (e.g. amount)
  • Codes and other discrete values (e.g. film rating, payment settlement codes, etc.)

This article has shown how we can use simple SQL aggregate functions, including the percentiles, to calculate and visualise such “skew”.

How to Work Around ORA-38104: Columns referenced in the ON Clause cannot be updated

Standard SQL is a beautiful language. Vendor-specific implementations, however, have their warts. In Oracle, for example, a MERGE statement cannot update any column that is referenced by its ON clause. Consider this table:

CREATE TABLE person (
  id NUMBER(18) NOT NULL PRIMARY KEY,
  user_name VARCHAR2(50) NOT NULL UNIQUE,
  score NUMBER(18)
);

Now, in MySQL, we can run a non-standard INSERT .. ON DUPLICATE KEY UPDATE statement like this:

INSERT INTO person (id, user_name, score)
VALUES (1, 'foo', 100)
ON DUPLICATE KEY UPDATE
  user_name = 'foo', score = 100

Behind the scenes, MySQL will check all unique constraints for duplicates and reject the insert, replacing it by the update statement instead. It’s debatable whether this is really useful (ideally, we want to check only a single unique constraint for duplicates), but that’s what MySQL offers.

If we want to get the same behaviour in Oracle, we could use the MERGE statement:

MERGE INTO person t
USING (
  SELECT 1 id, 'foo' user_name, 100 score
  FROM dual
) s
ON (t.id = s.id OR t.user_name = s.user_name)
WHEN MATCHED THEN UPDATE
  SET t.user_name = s.user_name, t.score = 100
WHEN NOT MATCHED THEN INSERT (id, user_name, score)
  VALUES (s.id, s.user_name, s.score)

That looks reasonable, but it doesn’t work. We’ll get:

SQL Error: ORA-38104: Columns referenced in the ON Clause cannot be updated: "T"."USER_NAME"

Obviously, this is some protection against the situation where such an update would suddenly move a row from the matched to the not matched group. In this particular example, it might not look like something that could cause problems, but if vendor-specific extensions such as the WHERE or DELETE clause were used, things might look different.

However, the parser is not very smart. In fact, it is almost not smart at all. It does detect extremely silly attempts at circumventing this limitation, such as this:

MERGE INTO person t
USING (
  SELECT 1 id, 'foo' user_name, 100 score
  FROM dual
) s
-- Circumvention attempt here: NVL()
ON (t.id = s.id OR nvl(t.user_name, null) = s.user_name)
WHEN MATCHED THEN UPDATE
  SET t.user_name = s.user_name, t.score = 100
WHEN NOT MATCHED THEN INSERT (id, user_name, score)
  VALUES (s.id, s.user_name, s.score)

It does not detect any of these attempts:

Using row value expressions

MERGE INTO person t
USING (
  SELECT 1 id, 'foo' user_name, 100 score
  FROM dual
) s
ON (t.id = s.id OR 
-- Circumvention attempt here: row value expressions
  (t.user_name, 'dummy') = ((s.user_name, 'dummy')))
WHEN MATCHED THEN UPDATE
  SET t.user_name = s.user_name, t.score = 100
WHEN NOT MATCHED THEN INSERT (id, user_name, score)
  VALUES (s.id, s.user_name, s.score)

This works, seemingly without any penalty on the execution plan. Both indexes are being used:

---------------------------------------------------------------------------
| Id  | Operation                               | Name            | Rows  |
---------------------------------------------------------------------------
|   0 | MERGE STATEMENT                         |                 |     1 |
|   1 |  MERGE                                  | PERSON          |       |
|   2 |   VIEW                                  |                 |       |
|   3 |    NESTED LOOPS OUTER                   |                 |     1 |
|   4 |     FAST DUAL                           |                 |     1 |
|   5 |     VIEW                                | VW_LAT_8626BD41 |     1 |
|   6 |      TABLE ACCESS BY INDEX ROWID BATCHED| PERSON          |     1 |
|   7 |       BITMAP CONVERSION TO ROWIDS       |                 |       |
|   8 |        BITMAP OR                        |                 |       |
|   9 |         BITMAP CONVERSION FROM ROWIDS   |                 |       |
|* 10 |          INDEX RANGE SCAN               | SYS_C00106110   |       |
|  11 |         BITMAP CONVERSION FROM ROWIDS   |                 |       |
|* 12 |          INDEX RANGE SCAN               | SYS_C00106111   |       |
---------------------------------------------------------------------------

Correlated subquery

MERGE INTO person t
USING (
  SELECT 1 id, 'foo' user_name, 100 score
  FROM dual
) s
ON (t.id = s.id OR 
-- Circumvention attempt here: correlated subquery
  (SELECT t.user_name FROM dual) = s.user_name)
WHEN MATCHED THEN UPDATE
  SET t.user_name = s.user_name, t.score = 100
WHEN NOT MATCHED THEN INSERT (id, user_name, score)
  VALUES (s.id, s.user_name, s.score)

This seems to prevent any index usage, and should thus be avoided:

----------------------------------------------------------
| Id  | Operation              | Name            | Rows  |
----------------------------------------------------------
|   0 | MERGE STATEMENT        |                 |     1 |
|   1 |  MERGE                 | PERSON          |       |
|   2 |   VIEW                 |                 |       |
|   3 |    NESTED LOOPS OUTER  |                 |     1 |
|   4 |     FAST DUAL          |                 |     1 |
|   5 |     VIEW               | VW_LAT_1846A928 |     1 |
|*  6 |      FILTER            |                 |       |
|   7 |       TABLE ACCESS FULL| PERSON          |     1 |
|   8 |       FAST DUAL        |                 |     1 |
----------------------------------------------------------

Using NVL() and updating a view instead

Just plain simple usage of NVL() inside of the ON clause didn’t work before. The parser was smart enough to detect that. But it isn’t smart enough to detect NVL() inside of a view / derived table.

MERGE INTO (
  SELECT id, user_name, nvl(user_name, null) n, score
  FROM person
) t
USING (
  SELECT 1 id, 'foo' user_name, 100 score
  FROM dual
) s
-- Circumvention attempt here: renamed column
ON (t.id = s.id OR t.n = s.user_name)
WHEN MATCHED THEN UPDATE
  SET t.user_name = s.user_name, t.score = 100
WHEN NOT MATCHED THEN INSERT (id, user_name, score)
  VALUES (s.id, s.user_name, s.score)

Notice that both USER_NAME and N columns are the same thing, but the parser doesn’t notice this and thinks we’re fine.

The execution plan is still optimal, as Oracle seems to have a way to optimise NVL() expressions (but not coalesce and others!):

---------------------------------------------------------------------------
| Id  | Operation                               | Name            | Rows  |
---------------------------------------------------------------------------
|   0 | MERGE STATEMENT                         |                 |     1 |
|   1 |  MERGE                                  | PERSON          |       |
|   2 |   VIEW                                  |                 |       |
|   3 |    NESTED LOOPS OUTER                   |                 |     1 |
|   4 |     FAST DUAL                           |                 |     1 |
|   5 |     VIEW                                | VW_LAT_46651921 |     1 |
|   6 |      TABLE ACCESS BY INDEX ROWID BATCHED| PERSON          |     1 |
|   7 |       BITMAP CONVERSION TO ROWIDS       |                 |       |
|   8 |        BITMAP OR                        |                 |       |
|   9 |         BITMAP CONVERSION FROM ROWIDS   |                 |       |
|* 10 |          INDEX RANGE SCAN               | SYS_C00106110   |       |
|  11 |         BITMAP CONVERSION FROM ROWIDS   |                 |       |
|* 12 |          INDEX RANGE SCAN               | SYS_C00106111   |       |
---------------------------------------------------------------------------

Using the WHERE clause

If we hadn’t had an OR predicate in our ON clause, but an AND predicate, then we could have used the WHERE clause in Oracle. This works:

-- NOT the same query as the original one!
MERGE INTO person t
USING (
  SELECT 1 id, 'foo' user_name, 100 score
  FROM dual
) s
ON (t.id = s.id)
WHEN MATCHED THEN UPDATE
  SET t.user_name = s.user_name, t.score = 100
  WHERE t.user_name = s.user_name
WHEN NOT MATCHED THEN INSERT (id, user_name, score)
  VALUES (s.id, s.user_name, s.score);

This is not the same query as the original one. I just listed it here for completeness’ sake, and to remind readers that this approach doesn’t seem to use indexes optimally either. Only the primary key index (from the ON clause) seems to be used. The unique key index is not being used:

----------------------------------------------------------------
| Id  | Operation                      | Name          | Rows  |
----------------------------------------------------------------
|   0 | MERGE STATEMENT                |               |     1 |
|   1 |  MERGE                         | PERSON        |       |
|   2 |   VIEW                         |               |       |
|   3 |    NESTED LOOPS OUTER          |               |     1 |
|   4 |     VIEW                       |               |     1 |
|   5 |      FAST DUAL                 |               |     1 |
|   6 |     TABLE ACCESS BY INDEX ROWID| PERSON        |     1 |
|*  7 |      INDEX UNIQUE SCAN         | SYS_C00106110 |     1 |
----------------------------------------------------------------

Careful

Be careful when applying the above workarounds. Assuming that ORA-38104 is a good thing (i.e. that Oracle still thinks it should be enforced), then the above workarounds simply expose bugs in the parser, which should detect such cases. The above behaviour has been observed in Oracle 12c and 18c.

I personally believe that ORA-38104 should be abandoned entirely, and the root cause for this restriction should be removed. But it is certainly worth exploring alternative options rather than relying on the above workarounds in production code, apart from the occasional one-shot migration query, where such loopholes are always nice tools to exploit.

How to Aggregate an Archive Log’s Deltas into a Snapshot with SQL

A customer of my popular SQL training (which you should book!) has recently challenged me to optimise a hierarchical query that merges an archive log’s deltas in order to obtain a snapshot of some record at a given point in time. In this article, I will reproduce their problem statement in a simplified version and show how this can be done with SQL Server, using a few cool SQL features:

All of these are topics covered in the training, which were immediately applicable to this problem statement.

The problem statement

This was their archive design. They designed for uncertainty, meaning that for some entities in their system, they did not know what kinds of attributes would be part of the entity in the future. Given their application design, users could even add their own custom attributes to an entity.

This kind of thing is typically solved with the EAV (Entity Attribute Value) model, a “workaround” to denormalise data sets in SQL databases in the event of such schema uncertainty.

EAV can be implemented in several ways:

Through classic SQL tables only

An example implementation is this:

CREATE TABLE eav_classic (
  entity_type     VARCHAR (100) NOT NULL,
  entity_id       BIGINT        NOT NULL,
  attribute_name  VARCHAR (100) NOT NULL,
  attribute_type  VARCHAR (100) NOT NULL,
  attribute_value VARCHAR (100)     NULL,

  CONSTRAINT eav_classic_pk 
    PRIMARY KEY (entity_type, entity_id, attribute_name)
);

The drawbacks of this non-normalised design are immediately obvious. Most specifically, there is no simple way to establish referential integrity. But this may be totally OK, especially for archive logs, and for smaller databases (Datomic does something similar).

Through tables containing JSON or XML data

Whenever you have schema-on-read data, JSON or XML data types may be appropriate, so this is a perfectly valid alternative:

CREATE TABLE eav_json (
  entity_type     VARCHAR (100)   NOT NULL,
  entity_id       BIGINT          NOT NULL,
  attributes      VARCHAR (10000) NOT NULL 
    CHECK (ISJSON(attributes) = 1),

  CONSTRAINT eav_json_pk 
    PRIMARY KEY (entity_type, entity_id)
);

If your database supports a JSON data type, obviously, you will prefer that over the above emulation.

For the rest of this article, I will use the JSON representation.

Versioning the EAV table

Versioning data in an EAV model is much easier than in a normalised schema. We can just add a version number and/or timestamp to the record. In their case, something like this may make sense:

CREATE TABLE history (
  id          BIGINT IDENTITY (1, 1) NOT NULL PRIMARY KEY,
  ts          DATETIME               NOT NULL,
  entity_type VARCHAR(100)           NOT NULL,
  entity_id   BIGINT                 NOT NULL,
  delta       VARCHAR(8000)          NOT NULL 
    CHECK (ISJSON(delta) = 1)
);

INSERT INTO history (entity_type, entity_id, ts, delta)
VALUES ('Person', 1, '2000-01-01 00:00:00', '{"first_name": "John", "last_name": "Doe"}'),
       ('Person', 1, '2000-01-01 01:00:00', '{"age": 37}'),
       ('Person', 1, '2000-01-01 02:00:00', '{"age": 38}'),
       ('Person', 1, '2000-01-01 03:00:00', '{"city": "New York"}'),
       ('Person', 1, '2000-01-01 04:00:00', '{"city": "Zurich", "age": null}')
;

This table now contains a set of deltas applied to the Person entity with ID = 1. It corresponds to the following sequence of SQL statements on an ordinary entity:

INSERT INTO person (id, first_name, last_name) 
  VALUES (1, 'John', 'Doe');
UPDATE person SET age = 37 WHERE id = 1;
UPDATE person SET age = 38 WHERE id = 1;
UPDATE person SET city = 'New York' WHERE id = 1;
UPDATE person SET city = 'Zurich', age = null WHERE id = 1;

You could even see their hand-written log as a transaction log of the database system, kinda like what you can extract using products like GoldenGate or Debezium. If you think of the transaction log as an event stream, the RDBMS’s current data representation is like a snapshot that you can get when applying any number of deltas to your tables.

Sometimes, you don’t want to completely change your architecture and go full “event sourcing”, but just need this kind of log for a specific set of auditable entities. And e.g. for reasons like still supporting very old SQL Server versions, as well as supporting other databases, you may choose also not to use the SQL:2011 temporal table feature, which has also been implemented in SQL Server 2016 and more recent versions.

With that out of our way…

How to access any arbitrary snapshot version?

When we visually process our HISTORY table, we can see that Person ID = 1 had the following values at any given time:

TIME        FIRST_NAME    LAST_NAME    AGE    CITY
------------------------------------------------------
00:00:00    John          Doe
01:00:00    John          Doe          37
02:00:00    John          Doe          38
03:00:00    John          Doe          38     New York
04:00:00    John          Doe                 Zurich

Remember, this is always the same record of Person ID = 1, its snapshots represented at different times in the time axis. The goal here is to be able to find the record of John Doe at any given time.

Again, if we had been using the SQL:2011 temporal table feature, we could write

-- SQL Server
SELECT * 
FROM Person
FOR SYSTEM_TIME AS OF '2000-01-01 02:00:00.0000000'; 

-- Oracle (flashback query)
SELECT *
FROM Person
AS OF TIMESTAMP TIMESTAMP '2000-01-01 02:00:00'

Side note: Do note that Oracle’s flashback query needs to be properly configured:

  • Not all data is “flashbackable”
  • DDL tends to destroy the archive
  • Proper grants are needed to access the flashback archive

Similar limitations may apply in SQL Server.

What if the RDBMS can’t help us?

If again for some reason, we cannot use the RDBMS’s temporal table features, we’ll roll our own as we’ve seen. So, our query in SQL Server to access the snapshot at any given time may be this:

SELECT 
  '{' 
+ string_agg(
    CASE type WHEN 0 THEN NULL ELSE 
      '"' + [key] + '": ' + 
      CASE type WHEN 1 THEN '"' + value + '"' ELSE value END
    END, ', ') 
+ '}'
FROM (
  SELECT *, row_number() OVER (
    PARTITION BY [key] ORDER BY ts DESC) rn
  FROM history
  OUTER APPLY openjson(delta)
  
  -- Apply all deltas prior to any given snapshot
  WHERE ts <= '2000-01-01 02:00:00'
) t
WHERE rn = 1;

What does this query do? Consider again our deltas at 04:00:00:

TIME        FIRST_NAME    LAST_NAME    AGE    CITY
------------------------------------------------------
00:00:00    John          Doe
01:00:00    John          Doe          37
02:00:00    John          Doe          38
03:00:00    John          Doe          38     New York
04:00:00    John          Doe          -      Zurich

Observe how each value has some color encoding:

  • Strong, red: The current snapshot’s attribute value, when the last delta was applied to any given attribute
  • Strong, black: A previous snapshot’s attribute value, when a previous, superseded delta was applied to any given attribute
  • Light grey: A previous snapshot’s attribute value that was inherited from another previous delta

For any given snapshot, we want to find the Strong, red values. E.g. at a previous snapshot time, the color encoding would have been:

At 03:00:00

TIME        FIRST_NAME    LAST_NAME    AGE    CITY
------------------------------------------------------
00:00:00    John          Doe
01:00:00    John          Doe          37
02:00:00    John          Doe          38
03:00:00    John          Doe          38     New York

04:00:00    John          Doe          -      Zurich

At 02:00:00

TIME        FIRST_NAME    LAST_NAME    AGE    CITY
------------------------------------------------------
00:00:00    John          Doe
01:00:00    John          Doe          37
02:00:00    John          Doe          38

03:00:00    John          Doe          38     New York
04:00:00    John          Doe          -      Zurich

So, our query needs to find the delta that was applied last for any given attribute.
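
Before turning to SQL, the idea can be sketched procedurally. The following Python snippet is only an illustration under simplifying assumptions (a single entity, and the hypothetical helper name `snapshot_at`). It applies all deltas up to a given time in chronological order, so that the last delta per attribute wins:

```python
import json

# (ts, delta) pairs mirroring the HISTORY rows of Person ID = 1
history = [
    ("2000-01-01 00:00:00", '{"first_name": "John", "last_name": "Doe"}'),
    ("2000-01-01 01:00:00", '{"age": 37}'),
    ("2000-01-01 02:00:00", '{"age": 38}'),
    ("2000-01-01 03:00:00", '{"city": "New York"}'),
    ("2000-01-01 04:00:00", '{"city": "Zurich", "age": null}'),
]


def snapshot_at(history, ts):
    """Return the snapshot at ts: per attribute, the value of the last
    delta applied at or before ts. Attributes set to null are dropped."""
    latest = {}
    # ISO-formatted timestamps sort correctly as strings
    for delta_ts, delta in sorted(history):
        if delta_ts <= ts:
            latest.update(json.loads(delta))  # later deltas win
    return {k: v for k, v in latest.items() if v is not None}


print(snapshot_at(history, "2000-01-01 02:00:00"))
print(snapshot_at(history, "2000-01-01 04:00:00"))
```

The SQL solution arrives at the same result declaratively, by ranking the deltas per attribute instead of replaying them one by one.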

With SQL, we can find that easily. We can assign a row number to each delta per attribute in reverse order, something like this:

TIME        FIRST_NAME    LAST_NAME    AGE    CITY
------------------------------------------------------
00:00:00    John (1)      Doe (1)
01:00:00    John          Doe          37 (3)
02:00:00    John          Doe          38 (2)
03:00:00    John          Doe          38     New York (2)
04:00:00    John          Doe          - (1)  Zurich (1)

Once we have that row number, we just keep only those deltas whose row number is 1. First, let’s calculate the row numbers:

SELECT [key], value, row_number() OVER (
  PARTITION BY [key] ORDER BY ts DESC) rn
FROM history OUTER APPLY openjson(delta)
ORDER BY [key], ts;

Notice the OUTER APPLY openjson(delta) syntax. This just expands the JSON structure into key/value/type columns, which we can use more easily in a SQL query. Other database systems may have similar syntax for similar purposes. The result of the above query is:

key        |value    |rn 
-----------|---------|---
age        |37       |3  
age        |38       |2  
age        |         |1  
city       |New York |2  
city       |Zurich   |1  
first_name |John     |1  
last_name  |Doe      |1  

Filtering the ones whose row number is 1:

SELECT [key], value
FROM (
  SELECT ts, [key], value, row_number() OVER (
    PARTITION BY [key] ORDER BY ts DESC) rn
  FROM history OUTER APPLY openjson(delta)
) t
WHERE rn = 1
ORDER BY ts, [key]

This yields:

key        |value  
-----------|-------
first_name |John   
last_name  |Doe    
age        |       
city       |Zurich 

Exactly the data we wanted, in key/value form. Notice that this filtering step could have been done with DISTINCT ON in PostgreSQL, or with KEEP (DENSE_RANK FIRST ORDER BY ..) in Oracle – an exercise which I shall leave to the reader (feel free to leave the solution in the comments!)

And now, finally, just re-assemble the JSON using SQL Server 2017 STRING_AGG. PostgreSQL would offer us JSON_AGG here, Oracle has JSON_OBJECTAGG. With STRING_AGG, you have to take care of manually escaping all values according to JSON syntax rules, which is bad. In my example, I just replaced " by \". Other characters need escaping too, so if there is a built-in feature, use that instead of string processing.

The STRING_AGG function aggregates a CASE expression which translates different JSON data types into different formats, where:

  • 0 is NULL (and nulls are not aggregated)
  • 1 is string
  • everything else can be taken at its value for simplicity, e.g. numbers or booleans

Every value (except nulls) is prefixed by the JSON object’s attribute name (“key”).

SELECT 
  '{' 
+ string_agg(
    CASE type WHEN 0 THEN NULL ELSE 
      '"' + replace([key], '"', '\"') + '": ' + 
      CASE type WHEN 1 THEN '"' + replace(value, '"', '\"') + '"' ELSE value END
    END, ', ') 
+ '}'
FROM (
  SELECT *, row_number() OVER (
    PARTITION BY [key] ORDER BY ts DESC) rn
  FROM history
  OUTER APPLY openjson(delta)
  
  -- Apply all deltas prior to any given snapshot
  WHERE ts <= '2000-01-01 04:00:00'
) t
WHERE rn = 1;

This produces

{"city": "Zurich", "first_name": "John", "last_name": "Doe"}
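
The CASE logic inside STRING_AGG, and the limits of the manual escaping, can be illustrated outside of the database. The following Python sketch uses hypothetical helper names (`render_pair`, `assemble`) and made-up sample rows (the `visits` and `path` keys are not part of the article's data). It mimics the string concatenation and compares it with a built-in JSON serialiser:

```python
import json


def render_pair(key, value, type_code):
    """Mimic the CASE expression: type 0 (null) is skipped, type 1
    (string) is quoted, everything else is emitted as-is."""
    if type_code == 0:
        return None
    escaped_key = key.replace('"', '\\"')
    if type_code == 1:
        value = '"' + value.replace('"', '\\"') + '"'
    return '"' + escaped_key + '": ' + value


def assemble(rows):
    """Mimic '{' + STRING_AGG(.., ', ') + '}', which ignores NULLs."""
    parts = [render_pair(*row) for row in rows]
    return "{" + ", ".join(p for p in parts if p is not None) + "}"


rows = [("age", None, 0), ("city", "Zurich", 1), ("visits", "3", 2)]
print(assemble(rows))

# A backslash is not escaped by the simple replace, so the hand-made
# string is not valid JSON, whereas json.dumps handles it correctly
tricky = [("path", "C:\\data", 1)]
try:
    json.loads(assemble(tricky))
except json.JSONDecodeError:
    print("invalid JSON:", assemble(tricky))
print(json.dumps({"path": "C:\\data"}))
```

This is why a built-in JSON aggregation function is preferable whenever the database offers one.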

A final query, that gets us the entire history of snapshots (watch the performance on this one, could definitely be optimised):

SELECT ts, (
  SELECT 
    '{' 
  + string_agg(
      CASE type WHEN 0 THEN NULL ELSE 
        '"' + replace([key], '"', '\"') + '": ' + 
        CASE type WHEN 1 THEN '"' + replace(value, '"', '\"') + '"' ELSE value END
      END, ', ') 
  + '}'
  FROM (
    SELECT *, row_number() OVER (
      PARTITION BY [key] ORDER BY ts DESC) rn
    FROM history
    OUTER APPLY openjson(delta)
    
    -- Apply all deltas prior to any given snapshot
    WHERE ts <= x.ts
  ) t
  WHERE rn = 1
)
FROM history x
GROUP BY ts;

It yields:

ts       |                                                                          
---------|--------------------------------------------------------------------------
00:00:00 |{"first_name": "John", "last_name": "Doe"}                                
01:00:00 |{"age": 37, "first_name": "John", "last_name": "Doe"}                     
02:00:00 |{"age": 38, "first_name": "John", "last_name": "Doe"}                     
03:00:00 |{"age": 38, "city": "New York", "first_name": "John", "last_name": "Doe"} 
04:00:00 |{"city": "Zurich", "first_name": "John", "last_name": "Doe"}              

So, this is the complete history of all the snapshot versions of the Person with ID = 1.

Very cool, and definitely good enough for their archive / audit query requirements.
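
As a possible direction for optimising the last query, the snapshots can also be computed incrementally, carrying the previous snapshot forward rather than recomputing each one from all earlier deltas. The following Python sketch only illustrates that idea. The function name `all_snapshots` is hypothetical, and the sketch assumes a single entity with distinct timestamps:

```python
import json

# (ts, delta) pairs mirroring the HISTORY rows of Person ID = 1
history = [
    ("2000-01-01 00:00:00", '{"first_name": "John", "last_name": "Doe"}'),
    ("2000-01-01 01:00:00", '{"age": 37}'),
    ("2000-01-01 02:00:00", '{"age": 38}'),
    ("2000-01-01 03:00:00", '{"city": "New York"}'),
    ("2000-01-01 04:00:00", '{"city": "Zurich", "age": null}'),
]


def all_snapshots(history):
    """Yield (ts, snapshot) for every delta, carrying the running state
    forward instead of recomputing each snapshot from scratch."""
    state = {}
    for ts, delta in sorted(history):
        state.update(json.loads(delta))
        # Attributes currently set to null are not part of the snapshot
        yield ts, {k: v for k, v in state.items() if v is not None}


for ts, snapshot in all_snapshots(history):
    print(ts[11:], json.dumps(snapshot, sort_keys=True))
```

Each delta is parsed exactly once here, whereas the correlated subquery re-reads all earlier deltas for every timestamp.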

How to Write a Multiplication Aggregate Function in SQL

Everyone knows the SQL SUM() aggregate function (and many people also know its window function variant).

When querying the Sakila database, we can get the daily revenue (using PostgreSQL syntax):

WITH p AS (
  SELECT
    CAST (payment_date AS DATE) AS date,
    amount
  FROM payment
)
SELECT
  date,
  SUM (amount) AS daily_revenue,
  SUM (SUM (amount)) OVER (ORDER BY date) AS cumulative_revenue
FROM p
GROUP BY date
ORDER BY date

The result will look something like this:

date       |daily_revenue |cumulative_revenue 
-----------|--------------|-------------------
2005-05-24 |29.92         |29.92              
2005-05-25 |573.63        |603.55             
2005-05-26 |754.26        |1357.81            
2005-05-27 |685.33        |2043.14            
2005-05-28 |804.04        |2847.18            
2005-05-29 |648.46        |3495.64            
2005-05-30 |628.42        |4124.06            
2005-05-31 |700.37        |4824.43            
2005-06-14 |57.84         |4882.27            
...

Doing the same with multiplication

This is already quite useful. Very occasionally, however, we do not need to aggregate multiple values in a sum (through addition), but in a product (through multiplication). I’ve just stumbled upon such a case on Stack Overflow, recently.

The question wanted to achieve the following result:

date        factor          accumulated
---------------------------------------
1986-01-10  null            1000
1986-01-13  -0.026595745    973.4042548
1986-01-14  0.005464481     978.7234036
1986-01-15  -0.016304348    962.7659569
1986-01-16  0               962.7659569
1986-01-17  0               962.7659569
1986-01-20  0               962.7659569
1986-01-21  0.005524862     968.0851061
1986-01-22  -0.005494506    962.765957
1986-01-23  0               962.765957
1986-01-24  -0.005524862    957.4468078
1986-01-27  0.005555556     962.7659569
1986-01-28  0               962.7659569
1986-01-29  0               962.7659569
1986-01-30  0               962.7659569
1986-01-31  0.027624309     989.3617013
1986-02-03  0.016129032     1005.319148
1986-02-04  0.042328041     1047.872338
1986-02-05  0.04568528      1095.744679

If this were a Microsoft Excel spreadsheet, the ACCUMULATED column would simply start with 1000 and have the following formula in all other rows:

accumulated(i) = accumulated(i - 1) * (1 + factor)

In other words (values truncated for simplicity):

1000.0 = start
 973.4 = 1000.0 * (1 - 0.026)
 978.7 =  973.4 * (1 + 0.005)
 962.7 =  978.7 * (1 - 0.016)
 962.7 =  962.7 * (1 - 0.000)
 962.7 =  962.7 * (1 - 0.000)
 962.7 =  962.7 * (1 - 0.000)
 968.0 =  962.7 * (1 + 0.005)
 ...

This is exciting because we’re not only requiring multiplicative aggregation, but even cumulative multiplicative aggregation. So, another window function.
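
For reference, this is what the spreadsheet formula computes, sketched in Python with the first few factors from the table above. The function name `accumulated` is a hypothetical helper, and the missing factor of the first row is treated as 0, so that the series starts at 1000:

```python
from itertools import accumulate
from operator import mul

# The first few factors from the table above; None marks the starting row
factors = [None, -0.026595745, 0.005464481, -0.016304348, 0, 0, 0, 0.005524862]


def accumulated(factors, start=1000.0):
    """Running product of (1 + factor), scaled by the start value."""
    multipliers = (1 + (f or 0) for f in factors)
    return [start * m for m in accumulate(multipliers, mul)]


for value in accumulated(factors):
    print(round(value, 4))
```

The running product is exactly the part that a cumulative MUL() window function would have to provide in SQL.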

But regrettably, SQL doesn’t offer a MUL() aggregate function, even though it would be relatively simple to implement. We have two options:

  • Implementing a custom aggregate function (stay tuned for a future blog post)
  • Using a trick by summing logarithms, rather than multiplying operands directly

We’re implementing the latter for now. Check out this cool Wikipedia website about logarithmic identities, which we are going to blindly trust. In the middle of it, we have:

$b^x \cdot b^y = b^{x + y}$

Which leads to:

$\log_b(x \cdot y) = \log_b(x) + \log_b(y)$

How cool is that? And thus:

$x \cdot y = b^{\log_b(x) + \log_b(y)}$

So, we can define any multiplication in terms of a bunch of exponentiation to some base (say e) and logarithms to some base (say e). Or, in SQL:

x * y = EXP(LN(x) + LN(y))

Or, as an aggregate function:

MUL(x) = EXP(SUM(LN(x)))

Heh!
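
The identity is easy to check numerically. Here is a minimal Python sketch, where `mul` is a hypothetical helper name and the inputs are assumed to be strictly positive:

```python
import math


def mul(values):
    """Product of strictly positive numbers as EXP(SUM(LN(x)))."""
    return math.exp(sum(math.log(x) for x in values))


print(mul([2, 3, 4]))        # close to 24, up to floating point rounding
print(math.prod([2, 3, 4]))  # exact reference: 24
```

Because the calculation goes through floating point logarithms, the result is only approximately equal to the exact product.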

Our original problem can thus be solved very easily using this, as shown in my Stack Overflow answer:

SELECT
  date,
  factor,
  1000 * EXP(SUM(LN(1 + COALESCE(factor, 0)))
       OVER (ORDER BY date)) AS accumulated
FROM t

And we get the nice result as previously shown. You may have to replace LN() by LOG() depending on your database.

Caveat: Negative numbers

Try running this:

SELECT LN(-1)

You’ll get:

SQL Error [2201E]: ERROR: cannot take logarithm of a negative number

Logarithms are defined only for strictly positive numbers, unless your database is capable of handling complex numbers as well. Even in that case, a single zero value would still break the aggregation.

But if your data set is defined to contain only strictly positive numbers, you’ll be fine – give or take some floating point rounding errors. Or, you’ll do some sign handling, which looks like this:

WITH v(i) AS (VALUES (-2), (-3), (-4))
SELECT 
  CASE 
    WHEN SUM (CASE WHEN i < 0 THEN -1 END) % 2 < 0 
    THEN -1 
    ELSE 1 
  END * EXP(SUM(LN(ABS(i)))) multiplication1
FROM v;

WITH v(i) AS (VALUES (-2), (-3), (-4), (-5))
SELECT 
  CASE 
    WHEN SUM (CASE WHEN i < 0 THEN -1 END) % 2 < 0 
    THEN -1 
    ELSE 1 
  END * EXP(SUM(LN(ABS(i)))) multiplication2
FROM v;

The above yielding

multiplication1      
--------------------
-23.999999999999993 


multiplication2     
-------------------
119.99999999999997 

Close enough.

Caveat: Zero

Try running this:

SELECT LN(0)

You’ll get:

SQL Error [2201E]: ERROR: cannot take logarithm of zero

Zero is different from negative numbers. A product that has a zero operand is always zero, so we should be able to handle this. We’ll do it in two steps:

  • Exclude zero values from the actual aggregation that uses EXP() and LN()
  • Add an additional CASE expression that checks if any of the operands is zero

The first step might not be necessary depending on how your database optimiser executes the second step.

WITH v(i) AS (VALUES (2), (3), (0))
SELECT 
  CASE 
    WHEN SUM (CASE WHEN i = 0 THEN 1 END) > 0
    THEN 0
    WHEN SUM (CASE WHEN i < 0 THEN -1 END) % 2 < 0 
    THEN -1 
    ELSE 1 
  END * EXP(SUM(LN(ABS(NULLIF(i, 0))))) multiplication
FROM v;
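
The same three-way logic (zero check, sign handling, logarithms of absolute values) can be sketched in Python to verify the approach. The function name `product` is hypothetical, and unlike the SQL version, this sketch returns 1 rather than NULL for an empty input:

```python
import math


def product(values):
    """Emulate MUL(): a zero short-circuits, the sign is derived from the
    parity of negative operands, magnitudes go through EXP(SUM(LN()))."""
    if any(v == 0 for v in values):
        return 0.0
    negatives = sum(1 for v in values if v < 0)
    sign = -1 if negatives % 2 == 1 else 1
    return sign * math.exp(sum(math.log(abs(v)) for v in values))


print(product([-2, -3, -4]))      # negative, three negative operands
print(product([-2, -3, -4, -5]))  # positive, four negative operands
print(product([2, 3, 0]))         # zero, because of the zero operand
```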

Extension: DISTINCT

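Calculating the product of all DISTINCT values requires deduplicating the operands in 2 out of the above 3 aggregations. Note that the sign check has to count the distinct negative values, because a SUM(DISTINCT ..) over a constant -1 could never yield anything other than a single -1:

-- Caveat: i and -i collapse into a single operand here,
-- as LN(ABS(..)) is identical for both
WITH v(i) AS (VALUES (2), (3), (3))
SELECT 
  CASE 
    WHEN SUM (CASE WHEN i = 0 THEN 1 END) > 0
    THEN 0
    WHEN COUNT (DISTINCT CASE WHEN i < 0 THEN i END) % 2 = 1 
    THEN -1 
    ELSE 1 
  END * EXP(SUM(DISTINCT LN(ABS(NULLIF(i, 0))))) multiplication
FROM v;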

The result is now:

multiplication |
---------------|
6              |

Notice that the first SUM(), which checks for the presence of zero values, doesn’t require a DISTINCT keyword, so we omit it to improve performance.
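
For comparison, here is a Python sketch of the DISTINCT variant that deduplicates the operands themselves before doing any sign or logarithm handling. The function name `product_distinct` is hypothetical. In SQL, the equivalent would be to aggregate over a derived table that selects the distinct values first:

```python
import math


def product_distinct(values):
    """Product of the distinct operands, via EXP(SUM(LN(ABS(x))))."""
    operands = set(values)  # deduplicate the operands themselves
    if 0 in operands:
        return 0.0
    negatives = sum(1 for v in operands if v < 0)
    sign = -1 if negatives % 2 == 1 else 1
    return sign * math.exp(sum(math.log(abs(v)) for v in operands))


print(product_distinct([2, 3, 3]))   # 2 * 3
print(product_distinct([2, -2, 3]))  # 2 * -2 * 3, both signs are kept
```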

Extension: Window functions

Of course, if we are able to emulate a PRODUCT() aggregate function, we’d love to turn it into a window function as well. This can be done simply by transforming each individual SUM() into a window function:

WITH v(i, j) AS (
  VALUES (1, 2), (2, -3), (3, 4), 
         (4, -5), (5, 0), (6, 0)
)
SELECT i, j, 
  CASE 
    WHEN SUM (CASE WHEN j = 0 THEN 1 END) 
      OVER (ORDER BY i) > 0
    THEN 0
    WHEN SUM (CASE WHEN j < 0 THEN -1 END) 
      OVER (ORDER BY i) % 2 < 0 
    THEN -1 
    ELSE 1 
  END * EXP(SUM(LN(ABS(NULLIF(j, 0)))) 
    OVER (ORDER BY i)) multiplication
FROM v;

The result is now:

i |j  |multiplication      |
--|---|--------------------|
1 | 2 |2                   |
2 |-3 |-6                  |
3 | 4 |-23.999999999999993 |
4 |-5 |119.99999999999997  |
5 | 0 |0                   |
6 | 0 |0                   |

So cool! The cumulative product gets bigger and bigger (in absolute terms) until it hits the first zero. From then on, it stays zero.
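
The window function version corresponds to keeping three running aggregates while iterating over the rows in order. Here is a Python sketch of that idea, using the hypothetical function name `running_product` and the j values from the query above:

```python
import math


def running_product(values):
    """Cumulative product, one result per input row, in input order."""
    results = []
    log_sum, negatives, zeros = 0.0, 0, 0
    for v in values:
        if v == 0:
            zeros += 1  # excluded from the logarithm sum
        else:
            log_sum += math.log(abs(v))
            if v < 0:
                negatives += 1
        if zeros > 0:
            results.append(0.0)
        else:
            sign = -1 if negatives % 2 == 1 else 1
            results.append(sign * math.exp(log_sum))
    return results


print(running_product([2, -3, 4, -5, 0, 0]))
```

Each of the three counters plays the role of one SUM(..) OVER (ORDER BY i) in the SQL query.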

jOOQ support

jOOQ 3.12 will support this as well and emulate it correctly on all databases:
https://github.com/jOOQ/jOOQ/issues/5939

A note on Oracle performance

Do note that Oracle is very slow to calculate LN(number_type). It can be MUCH faster to calculate LN(binary_double_type) instead. An explicit type cast produced a 100x performance improvement in a simple test, documented here.