I stumbled upon
a very interesting jOOQ question on Stack Overflow that required the calculation of a
weighted average. Why is that.
Problem description
Assuming you have this database (using PostgreSQL syntax):
create table transactions (
id bigint not null primary key,
lines bigint not null,
price numeric(18, 2) not null,
profit numeric(18, 2) not null
);
create table lines (
id bigint not null primary key,
transaction_id bigint not null references transactions,
total bigint not null,
quantity bigint not null,
profit numeric(18, 2) not null
);
As can be seen, this schema is slightly denormalised as the number of lines per transaction are precalculated in the
transactions.lines
column. This will turn out to be quite useful for this calculation, but it isn’t strictly necessary.
Now,
in the previously linked Stack Overflow question, a report was desired that would calculate:
- An aggregation of sums as provided by the line items
- An aggregation of averages as provided by the transactions
This would be straightforward with two separate queries:
Sums provided by the line items
SELECT
sum(profit) AS total_profit,
sum(total) AS total_sales_amount,
sum(quantity) AS total_items_sold
FROM lines
Averages provided by the transactions
SELECT
avg(lines) AS avg_items_p_trx,
avg(price) AS avg_price_p_trx,
avg(profit) AS avg_profit_p_trx
FROM transactions
So far so good.
Doing it in one query
Now, these queries are simplified from the original, which needed to join the two tables in order to add additional predicates. Also, let’s assume that these tables are quite large, so running two queries might lead to the report being too slow. A single query would be much better.
We might be attempted to simply combined the two:
-- Wrong query
SELECT
sum(l.profit) AS total_profit,
sum(l.total) AS total_sales_amount,
sum(l.quantity) AS total_items_sold,
avg(t.lines) AS avg_items_p_trx,
avg(t.price) AS avg_price_p_trx,
avg(t.profit) AS avg_profit_p_trx
FROM lines AS l
JOIN transactions AS t ON t.id = l.transaction_id
But this query is wrong. While the sums are still correct, the averages are not, simply because the join produces duplicate transaction rows per lines. Imagine a transaction having 3 or 5 lines:
SELECT
l.id AS line_id,
t.id AS transaction_id,
t.lines,
t.price
FROM lines AS l
JOIN transactions AS t ON t.id = l.transaction_id
The output would be:
LINE_ID TRANSACTION_ID LINES PRICE
-------------------------------------------
1 1 3 20.00
2 1 3 20.00
3 1 3 20.00
4 2 5 100.00
4 2 5 100.00
4 2 5 100.00
4 2 5 100.00
4 2 5 100.00
- The average number of lines “avg_items_p_trx” should be 4 = (3 lines + 5 lines) / 2 transactions. But if we calculate
avg(t.lines)
over the entire data set, we get 4.25 (3×3 lines + 5×5 lines) / 8 items.
- The average price “avg_price_p_trx” should be 60.00 = (20.00 + 100.00) / 2 transactions. But if we calculate
avg(t.price)
over the entire data set, we get 80.00 (3×20.00 + 5×100.00) / 8 items.
How can this be fixed?
Given that each transaction is duplicated because of the join with lines, we have to calculate a
weighted average, not an ordinary average. The idea is that instead of using the
AVG()
aggregate function, we now have to divide the value we want to get an average of by the number of items (i.e. the number of times the value is repeated because of the join), and then divide the sum of that division by the number of transactions.
Prose never describes logic well, so let’s use code. The correct query is:
SELECT
sum(l.profit) AS total_profit,
sum(l.total) AS total_sales_amount,
sum(l.quantity) AS total_items_sold,
sum(t.lines / t.lines) / count(DISTINCT t.id) avg_items_p_trx,
sum(t.price / t.lines) / count(DISTINCT t.id) avg_price_p_trx,
sum(t.profit / t.lines) / count(DISTINCT t.id) avg_profit_p_trx
FROM lines AS l
JOIN transactions AS t ON t.id = l.transaction_id
With the above data set:
LINE_ID TRANSACTION_ID LINES LINES/LINES PRICE PRICE/LINES
----------------------------------------------------------------
1 1 3 1 20.00 6.66
2 1 3 1 20.00 6.66
3 1 3 1 20.00 6.66
4 2 5 1 100.00 20.00
4 2 5 1 100.00 20.00
4 2 5 1 100.00 20.00
4 2 5 1 100.00 20.00
4 2 5 1 100.00 20.00
We now get the correct weighted averages:
- The average number of lines “avg_items_p_trx” is now 4 =
(3/3 + 3/3 + 3/3 + 5/5 + 5/5 + 5/5 + 5/5 + 5/5) / distinct transactions
- The average price “avg_price_p_trx” is now 60.00 =
(20.00/3 + 20.00/3 + 20.00/3 + 100.00/5 + 100.00/5 + 100.00/5 + 100.00/5 + 100.00/5) / 2 distinct transactions
Note that “avg_items_p_trx” can be simplified:
SELECT
sum(l.profit) AS total_profit,
sum(l.total) AS total_sales_amount,
sum(l.quantity) AS total_items_sold,
count(*) / count(DISTINCT t.id) avg_items_p_trx,
sum(t.price / t.lines) / count(DISTINCT t.id) avg_price_p_trx,
sum(t.profit / t.lines) / count(DISTINCT t.id) avg_profit_p_trx
FROM lines AS l
JOIN transactions AS t ON t.id = l.transaction_id
Done!
Normalised version
Notice that this solution profited from the fact that the number of lines per transaction was pre-calculated. We can of course also calculate it on the fly, e.g. using window functions. If it weren’t available, we could do it like this:
SELECT
sum(l.profit) AS total_profit,
sum(l.total) AS total_sales_amount,
sum(l.quantity) AS total_items_sold,
count(*) / count(DISTINCT t.id) avg_items_p_trx,
sum(t.price / l.lines) / count(DISTINCT t.id) avg_price_p_trx,
sum(t.profit / l.lines) / count(DISTINCT t.id) avg_profit_p_trx
FROM (
SELECT
l.*,
count(*) OVER (PARTITION BY l.transaction_id) lines
FROM lines AS l
) AS l
JOIN transactions AS t ON t.id = l.transaction_id
Or, we turn the entire join into a 1:1 relationship by pre-aggregating all the data from lines into one row per transaction. This works because we only calculate sums from the lines table:
SELECT
sum(l.profit_per_transaction) AS total_profit,
sum(l.total_per_transaction) AS total_sales_amount,
sum(l.quantity_per_transaction) AS total_items_sold,
avg(l.lines_per_transaction) AS avg_items_p_trx,
avg(t.price) AS avg_price_p_trx,
avg(t.profit) AS avg_profit_p_trx
FROM (
SELECT
l.transaction_id
sum(l.profit) AS profit_per_transaction,
sum(l.total) AS total_per_transaction,
sum(l.quantity) AS quantity_per_transaction,
count(*) AS lines_per_transaction
FROM lines AS l
GROUP BY l.transaction_id
) AS l
JOIN transactions AS t ON t.id = l.transaction_id
Like this:
Like Loading...