A fun report to write is one that calculates a cumulative percentage. For example, when querying the Sakila database, we might want to calculate the percentage of our total revenue earned by any given date.

The result might look like this, plotted as a chart of cumulative revenue percentage over time. Or as raw data:

```
payment_date |amount  |percentage
-------------|--------|----------
2005-05-24   |29.92   |0.04
2005-05-25   |573.63  |0.90
2005-05-26   |754.26  |2.01
2005-05-27   |685.33  |3.03
2005-05-28   |804.04  |4.22
2005-05-29   |648.46  |5.19
2005-05-30   |628.42  |6.12
2005-05-31   |700.37  |7.16
...
2005-08-18   |2710.79 |79.59
2005-08-19   |2615.72 |83.47
2005-08-20   |2723.76 |87.51
2005-08-21   |2809.41 |91.67
2005-08-22   |2576.74 |95.49
2005-08-23   |2523.01 |99.24
2005-08-24   |514.18  |100.00
```

In other words, at the beginning of our timeline, we’ve made 0% revenue, and then that percentage increases over time, until we reach 100% of our revenue at the end of our timeline.

### How to do it?

We’re going to do it in two steps. Our `PAYMENT` table has a `PAYMENT_DATE` column, which is really a timestamp, i.e. the exact moment in time when we received a payment. We can query the table to see its data (I will be using PostgreSQL syntax in this post):

```sql
SELECT payment_date, amount
FROM payment
ORDER BY payment_date;
```

This yields:

```
payment_date        |amount
--------------------|------
2005-05-24 22:53:30 |2.99
2005-05-24 22:54:33 |2.99
2005-05-24 23:03:39 |3.99
2005-05-24 23:04:41 |4.99
2005-05-24 23:05:21 |6.99
2005-05-24 23:08:07 |0.99
2005-05-24 23:11:53 |1.99
2005-05-24 23:31:46 |4.99
2005-05-25 00:00:40 |4.99
2005-05-25 00:02:21 |5.99
2005-05-25 00:09:02 |8.99
2005-05-25 00:19:27 |4.99
2005-05-25 00:22:55 |6.99
...
```

Now we could calculate that percentage on this timeline, but that wouldn’t be terribly interesting. We’re interested in the cumulative revenue *per date*, so let’s run a classic `GROUP BY`:

```sql
SELECT CAST(payment_date AS DATE), sum(amount) AS amount
FROM payment
GROUP BY CAST(payment_date AS DATE)
ORDER BY CAST(payment_date AS DATE);
```

This yields the first two columns of our desired result:

```
payment_date |amount
-------------|-------
2005-05-24   |29.92
2005-05-25   |573.63
2005-05-26   |754.26
2005-05-27   |685.33
2005-05-28   |804.04
2005-05-29   |648.46
2005-05-30   |628.42
2005-05-31   |700.37
...
2005-08-18   |2710.79
2005-08-19   |2615.72
2005-08-20   |2723.76
2005-08-21   |2809.41
2005-08-22   |2576.74
2005-08-23   |2523.01
2005-08-24   |514.18
```

Now about that percentage. The formula in pseudo SQL is this:

```
cumulative_percentage[N] = SUM(amount[M <= N]) / SUM(amount[any])
```

In other words, the percentage of the revenue we’ve made up until a given day is equal to the SUM of all amounts until that day divided by the SUM of all amounts. We could do that relatively easily in Microsoft Excel. But we can also do it with SQL, using window functions. The syntax is:
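The pseudo-SQL formula can be sketched in a few lines of Python. The daily figures below are the first three from the result above; since the total here covers only these three days, the resulting percentages differ from the post’s full result:

```python
# Cumulative percentage: SUM(amount[M <= N]) / SUM(amount[any]).
# Daily totals: the first three days from the result above.
daily = [("2005-05-24", 29.92), ("2005-05-25", 573.63), ("2005-05-26", 754.26)]

total = sum(amount for _, amount in daily)   # SUM(amount[any])
running = 0.0
rows = []
for date, amount in daily:
    running += amount                        # SUM(amount[M <= N])
    rows.append((date, amount, round(100 * running / total, 2)))

for row in rows:
    print(row)
```

The last row always reaches 100.00, because by then the running sum equals the grand total.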

```sql
-- Sum of all amounts until that day:
SUM(amount) OVER (ORDER BY payment_date)

-- Sum of all amounts:
SUM(amount) OVER ()
```

So, let’s just plug that into our SQL. For simplicity, we’ll first nest our previous `GROUP BY` statement in a derived table:

```sql
SELECT
  payment_date,
  amount,
  CAST(100 * sum(amount) OVER (ORDER BY payment_date)
           / sum(amount) OVER () AS numeric(10, 2)) percentage
FROM (
  SELECT CAST(payment_date AS DATE), sum(amount) AS amount
  FROM payment
  GROUP BY CAST(payment_date AS DATE)
) p
ORDER BY payment_date;
```

Running this yields the desired result:

```
payment_date |amount  |percentage
-------------|--------|----------
2005-05-24   |29.92   |0.04
2005-05-25   |573.63  |0.90
2005-05-26   |754.26  |2.01
2005-05-27   |685.33  |3.03
2005-05-28   |804.04  |4.22
2005-05-29   |648.46  |5.19
2005-05-30   |628.42  |6.12
2005-05-31   |700.37  |7.16
...
2005-08-18   |2710.79 |79.59
2005-08-19   |2615.72 |83.47
2005-08-20   |2723.76 |87.51
2005-08-21   |2809.41 |91.67
2005-08-22   |2576.74 |95.49
2005-08-23   |2523.01 |99.24
2005-08-24   |514.18  |100.00
```
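If you don’t have a PostgreSQL instance with Sakila at hand, the same technique can be tried in a few lines of Python with SQLite (version 3.25+ for window function support). The sample payments below are made up, and SQLite’s `DATE()` function plays the role of `CAST(... AS DATE)`:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE payment (payment_date TEXT, amount REAL)")
# A handful of made-up payments across three days:
con.executemany("INSERT INTO payment VALUES (?, ?)", [
    ("2005-05-24 22:53:30", 2.99),
    ("2005-05-24 23:04:41", 4.99),
    ("2005-05-25 00:02:21", 5.99),
    ("2005-05-26 01:10:10", 6.99),
])

rows = con.execute("""
    SELECT payment_date,
           amount,
           ROUND(100 * SUM(amount) OVER (ORDER BY payment_date)
                     / SUM(amount) OVER (), 2) AS percentage
    FROM (
      SELECT DATE(payment_date) AS payment_date,
             SUM(amount) AS amount
      FROM payment
      GROUP BY DATE(payment_date)
    ) p
    ORDER BY payment_date
""").fetchall()

for row in rows:
    print(row)   # one row per day; the last percentage is 100.0
```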

### Bonus: Nest aggregate functions in window functions

Because of the nature of SQL syntax, and the fact that both `GROUP BY` and aggregate functions “happen before” window functions, i.e. they are calculated logically before window functions, we can nest aggregate functions in window functions.

This definitely doesn’t drastically improve readability, especially if you are not used to writing window functions every day. But in some more complex cases, it might help to shorten your SQL syntax. The above query is equivalent to this one:

```sql
SELECT
  CAST(payment_date AS DATE) AS payment_date,
  sum(amount) AS amount,
  CAST(100 * sum(sum(amount)) OVER (ORDER BY CAST(payment_date AS DATE))
           / sum(sum(amount)) OVER () AS numeric(10, 2)) percentage
FROM payment
GROUP BY CAST(payment_date AS DATE)
ORDER BY CAST(payment_date AS DATE);
```

Beauty is in the eye of the beholder. My eye definitely likes this `sum(sum(amount)) OVER ()` syntax. If you cannot decipher this, don’t worry. You’re not alone. I invite you to review the following post on the order of SQL operations, first.

Very good post, Lukas.

Window functions are one of the best things I’ve discovered in SQL. That’s an amazing feature!!

But could you help me understand what is happening? I mean…

I got a little bit confused by the use of `ORDER BY` instead of `PARTITION BY` in this code: `... OVER (ORDER BY payment_date)`. In my noob understanding, `... OVER (ORDER BY payment_date)` would give me the same result as `... OVER ()`, except with the values of `payment_date` sorted (which does not change the final result). So, using `PARTITION BY` instead, would the query give me the same result?

No, why would it give you the same result? `PARTITION BY` is for grouping, `ORDER BY` is for sorting. And the trick is that in the absence of a frame clause but presence of an `ORDER BY` clause, you get `RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`.

More details here ;-)

https://www.jooq.org/training

Or, you can try it. Download the Sakila database: https://www.jooq.org/sakila, or run this query on any “payment” or “transaction” table you may have, to see what’s going on.

a-ha!!! That’s what I was trying to understand, and you answered it.

Thanks again, Lukas!

Yep. It’s a bit weird, but it does make sense as a default, if you think about it. The default in the absence of order by is range between unbounded preceding and unbounded following, which also makes sense.
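That frame default can be observed directly. Here is a minimal sketch in Python with SQLite (3.25+, which follows the same standard defaults; table and numbers are made up): the ordered window sums up to the current row, while the empty window always sums the whole partition.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (d TEXT, amount REAL)")
con.executemany("INSERT INTO t VALUES (?, ?)", [
    ("2005-05-24", 10.0),
    ("2005-05-25", 20.0),
    ("2005-05-26", 30.0),
])

rows = con.execute("""
    SELECT d,
           -- ORDER BY but no frame clause: implicitly
           -- RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           SUM(amount) OVER (ORDER BY d) AS running,
           -- no ORDER BY: implicitly
           -- RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           SUM(amount) OVER ()           AS total
    FROM t
    ORDER BY d
""").fetchall()

for row in rows:
    print(row)   # running grows per row; total is constant
```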