How to Calculate a Cumulative Percentage in SQL

A fun report to write is to calculate a cumulative percentage. For example, when querying the Sakila database, we might want to calculate the percentage of our total revenue at any given date.

The result might look like this:

Notice the beautifully generated data. Or as raw data:

payment_date |amount  |percentage
-------------|--------|----------
2005-05-24   |29.92   |0.04      
2005-05-25   |573.63  |0.90      
2005-05-26   |754.26  |2.01      
2005-05-27   |685.33  |3.03      
2005-05-28   |804.04  |4.22      
2005-05-29   |648.46  |5.19      
2005-05-30   |628.42  |6.12      
2005-05-31   |700.37  |7.16      
...
2005-08-18   |2710.79 |79.59     
2005-08-19   |2615.72 |83.47     
2005-08-20   |2723.76 |87.51     
2005-08-21   |2809.41 |91.67     
2005-08-22   |2576.74 |95.49     
2005-08-23   |2523.01 |99.24     
2005-08-24   |514.18  |100.00    

In other words, at the beginning of our timeline, we’ve made 0% revenue, and then that percentage increases over time, until we reach 100% of our revenue at the end of our timeline.

How to do it?

We’re going to do it in two steps. Our PAYMENT table has a PAYMENT_DATE column, which is really a timestamp, i.e. the exact amount in time when we received a payment. We can query the table to see its data (I will be using PostgreSQL syntax in this post):

SELECT
  payment_date,
  amount
FROM payment
ORDER BY payment_date;

This yields:

payment_date        |amount
--------------------|------
2005-05-24 22:53:30 |2.99  
2005-05-24 22:54:33 |2.99  
2005-05-24 23:03:39 |3.99  
2005-05-24 23:04:41 |4.99  
2005-05-24 23:05:21 |6.99  
2005-05-24 23:08:07 |0.99  
2005-05-24 23:11:53 |1.99  
2005-05-24 23:31:46 |4.99  
2005-05-25 00:00:40 |4.99  
2005-05-25 00:02:21 |5.99  
2005-05-25 00:09:02 |8.99  
2005-05-25 00:19:27 |4.99  
2005-05-25 00:22:55 |6.99  
...

Now we could calculate that percentage on this timeline, but that wouldn’t be terribly interesting. We’re interested in the cumulative revenue per date, so let’s run a classic GROUP BY:

SELECT 
  CAST(payment_date AS DATE),
  sum(amount) AS amount
FROM payment
GROUP BY CAST(payment_date AS DATE)
ORDER BY CAST(payment_date AS DATE);

This yields the first two columns of our desired result:

payment_date |amount 
-------------|-------
2005-05-24   |29.92  
2005-05-25   |573.63 
2005-05-26   |754.26 
2005-05-27   |685.33 
2005-05-28   |804.04 
2005-05-29   |648.46 
2005-05-30   |628.42 
2005-05-31   |700.37 
...
2005-08-18   |2710.79
2005-08-19   |2615.72
2005-08-20   |2723.76
2005-08-21   |2809.41
2005-08-22   |2576.74
2005-08-23   |2523.01
2005-08-24   |514.18 

Now about that percentage. The formula in pseudo SQL is this:

cumulative_percentage[N] = SUM(amount[M <= N]) / SUM(amount[any])

In other words, the percentage of the revenue we’ve made up until a given day is equal to the SUM of all amounts until that day divided by the SUM of all amounts. We could do that relatively easily in Microsoft Excel. But we can also do it with SQL, using window functions. The syntax is:

-- Sum of all amounts until that day:
SUM(amount) OVER (ORDER BY payment_date)

-- Sum of all amounts
SUM(amount) OVER ()

So, let’s just plug that into our SQL. For simplicity, we’ll first nest our previous GROUP BY statement in a derived table:

SELECT 
  payment_date,
  amount,
  CAST(100 * sum(amount) OVER (ORDER BY payment_date) 
           / sum(amount) OVER () AS numeric(10, 2)) percentage
FROM (
  SELECT 
    CAST(payment_date AS DATE),
    sum(amount) AS amount
  FROM payment
  GROUP BY CAST(payment_date AS DATE)
) p
ORDER BY payment_date;

Running this yields the desired result:

payment_date |amount  |percentage
-------------|--------|----------
2005-05-24   |29.92   |0.04      
2005-05-25   |573.63  |0.90      
2005-05-26   |754.26  |2.01      
2005-05-27   |685.33  |3.03      
2005-05-28   |804.04  |4.22      
2005-05-29   |648.46  |5.19      
2005-05-30   |628.42  |6.12      
2005-05-31   |700.37  |7.16      
...
2005-08-18   |2710.79 |79.59     
2005-08-19   |2615.72 |83.47     
2005-08-20   |2723.76 |87.51     
2005-08-21   |2809.41 |91.67     
2005-08-22   |2576.74 |95.49     
2005-08-23   |2523.01 |99.24     
2005-08-24   |514.18  |100.00    

Bonus: Nest aggregate functions in window functions

Because of the nature of SQL syntax, and the fact that both GROUP BY and aggregate functions “happen before” window functions, i.e. they are calculated logically before window functions, we can nest aggregate functions in window functions.

This definitely doesn’t drastically improve readability, especially if you are not used to writing window functions every day. But in some more complex cases, it might help to shorten your SQL syntax. The above query is equivalent to this one:

SELECT 
  CAST(payment_date AS DATE) AS payment_date,
  sum(amount) AS amount,
  CAST(100 * sum(sum(amount)) OVER (
               ORDER BY CAST(payment_date AS DATE)) 
           / sum(sum(amount)) OVER () AS numeric(10, 2)) percentage
FROM payment
GROUP BY CAST(payment_date AS DATE)
ORDER BY CAST(payment_date AS DATE);

Beauty is in the eye of the beholder. My eye definitely likes this sum(sum(amount)) OVER () syntax. If you cannot decipher this, don’t worry. You’re not alone. I invite you to review the following post on the order of SQL operations, first.

Lesser Known jOOλ Features: Useful Collectors

jOOλ is our second most popular library. It implements a set of useful extensions to the JDK’s Stream API, which are useful especially when streams are sequential only, which according to our assumptions is how most people use streams in Java.

Such extensions include:

// (1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, ...)
Seq.of(1, 2, 3).cycle();

// tuple((1, 2, 3), (1, 2, 3))
Seq.of(1, 2, 3).duplicate();

// (1, 0, 2, 0, 3, 0, 4)
Seq.of(1, 2, 3, 4).intersperse(0);

// (4, 3, 2, 1)
Seq.of(1, 2, 3, 4).reverse();

… and many more.

Collectors

But that’s not the only thing jOOλ offers. It also ships with a set of useful Collectors, which can be used both with JDK streams, as well as with jOOλ’s Seq type. Most of them are available from the org.jooq.lambda.Agg type, where Agg stands for aggregations.

Just like the rest of jOOλ, these collectors are inspired by SQL, and you will find quite a few SQL aggregate functions represented in this class.

Here are some of these collectors:

Counting

While the JDK has Collectors.counting(), jOOλ also has a way to count distinct values, just like SQL:

// A simple wrapper for two values:
class A {
    final String s;
    final long l;
    A(String s, long l) {
        this.s = s;
        this.l = l;
    }

    static A A(String s, long l) {
        return new A(s, l);
    }
}

@Test
public void testCount() {
    assertEquals(7L, (long) 
        Stream.of(1, 2, 3, 3, 4, 4, 5)
              .collect(Agg.count()));
    assertEquals(5L, (long) 
        Stream.of(1, 2, 3, 3, 4, 4, 5)
              .collect(Agg.countDistinct()));
    assertEquals(5L, (long) 
        Stream.of(A("a", 1), 
                  A("b", 2), 
                  A("c", 3), 
                  A("d", 3), 
                  A("e", 4), 
                  A("f", 4), 
                  A("g", 5))
              .collect(Agg.countDistinctBy(a -> a.l)));
    assertEquals(7L, (long) 
        Stream.of(A("a", 1),
                  A("b", 2), 
                  A("c", 3), 
                  A("d", 3), 
                  A("e", 4), 
                  A("f", 4), 
                  A("g", 5))
              .collect(Agg.countDistinctBy(a -> a.s)));
}

These are pretty self explanatory, I think.

Percentiles

Just recently, I’ve blogged about the usefulness of SQL’s percentile functions, and how to emulate them if they’re unavailable.

Percentiles can also be nicely calculated on streams. Why not? As soon as a Stream’s contents implements Comparable, or if you supply your custom Comparator, percentiles are easy to calculate:

// Assuming a static import of Agg.percentile:
assertEquals(
    Optional.empty(), 
    Stream.<Integer> of().collect(percentile(0.25)));
assertEquals(
    Optional.of(1), 
    Stream.of(1).collect(percentile(0.25)));
assertEquals(
    Optional.of(1), 
    Stream.of(1, 2).collect(percentile(0.25)));
assertEquals(
    Optional.of(1), 
    Stream.of(1, 2, 3).collect(percentile(0.25)));
assertEquals(
    Optional.of(1), 
    Stream.of(1, 2, 3, 4).collect(percentile(0.25)));
assertEquals(
    Optional.of(2), 
    Stream.of(1, 2, 3, 4, 10).collect(percentile(0.25)));
assertEquals(
    Optional.of(2), 
    Stream.of(1, 2, 3, 4, 10, 9).collect(percentile(0.25)));
assertEquals(
    Optional.of(2), 
    Stream.of(1, 2, 3, 4, 10, 9, 3).collect(percentile(0.25)));
assertEquals(
    Optional.of(2), 
    Stream.of(1, 2, 3, 4, 10, 9, 3, 3).collect(percentile(0.25)));
assertEquals(
    Optional.of(3), 
    Stream.of(1, 2, 3, 4, 10, 9, 3, 3, 20).collect(percentile(0.25)));
assertEquals(
    Optional.of(3), 
    Stream.of(1, 2, 3, 4, 10, 9, 3, 3, 20, 21).collect(percentile(0.25)));
assertEquals(
    Optional.of(3), 
    Stream.of(1, 2, 3, 4, 10, 9, 3, 3, 20, 21, 22).collect(percentile(0.25)));

Notice that jOOλ implements SQL’s percentile_disc semantics. Also, there are 3 “special” percentiles that deserve their own names:

A variety of overloads allows for calculating:

  • The percentile of the values contained in the stream
  • The percentile of the values contained in the stream, if sorted by another value mapped by a function
  • The percentile of the values mapped to another value by a function

Mode

Speaking of statistics. What about the mode? I.e. the value that appears the most often in a stream? Easy, with Agg.mode()

assertEquals(
    Optional.of(1), 
    Stream.of(1, 1, 1, 2, 3, 4).collect(Agg.mode()));
assertEquals(
    Optional.of(1), 
    Stream.of(1, 1, 2, 2, 3, 4).collect(Agg.mode()));
assertEquals(
    Optional.of(2), 
    Stream.of(1, 1, 2, 2, 2, 4).collect(Agg.mode()));

Other useful collectors

Other collectors that can be useful occasionally are:

Combine the aggregations

And one last important feature when working with jOOλ is the capability of combining aggregations, just like in SQL. Following the examples above, I can easily calculate several percentiles in one go:

// Unfortunately, Java's type inference might need
// a little help here
var percentiles =
Stream.of(1, 2, 3, 4, 10, 9, 3, 3).collect(
  Tuple.collectors(
    Agg.<Integer>percentile(0.0),
    Agg.<Integer>percentile(0.25),
    Agg.<Integer>percentile(0.5),
    Agg.<Integer>percentile(0.75),
    Agg.<Integer>percentile(1.0)
  )
);

System.out.println(percentiles);

The result being:

(Optional[1], Optional[2], Optional[3], Optional[4], Optional[10])