NoSQL? No, SQL! – How to Calculate Running Totals


If anything at all, our jOOQ talks at various JUGs and conferences have revealed mostly one thing:

Java developers don’t know SQL.

And it isn’t even necessarily our fault. We’re just not exposed to SQL nowadays.

But consider this: We developers (or our customers) are paying millions of dollars every year to Oracle, Microsoft, IBM, SAP for their excellent RDBMS, only to ignore 90% of their database features and to perform a little bit of CRUD and a little bit of ACID with ORMs like Hibernate. We have forgotten about why those RDBMS were so expensive in the first place. We haven’t paid attention to the various improvements to the SQL standards, including SQL:1999, SQL:2003, SQL:2008, and the recent SQL:2011, which are all mostly left unsupported by JPA.

If only we Java developers knew how easy and how much fun it can be to replace thousands of lines of erroneous Java code with five lines of SQL. Don’t believe it? Check this out:

Promoting SQL, the Language

Instead of simply promoting jOOQ, we started to help Java developers appreciate actual SQL, regardless of the access pattern that they’re using. Because true SQL can be appreciated through any of these APIs:

jOOQ: The best way to use Oracle AQ in Java

How the above APIs can be leveraged in Java 8 can be seen here.

And believe us, most developers were astonished by what was possible in SQL, when they saw our NoSQL? No, SQL! talk:

Calculating a Running Total

So let’s delve into the essence of the talk and calculate a running total with SQL. What’s a running total? It’s easy. Imagine you have these bank account transaction data in your database:

| ID   | VALUE_DATE | AMOUNT |
|------|------------|--------|
| 9997 | 2014-03-18 |  99.17 |
| 9981 | 2014-03-16 |  71.44 |
| 9979 | 2014-03-16 | -94.60 |
| 9977 | 2014-03-16 |  -6.96 |
| 9971 | 2014-03-15 | -65.95 |

You’ll notice immediately, that the balance per account transaction is missing. Yes, we want to calculate that balance as such:

| ID   | VALUE_DATE | AMOUNT |  BALANCE |
|------|------------|--------|----------|
| 9997 | 2014-03-18 |  99.17 | 19985.81 |
| 9981 | 2014-03-16 |  71.44 | 19886.64 |
| 9979 | 2014-03-16 | -94.60 | 19815.20 |
| 9977 | 2014-03-16 |  -6.96 | 19909.80 |
| 9971 | 2014-03-15 | -65.95 | 19916.76 |

If we’re assuming that we know the current balance on the bank account, we can use each account transaction’s AMOUNT value and subtract it from that current balance. Alternatively, we could assume an initial balance of zero and add up all the AMOUNT value till today. This is illustrated here:

| ID   | VALUE_DATE | AMOUNT |  BALANCE |
|------|------------|--------|----------|
| 9997 | 2014-03-18 |  99.17 | 19985.81 |
| 9981 | 2014-03-16 | +71.44 |=19886.64 | n
| 9979 | 2014-03-16 | -94.60 |+19815.20 | n + 1
| 9977 | 2014-03-16 |  -6.96 | 19909.80 |
| 9971 | 2014-03-15 | -65.95 | 19916.76 |

Each transaction’s balance can be calculated through either one of these formulas:

BALANCE(ROWn) = BALANCE(ROWn+1) + AMOUNT(ROWn)
BALANCE(ROWn+1) = BALANCE(ROWn) – AMOUNT(ROWn)

So, that’s a running total. Easy, right?

But how can we do it in SQL?

Most of us would probably pull out a little Java programme out of their sleeves, keeping all the amounts in memory, writing unit tests, fixing all sorts of bugs (we’re not mathematicians, after all), wrestling with BigDecimals, etc. Few of us would probably go through the hassle of doing the same in PL/SQL or T-SQL, or whatever other procedural language you have at disposition, and possibly update each balance directly into the table when inserting / updating new transactions.

But as you might have guessed so far, the solution we’re looking for here is a solution in SQL. Please bear with us as we’re going through the examples. They’re getting better and better. And if you want to play around with the examples, download Oracle XE and the running-totals.sql script and get up and running!

What we’ve learned from college / SQL-92 would probably involve a…

Using Nested SELECT

Let’s assume that we have a view like v_transactions, which already joins the accounts table to the account transactions table in order to access the current_balance. Here’s how we would write this query, then:

SELECT
  t1.*,
  t1.current_balance - (
    SELECT NVL(SUM(amount), 0)
    FROM v_transactions t2
    WHERE t2.account_id = t1.account_id
    AND  (t2.value_date, t2.id) >
         (t1.value_date, t1.id)
  ) AS balance
FROM     v_transactions t1
WHERE    t1.account_id = 1
ORDER BY t1.value_date DESC, t1.id DESC

Notice how the nested SELECT uses row value expression predicates to express the filtering criteria. If your database doesn’t support the SQL standard row value expression predicates (and you’re not using jOOQ to emulate them), you can factor them out yourself to form this equivalent query, instead:

SELECT
  t1.*,
  t1.current_balance - (
    SELECT NVL(SUM(amount), 0)
    FROM v_transactions t2
    WHERE t2.account_id = t1.account_id
    AND ((t2.value_date > t1.value_date) OR
         (t2.value_date = t1.value_date AND
          t2.id         > t1.id))
  ) AS balance
FROM     v_transactions t1
WHERE    t1.account_id = 1
ORDER BY t1.value_date DESC, t1.id DESC

So in essence, for any given account transaction, your nested SELECT simply fetches the sum of all AMOUNT values for account transactions that are more recent than the currently projected account transaction.

| ID   | VALUE_DATE |  AMOUNT |  BALANCE |
|------|------------|---------|----------|
| 9997 | 2014-03-18 | -(99.17)|+19985.81 |
| 9981 | 2014-03-16 | -(71.44)| 19886.64 |
| 9979 | 2014-03-16 |-(-94.60)| 19815.20 |
| 9977 | 2014-03-16 |   -6.96 |=19909.80 |
| 9971 | 2014-03-15 |  -65.95 | 19916.76 |

Does it perform?

Execution plan for nested SELECT

Execution plan for nested SELECT

Nope. As you can see, for the relatively simple sample data set (only 1101 records filtered from account_id = 1 in line 9), there is an INDEX RANGE SCAN materialising a whopping total of 1212K rows in memory. This looks like we have O(n2) complexity. I.e. a very naïve algorithm is being applied.

(and don’t think that 770ms is fast for this trivial query!)

While you could probably tune this query slightly, we should still feel that Oracle should be able to devise an O(n) algorithm for this simple task.

Using Recursive SQL

No one enjoys writing recursive SQL. No one. Let me convince you.

For simplicity, we’re assuming that we also have a TRANSACTION_NR column enumerating transactions in their sort order, which can be used to simplify recursion:

| ID   | VALUE_DATE | AMOUNT | TRANSACTION_NR |
|------|------------|--------|----------------|
| 9997 | 2014-03-18 |  99.17 |              1 |
| 9981 | 2014-03-16 |  71.44 |              2 |
| 9979 | 2014-03-16 | -94.60 |              3 |
| 9977 | 2014-03-16 |  -6.96 |              4 |
| 9971 | 2014-03-15 | -65.95 |              5 |

Ready? Check out this gorgeous piece of SQL!

WITH ordered_with_balance (
  account_id, value_date, amount, 
  balance, transaction_number
)
AS (
  SELECT t1.account_id, t1.value_date, t1.amount, 
         t1.current_balance, t1.transaction_number
  FROM   v_transactions_by_time t1
  WHERE  t1.transaction_number = 1

  UNION ALL

  SELECT t1.account_id, t1.value_date, t1.amount, 
         t2.balance - t2.amount, t1.transaction_number
  FROM   ordered_with_balance t2
  JOIN   v_transactions_by_time t1
  ON     t1.transaction_number = 
         t2.transaction_number + 1
  AND    t1.account_id = t2.account_id
)
SELECT   *
FROM     ordered_with_balance
WHERE    account_id= 1
ORDER BY transaction_number ASC

Ach… How to read this beauty?

Essentially, we’re self-joining the view (common table expression) that we’re about to declare:

WITH ordered_with_balance (
  account_id, value_date, amount, 
  balance, transaction_number
)
AS (
  SELECT t1.account_id, t1.value_date, t1.amount, 
         t1.current_balance, t1.transaction_number
  FROM   v_transactions_by_time t1
  WHERE  t1.transaction_number = 1

  UNION ALL

  SELECT t1.account_id, t1.value_date, t1.amount, 
         t2.balance - t2.amount, t1.transaction_number
  FROM   ordered_with_balance t2
  JOIN   v_transactions_by_time t1
  ON     t1.transaction_number = 
         t2.transaction_number + 1
  AND    t1.account_id = t2.account_id
)
SELECT   *
FROM     ordered_with_balance
WHERE    account_id= 1
ORDER BY transaction_number ASC

In the first subselect of the UNION ALL expression, we’re projecting the current_balance of the account, only for the first transaction_number.

In the second subselect of the UNION ALL expression, we’re projecting the difference of the balance of the previous account transaction and the AMOUNT of the current account transaction.

WITH ordered_with_balance (
  account_id, value_date, amount, 
  balance, transaction_number
)
AS (
  SELECT t1.account_id, t1.value_date, t1.amount, 
         t1.current_balance, t1.transaction_number
  FROM   v_transactions_by_time t1
  WHERE  t1.transaction_number = 1

  UNION ALL

  SELECT t1.account_id, t1.value_date, t1.amount, 
         t2.balance - t2.amount, t1.transaction_number
  FROM   ordered_with_balance t2
  JOIN   v_transactions_by_time t1
  ON     t1.transaction_number = 
         t2.transaction_number + 1
  AND    t1.account_id = t2.account_id
)
SELECT   *
FROM     ordered_with_balance
WHERE    account_id= 1
ORDER BY transaction_number ASC

And because we’re recursing into the ordered_with_balance common table expression, this will continue until we reach the “last” transaction.

Now let’s make an educated guess, whether this performs well…

Execution plan for recursive SQL

Execution plan for recursive SQL

Well. It doesn’t. We get even more rows in memory, namely 11M rows for what should be at most 1101. Parts of this plan are due to the fact that the TRANSACTION_NUMBER utility column is another calculated column that couldn’t be optimised by Oracle. But the essence here is the fact that it is already very hard to get it right, it’s even harder to get it fast.

Using Window Functions

So, we’ve suffered enough. Let’s hear some good news.

tweet thisThere is SQL before window functions, and there is SQL after window functions
- Dimitri Fontaine in this great post

The best solution for this problem is this one:

SELECT
  t.*,
  t.current_balance - NVL(
    SUM(t.amount) OVER (
      PARTITION BY t.account_id
      ORDER BY     t.value_date DESC,
                   t.id         DESC
      ROWS BETWEEN UNBOUNDED PRECEDING
           AND     1         PRECEDING
    ),
  0) AS balance
FROM     v_transactions t
WHERE    t.account_id = 1
ORDER BY t.value_date DESC,
         t.id         DESC

Essentially, we’re doing exactly the same thing as with the nested SELECT. We’re subtracting the SUM() of all AMOUNT values “over” the subset of rows that is:

  • in the same PARTITION as the current row (i.e. has the same account_id)
  • ordered by the same ordering criteria as the account transactions (from the outer query)
  • positioned strictly before the current row in the sense of the above ordering

Or, again, visually:

| ID   | VALUE_DATE |  AMOUNT |  BALANCE |
|------|------------|---------|----------|
| 9997 | 2014-03-18 | -(99.17)|+19985.81 |
| 9981 | 2014-03-16 | -(71.44)| 19886.64 |
| 9979 | 2014-03-16 |-(-94.60)| 19815.20 |
| 9977 | 2014-03-16 |   -6.96 |=19909.80 |
| 9971 | 2014-03-15 |  -65.95 | 19916.76 |

And now, does this perform?

Execution plan for window functions

Execution plan for window functions

Hallelujah!

It couldn’t be much faster! Window functions are probably the most underestimated SQL feature.

Using the Oracle MODEL Clause

Now, this is more of a special treat for those SQL nerds among you who want to piss off your fellow developers with eerie, weird-looking SQL. The MODEL clause (only available in Oracle).

SELECT account_id, value_date, amount, balance
FROM (
  SELECT id, account_id, value_date, amount,
         current_balance AS balance
  FROM   v_transactions
) t
WHERE account_id = 1
MODEL
  PARTITION BY (account_id)
  DIMENSION BY (
    ROW_NUMBER() OVER (
      ORDER BY value_date DESC, id DESC
    ) AS rn
  )
  MEASURES (value_date, amount, balance)
  RULES (
    balance[rn > 1] = balance[cv(rn) - 1] 
                    - amount [cv(rn) - 1]
  )
ORDER BY rn ASC

Now, how to read this beast? We’re taking the sample data and transforming it to:

  • be PARTITION‘ed by the usual criteria
  • be DIMENSION‘ed along the sort order, i.e. the transaction row number
  • be MEASURE‘ed, i.e. to provide calculated values for date, amount, balance (where date and amount remain untouched, original data)
  • be calculated according to RULES, which define the balance of each transaction (except for the first one) to be the balance of the previous transaction minus the amount of the previous transaction

Still too abstract? I know. But think about it this way:

Does it remind you of something?

Does it remind you of something?

MS Excel! Every time you have a problem that your project manager thinks is peanuts to solve with his fancy MS Excel spreadsheets, then the MODEL clause is your friend!

And does it perform?

Execution plan for the MODEL clause

Execution plan for the MODEL clause

… pretty much so (although the above shouldn’t be confused with an actual benchmark).

If you haven’t seen enough, see another great use-case for the MODEL clause here. And for all the details, consider reading Oracle’s official MODEL clause whitepaper

Do it yourself

Did you like the above examples? Don’t worry. When you’ve seen these things for the first time, they can be pretty confusing. But in essence, they’re really not that complicated. And once you have those features in your tool-chain, you will be indefinitely more productive than if you had to write out all those algorithms in an imperative programming language.

If you want to play around with the examples, download Oracle XE and the running-totals.sql script and get up and running!

Conclusion

At Data Geekery, we always say:

SQL is a device whose mystery is only exceeded by its power

And sometimes, we also quote Winston Churchill for having said something along the lines of:

tweet thisSQL is the worst form of database querying, except for all the other forms

jOOQ: The best way to use Oracle AQ in Java

Indeed, as the MODEL clause has shown us, SQL can become very extreme. But once you know the tricks and expressions (and most importantly, window functions), you’ll be incredibly more productive with SQL than with many other technologies, if the task at hand is a bulk calculation operation on a simple or complex data set. And your query is often faster than if you had hand-written it, at least when you’re using a decent database.

So let’s put SQL to action in our software!

Are you interested in hosting our NoSQL? No, SQL! talk at your local JUG or as an in-house presentation? Contact us, we’re more than happy to help you improve your SQL skills!

Tags: , , , , , , , ,

8 responses to “NoSQL? No, SQL! – How to Calculate Running Totals”

  1. Maaartinus says :

    To me it rather looks like there’s something wrong with SQL when it forces us to write such monsters. A readable syntax could be

    FROM v_transactions AS t1
    WHERE t1.account_id = 1
    ORDER BY t1.value_date DESC, t1.id DESC
    SELECT t1.*, RUNNING SUM(amount)

    Sure, it’s no SQL, but any programmer would see what’s going on (`RUNNING` means computed over all rows seen so far).

    I can see that both window function and `MODEL` are more powerful, but it looks like their power is more then offset by their mystery and verbosity.

    • lukaseder says :

      SQL is a beast, sure. However, don’t underestimate the power of window functions. Introducing just a simple RUNNING keyword is not enough. You would still need to be able to specify:

      • partitions
      • alternative order specifications
      • frames

      These three aspects of window functions are incredibly useful. Every alternative syntax will inevitably become as “complex” as the existing one, if you want to accommodate the existing feature set.

      I strongly suggest reading this great piece to get more insight: http://tapoueh.org/blog/2013/08/20-Window-Functions.

      I can see that both window function and `MODEL` are more powerful, but it looks like their power is more then offset by their mystery and verbosity.

      The wonders of declarative programming… For these very reasons, no declarative (but not functional) programming language other than SQL has ever met mainstream. Excessive usage of Java annotations shows how complex declarative programming can get, if the domain is not very well-defined as in the case of SQL.

      • Maaartinus says :

        Thank you for the link, it explains a lot. But there must a *different* reason why SQL is considered unintelligible (by many including myself). It reads like a mathematical proof with permuted equations or a GOTO-dominated program. That’s just a feeling I can’t describe precisely.

        • lukaseder says :

          GOTO? I think GOTO is at the extreme opposite of the spectrum. SQL is simply a rather sophisticated declarative language, which is unusual in a world where most people write imperative programs (with GOTO, if they could).

  2. Adrianos Dadis says :

    Great post :)

    You are right, (most) Java developers do not know SQL (and do not like it too).

    In my experience (working mostly with Java), there are special cases that Java is NOT the right place to do your calculations. If working with relational databases then SQL is a strong player. I can admit that some times our DBA have have save my life with complex queries like the above.

    In my opinion, when working with Java and relational DB then for complex queries you have to always try it using SQL or ask for an advice by an SQL/DBA expert.

    Regards,
    Adrianos.

    • lukaseder says :

      Thanks for the shout! Yes, having a good DBA on the team, or a developer who likes both worlds ( ;-) ) is a winning strategy in many cases.

  3. Per Persson says :

    In MySQL you can use variables and do things like:

    SET @balance := 12345.67;

    SELECT ID, VALUE_DATE, AMOUNT, (@balance := @balance + AMOUNT) AS BALANCE
    FROM …

    • lukaseder says :

      That’s clever, thanks for sharing. It works for this case, but using window functions allows for describing a relation that effectively contains the balance as a column. That relation can be used in derived tables, correlated subqueries, etc. In fact, I guess a solution using an external variable even relies on the ORDER BY clause remaining unmodified.

      As soon as you leave the relational / SQL level to work with a procedural level, expessivity is lowered considerably.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 1,955 other followers

%d bloggers like this: