These two SQL standard language features are among the most powerful SQL features that are available from most other databases. I frequently include them in conference talks about SQL (see my article about 10 SQL Tricks That You Didn’t Think Were Possible), and as well in the Data Geekery SQL Masterclass. With MySQL 8.0 now supporting these exciting features, the masterclass will be including MySQL as well (along with Oracle, SQL Server, PostgreSQL, and DB2). And, of course, these features are now supported in the upcoming jOOQ 3.10 as well.

Then, connect to this instance and run this nice little query in it:

WITH RECURSIVE t(a, b) AS (
SELECT 1, CAST('a' AS CHAR(15))
UNION ALL
SELECT t.a + 1, CONCAT(t.b, 'a')
FROM t
WHERE t.a < 10
)
SELECT a, SUM(a) OVER (ORDER BY a) AS ∑, b
FROM t

And get this result:

a ∑ b
--------------------------
1 1 a
2 3 aa
3 6 aaa
4 10 aaaa
5 15 aaaaa
6 21 aaaaaa
7 28 aaaaaaa
8 36 aaaaaaaa
9 45 aaaaaaaaa
10 55 aaaaaaaaaa

Would you believe this is MySQL?

Bonus

A nice “hidden” feature is the support of new pessimistic locking clauses, in particular FOR UPDATE SKIP LOCKED. This has been available in Oracle for ages and since recently in PostgreSQL as well, and now in MySQL. A very useful feature when implementing simple message queues or reservation systems. More details in this article here:

One feature that I rarely see in the wild (even if it is extremely useful for reporting) is called “logical windowing” in Oracle, and it’s most useful when used with INTERVAL ranges. Let’s see what we may want to do.

As you can see, in column #1, we’re getting always the same number of payments for any given hour. If we were using GROUP BY trunc(payment_date, 'HH24'), it would be simple to see that this would the result we were looking for:

Column #2 is a bit more sophisticated as it will also aggregatee the number of payments per hour, but only those that are before any given payment. For instance, in the hour of 24.05.05 23:00:00, we have 6 payments (see above), and those are the different payment numbers within that hour. E.g. 24.05.05 23:11:53 is the fifth payment within that hour:

Finally, column #3 is using a sliding interval. For each payment, it checks how many payments were there in the time range between the current payment and one hour before. Now, excuse my ASCII art:

All of these are really useful for reporting, and all of these can be done rather easily with window functions. I’m using Oracle syntax here:

SELECT
CAST(payment_date AS TIMESTAMP) ts,
-- Column 1: Payments in the same hour
COUNT(*) OVER (
PARTITION BY TRUNC(payment_date, 'HH24')
),
-- Column 2: Preceding payments in the same hour
COUNT(*) OVER (
PARTITION BY TRUNC(payment_date, 'HH24')
ORDER BY payment_date
),
-- Column 3: Preceding payments within one hour
COUNT(*) OVER (
ORDER BY payment_date
RANGE BETWEEN INTERVAL '1' HOUR PRECEDING AND CURRENT ROW
)
FROM payment
ORDER BY ts

Remember how column #1 is essentially the same thing as running a GROUP BY operation? That’s simple with window functions too. Just use a single PARTITION BY clause, and you get that grouping by the hour:

COUNT(*) OVER (
PARTITION BY TRUNC(payment_date, 'HH24')
)

Column #2 is more interesting. Because we’re ordering the window, and because we’re not putting any explicit frame clause (ROWS or RANGE), by default, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is implied. The following two are the same:

COUNT(*) OVER (
PARTITION BY TRUNC(payment_date, 'HH24')
ORDER BY payment_date
)
COUNT(*) OVER (
PARTITION BY TRUNC(payment_date, 'HH24')
ORDER BY payment_date
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
)

In other words, we group (PARTITION BY) all payments by the hour, order each group by the actual date, and limit / frame the window to the payments that preceed the current payment.

Column #3 is the most interesting and the least known, however. Unfortunately, it is not yet supported in all databases, even if it’s part of the SQL standard.

COUNT(*) OVER (
ORDER BY payment_date
RANGE BETWEEN INTERVAL '1' HOUR PRECEDING AND CURRENT ROW
)

We’re no longer using the PARITION BY clause, because we don’t want to group payments into hours. Instead, we want to look behind one hour from each individual payment to see how many payments there were in that hour. But hours now don’t start / end at 00:00 minutes/seconds, they end in a given payment’s payment_date column (where Oracle DATE is really a timestamp).

This is a really nice feature that is available with DATE or TIMESTAMP columns only, when you pass an INTERVAL data type to the frame clause. I bet, however, that most of you have already had 1-2 reports where this might have been useful to know!

Welcome to the jOOQ Tuesdays series. In this series, we’ll publish an article on the third Tuesday every other month (today, exceptionally on a Wednesday because of technical issues) where we interview someone we find exciting in our industry from a jOOQ perspective. This includes people who work with SQL, Java, Open Source, and a variety of other related topics.

I’m very excited to feature today Chris Saxon who has worked with Oracle forever, and who is one of the brains behind the famous Ask Tom website.

Chris, you’re part of the famous Ask Tom team. Everyone working with Oracle has wound up on Ask Tom’s website at least once. You’ve answered an incredible amount of questions already. What’s it like to work for such a big community as Oracle’s?

It’s an amazing experience! My first real job was as a PL/SQL developer. My only knowledge of SQL was a couple of vaguely remembered lectures at university. Ask Tom was the main place I learned about SQL and Oracle Database. So it’s a huge honor to be on the other side, helping others get the best out of the technology.

The best part has to be the positive comments when you help someone solve a problem that’s been troubling them for days. That’s why we’re here. To help developers learn more about Oracle and improve their SQL skills. When you use the database to its full extent, you can write better, faster applications with less code!

What were your three most interesting questions, so far?

Any question that has a clear definition and a complete test case is interesting! ;) Personally I enjoy using SQL to solve complex problems the best. So the first two do just that:

The poster had a series of transactions. These were grouped into two types. For each row, they wanted to show the id of the previous transaction from the other group.

At first this sounds like a problem you can solve using LAG or LEAD. But these only search for values within the same group. So you need a different method.

I provided a solution using the model clause. Using this, you can generate columns based on complex, spreadsheet-like formulas. Rows in your table are effectively cells in the sheet. You identify them by defining dimensions which can be other columns or expressions. By setting the transaction type as a dimension, you can then easily reference – and assign – values from one type to the other.

This worked well. But commenters were quick to provide solutions using window functions and 12c’s match_recognize clause. Both of which were faster than my approach!

I like this because it shows the flexibility of SQL. And it shows the value of an engaged community. No one knows everything. By sharing our knowledge and workin together we can all become better developers.

The poster had a set of abbreviations for words. For example, Saint could also be “St.” or “St”. They wanted to take text containing these words. Then generate all combinations of strings using these abbreviations.

The “obvious” solution the user had is to split the text into words. Then for each word, join the abbreviation table, replacing the string as needed. So for a five word string, you have five joins.

There are a couple of problems with this method. The number of joins limits the number of words. So if you have a string with seven words, but only six table joins you won’t abbreviate the final word.

The other issue is performance. Joining the same table N times increases the work you do. If you have long sentences and/or a large number of abbreviations, the query could take a long time to run.

To overcome these you need to ask: “how can I join to the abbreviation table just once?”

The solution to do this starts the same as the original. Split the sentence into a table of words. Then join this to the abbreviations to give a row for each replacement needed.

You can then recursively walk down these rows using CTEs. These build up the sentence again, replacing words with their abbreviations as needed. A scalable solution that only needs a single pass of each table!

The final question relates to performance. Tom Kyte’s mantra was always “if you can do it in SQL, do it in SQL”. The reason is because a pure SQL solution is normally faster than one which combines SQL and other code. Yet a question came in that cast doubt on this:

The poster was updating a table. The new values came from another table. He was surprised that PL/SQL using bulk processing came out faster than the pure SQL approach.

The query in question was in the form:

update table1
set col1 = (select col2 from table2 where t1.code = t2.code);

It turned out the reason was due to “missing” indexes. Oracle executes the subquery once for every row in table1. Unless there’s an index on table2 (code), this will full scan table2 once for every row in table1!

The PL/SQL only approach avoided this problem by reading the whole of table2 into an array. So there was only one full scan of table2.

The problem here is there was no index on the join condition (t1.code = t2.code). With this in place Oracle does an index lookup of table2 for each row in table1. A massive performance improvement!

The moral being if your SQL is “slow”, particularly in compared to a combined SQL + other language method, it’s likely you have a missing index (or two).

This question again showed the strength and value of the Oracle community. Shortly after I posted the explanation, a reviewer was quick to point out the following SQL solution:

merge into table1
using table2
on (t1.code = t2.code)
when matched
then update set t1.col = t2.col;

This came out significantly faster than both the original update and PL/SQL – without needing any extra indexes!

You’re running a YouTube channel called “The Magic of SQL”. Are SQL developers magicians?

Of course they are! In fact, I’d say that all developers are magicians. As Arthur C Clarke said:

“Any sufficiently advanced technology is indistinguishable from magic”

The amount of computing power you carry around in your phone today is mind blowing. Just ask your grandparents!

I think SQL developers have a special kind of magic though :). The ability to answer hard questions with a few lines of SQL is amazing. And for it to adapt to changes in the underlying data to give great performance without you changing it is astounding.

Your Twitter account has a pinned tweet about window functions. I frequently talk to Java developers at conferences, and few of them know about window functions, even if they’ve been in databases like Oracle for a very long time. Why do you think they’re still so “obscure”?

Oracle Database has had window functions has had them since the nineties. But many other RDBMSes have only fully supported them recently. So a combination of writing “database independent” code and people using other databases is certainly a factor.

Use of tools which hide SQL from developers is also a problem. If you’re not actively using SQL, it’s easy to overlook many of its features.

Fortunately I think this is changing. As more and more developers are realizing, SQL is a powerful language. Learning how to use it effectively is a key skill for all developers. Window functions and other SQL features mean you can get write better performing applications with less code. Who doesn’t want that? ;)

What are three things that every developer should know about SQL?

1. Understand set based processing

If you find yourself writing a cursor loop (select … from … loop), and inside that loop you run more SQL, you’re doing it wrong.

Think about it. Do you eat your cornflakes by placing one flake in your bowl, adding the milk, and eating that one piece? Then doing the same for the next. And the next. And so on? Or do you pour yourself a big bowl and eat all the flakes at once?

If you have a cursor loop with more SQL within the loop, you’re effectively doing this. There’s a lot of overhead in executing each SQL statement. This will slow you down if you have a large number of statements that each process one row. Instead you want few statements that process lots of rows where possible.

It’s also possible to do this by accident. As developers we’re taught that code reuse is A Good Thing. So if there’s an API available we’ll often use it. For example, say you’re building a batch process. This finds the unshipped orders, places them on shipments and marks them as sent.

If a ship_order function exists, you could write something like:

select order_id from unshipped_orders loop
ship_order ( order_id );
end loop;

The problem here is ship_order almost certainly contains SQL. SQL you’ll be executing once for every order awaiting postage. If it’s only a few this may be fine. But if there’s hundreds or thousands this process could take a long time to run.

The way to make this faster is to process all the orders in one go. You can do this with SQL like:

insert into shipments
select … from unshipped_orders;
update unshipped_orders
set shipment_date = sysdate;

You may counter there’s other, non-SQL, processing you need to do such as sending emails. So you still need a query to find the order ids.

But you can overcome this! With update’s returning clause, you can get values from all the changed rows:

update unshipped_orders
set shipment_date = sysdate
returning order_id bulk collect into order_array;

This gives you all the order ids to use as you need.

2. Learn what an execution plan is and how to create and read one

“How can I make my SQL faster” is one of the most common classes of questions posted on Ask Tom. The trouble is there’s scant one-size-fits-all advice when it comes to SQL performance. To help we need to know what your query is, what the tables and indexes are and details about the data. But most importantly we need to know what the query is actually doing!

For example, say you want me to help you figure out a faster route to work. To do this I need to know which route you currently use and how long each part of it takes. We can then compare this against other routes, checking how far they are, expected traffic and predicted speeds. But we need the data to do this!

So when answering performance questions, the first thing we look for is an execution plan. Many confuse this with an explain plan. An explain plan is just a prediction. Often it’s wrong. And even when it’s right, we still don’t know how much work each step does.

An execution plan shows exactly what the database did. It also gives stats about how much work, how often and how long it took to process each step. Combine this with a basic understanding of indexes and join methods and you can often solve your own performance problems.

3. Use bind variables

Sadly data breaches are all too common. There hardly seems to be a week that goes by without news of a major company leaking sensitive data. And the root cause of these attacks is often SQL injection.

This is a simple, well known attack vector. If you write vulnerable SQL on a web enabled application, eventually you’ll be attacked.

And this isn’t something you can avoid by using NoSQL databases. SQL injection like attacks are possible there too!

Fortunately the solution is easy: use bind variables. Not only do these secure your application, they can improve performance too.

Make sure your company is not tomorrow’s data leak headline. Use bind variables!

Last but not least: When will Oracle have a BOOLEAN type? :)

We have a BOOLEAN type! It’s just only in PL/SQL ;P

There’s currently a push in the community to for us to add a SQL BOOLEAN type. If this is a feature you’d like to see, you can vote for it on the Database Ideas forum. The more support there is, the more likely we are to implement it! But no promises ;)

Listicles like these do work – not only do they attract attention, if the content is also valuable (and in this case it is, trust me), the article format can be extremely entertaining.

This article will bring you 10 SQL tricks that many of you might not have thought were possible. The article is a summary of my new, extremely fast-paced, ridiculously childish-humoured talk, which I’m giving at conferences (recently at JAX, and Devoxx France. You may quote me on this:

Here are 10 SQL Tricks That You Didn’t Think Were Possible:

Introduction

In order to understand the value of these 10 SQL tricks, it is first important to understand the context of the SQL language. Why do I talk about SQL at Java conferences? (and I’m usually the only one!) This is why:

From early days onwards, programming language designers had this desire to design languages in which you tell the machine WHAT you want as a result, not HOW to obtain it. For instance, in SQL, you tell the machine that you want to “connect” (JOIN) the user table and the address table and find the users that live in Switzerland. You don’t care HOW the database will retrieve this information (e.g. should the users table be loaded first, or the address table? Should the two tables be joined in a nested loop or with a hashmap? Should all data be loaded in memory first and then filtered for Swiss users, or should we only load Swiss addresses in the first place? Etc.)

As with every abstraction, you will still need to know the basics of what’s going on behind the scenes in a database to help the database make the right decisions when you query it. For instance, it makes sense to:

Establish a formal foreign key relationship between the tables (this tells the database that every address is guaranteed to have a corresponding user)

Add an index on the search field: The country (this tells the database that specific countries can be found in O(log N) instead of O(N))

But once your database and your application matures, you will have put all the important meta data in place and you can focus on your business logic only. The following 10 tricks show amazing functionality written in only a few lines of declarative SQL, producing simple and also complex output.

1. Everything is a Table

This is the most trivial of tricks, and not even really a trick, but it is fundamental to a thorough understanding of SQL: Everything is a table! When you see a SQL statement like this:

SELECT *
FROM person

… you will quickly spot the table person sitting right there in the FROM clause. That’s cool, that is a table. But did you realise that the whole statement is also a table? For instance, you can write:

SELECT *
FROM (
SELECT *
FROM person
) t

And now, you have created what is called a “derived table” – i.e. a nested SELECT statement in a FROM clause.

That’s trivial, but if you think of it, quite elegant. You can also create ad-hoc, in-memory tables with the VALUES() constructor as such, in some databases (e.g. PostgreSQL, SQL Server):

SELECT *
FROM (
VALUES(1),(2),(3)
) t(a)

Which simply yields:

a
---
1
2
3

If that clause is not supported, you can revert to derived tables, e.g. in Oracle:

SELECT *
FROM (
SELECT 1 AS a FROM DUAL UNION ALL
SELECT 2 AS a FROM DUAL UNION ALL
SELECT 3 AS a FROM DUAL
) t

Now that you’re seeing that VALUES() and derived tables are really the same thing, conceptually, let’s review the INSERT statement, which comes in two flavours:

-- SQL Server, PostgreSQL, some others:
INSERT INTO my_table(a)
VALUES(1),(2),(3);
-- Oracle, many others:
INSERT INTO my_table(a)
SELECT 1 AS a FROM DUAL UNION ALL
SELECT 2 AS a FROM DUAL UNION ALL
SELECT 3 AS a FROM DUAL

In SQL everything is a table. When you’re inserting rows into a table, you’re not really inserting individual rows. You’re really inserting entire tables. Most people just happen to insert a single-row-table most of the time, and thus don’t realise what INSERT really does.

Everything is a table. In PostgreSQL, even functions are tables:

SELECT *
FROM substring('abcde', 2, 3)

The above yields:

substring
---------
bcd

If you’re programming in Java, you can use the analogy of the Java 8 Stream API to take this one step further. Consider the following equivalent concepts:

TABLE : Stream<Tuple<..>>
SELECT : map()
DISTINCT : distinct()
JOIN : flatMap()
WHERE / HAVING : filter()
GROUP BY : collect()
ORDER BY : sorted()
UNION ALL : concat()

With Java 8, “everything is a Stream” (as soon as you start working with Streams, at least). No matter how you transform a stream, e.g. with map() or filter(), the resulting type is always a Stream again.

Common Table Expressions (also: CTE, also referred to as subquery factoring, e.g. in Oracle) are the only way to declare variables in SQL (apart from the obscure WINDOW clause that only PostgreSQL and Sybase SQL Anywhere know).

This is a powerful concept. Extremely powerful. Consider the following statement:

-- Table variables
WITH
t1(v1, v2) AS (SELECT 1, 2),
t2(w1, w2) AS (
SELECT v1 * 2, v2 * 2
FROM t1
)
SELECT *
FROM t1, t2

It yields

v1 v2 w1 w2
-----------------
1 2 2 4

Using the simple WITH clause, you can specify a list of table variables (remember: everything is a table), which may even depend on each other.

That is easy to understand. This makes CTE (Common Table Expressions) already very useful, but what’s really really awesome is that they’re allowed to be recursive! Consider the following PostgreSQL example:

WITH RECURSIVE t(v) AS (
SELECT 1 -- Seed Row
UNION ALL
SELECT v + 1 -- Recursion
FROM t
)
SELECT v
FROM t
LIMIT 5

It yields

v
---
1
2
3
4
5

How does it work? It’s relatively easy, once you see through the many keywords. You define a common table expression that has exactly two UNION ALL subqueries.

The first UNION ALL subquery is what I usually call the “seed row”. It “seeds” (initialises) the recursion. It can produce one or several rows on which we will recurse afterwards. Remember: everything is a table, so our recursion will happen on a whole table, not on an individual row/value.

The second UNION ALL subquery is where the recursion happens. If you look closely, you will observe that it selects from t. I.e. the second subquery is allowed to select from the very CTE that we’re about to declare. Recursively. It thus has also access to the column v, which is being declared by the CTE that already uses it.

WITH RECURSIVE q(r, i, rx, ix, g) AS (
SELECT r::DOUBLE PRECISION * 0.02, i::DOUBLE PRECISION * 0.02,
.0::DOUBLE PRECISION , .0::DOUBLE PRECISION, 0
FROM generate_series(-60, 20) r, generate_series(-50, 50) i
UNION ALL
SELECT r, i, CASE WHEN abs(rx * rx + ix * ix) <= 2 THEN rx * rx - ix * ix END + r,
CASE WHEN abs(rx * rx + ix * ix) <= 2 THEN 2 * rx * ix END + i, g + 1
FROM q
WHERE rx IS NOT NULL AND g < 99
)
SELECT array_to_string(array_agg(s ORDER BY r), '')
FROM (
SELECT i, r, substring(' .:-=+*#%@', max(g) / 10 + 1, 1) s
FROM q
GROUP BY i, r
) q
GROUP BY i
ORDER BY i

Run the above on PostgreSQL, and you’ll get something like

This blog is full of running total examples. They’re some of the most educational examples to learn about advanced SQL, because there are at least a dozen of ways how to implement a running total.

A running total is easy to understand, conceptually.

In Microsoft Excel, you would simply calculate a sum (or difference) of two previous (or subsequent) values, and then use the useful crosshair cursor to pull that formula through your entire spreadsheet. You “run” that total through the spreadsheet. A “running total”.

Window functions are a powerful concept – not so easy to understand at first, but in fact, they’re really really easy:

Window functions are aggregations / rankings on a subset of rows relative to the current row being transformed by SELECT

That’s it. :)

What it essentially means is that a window function can perform calculations on rows that are “above” or “below” the current row. Unlike ordinary aggregations and GROUP BY, however, they don’t transform the rows, which makes them very useful.

The syntax can be summarised as follows, with individual parts being optional

function(...) OVER (
PARTITION BY ...
ORDER BY ...
ROWS BETWEEN ... AND ...
)

So, we have any sort of function (we’ll see examples for such functions later), followed by this OVER() clause, which specifies the window. I.e. this OVER() clause defines:

The PARTITION: Only rows that are in the same partition as the current row will be considered for the window

The ORDER: The window can be ordered independently of what we’re selecting

The ROWS (or RANGE) frame definition: The window can be restricted to a fixed amount of rows “ahead” and “behind”

That’s all there is to window functions.

Now how does that help us calculate a running total? Consider the following data:

Let’s assume that BALANCE is what we want to calculate from AMOUNT

Intuitively, we can immediately see that the following holds true:

So, in plain English, any balance can be expressed with the following pseudo SQL:

TOP_BALANCE - SUM(AMOUNT) OVER (
"all the rows on top of the current row"
)

In real SQL, that would then be written as follows:

SUM(t.amount) OVER (
PARTITION BY t.account_id
ORDER BY t.value_date DESC,
t.id DESC
ROWS BETWEEN UNBOUNDED PRECEDING
AND 1 PRECEDING
)

Explanation:

The partition will calculate the sum for each bank account, not for the entire data set

The ordering will guarantee that transactions are ordered (within the partition) prior to summing

The rows clause will consider only preceding rows (within the partition, given the ordering) prior to summing

All of this will happen in-memory over the data set that has already been selected by you in your FROM .. WHERE etc. clauses, and is thus extremely fast.

Intermezzo

Before we move on to all the other awesome tricks, consider this: We’ve seen

(Recursive) Common Table Expressions (CTE)

Window functions

Both of these features are:

Awesome

Exremely powerful

Declarative

Part of the SQL standard

Available in most popular RDBMS

Very important building blocks

If anything can be concluded from this article, it is the fact that you should absolutely know these two building blocks of modern SQL. Why? Because:

4. Finding the Largest Series with no Gaps

Stack Overflow has this very nice feature to motivate people to stay on their website for as long as possible. Badges:

For scale, you can see how many badges I have. Tons.

How do you calculate these badges? Let’s have a look at the “Enthusiast” and the “Fanatic”. These badges are awarded to anyone who spends a given amount of consecutive days on their platform. Regardless of any wedding date or wife’s birthday, you HAVE TO LOG IN, or the counter starts from zero again.

Now as we’re doing declarative programming, we don’t care about maintaining any state and in-memory counters. We want to express this in the form of online analytic SQL. I.e. consider this data:

Wow. Interesting. So, 14 - 1 = 13, 15 - 2 = 13, 16 - 3 = 13, but 18 - 4 = 14. No one can say it better than Doge:

There’s a simple example for this behaviour:

ROW_NUMBER() never has gaps. That’s how it’s defined

Our data, however, does

So when we subtract a “gapless” series of consecutive integers from a “gapful” series of non-consecutive dates, we will get the same date for each “gapless” subseries of consecutive dates, and we’ll get a new date again where the date series had gaps.

Huh.

This means we can now simply GROUP BY this arbitrary date value:

SELECT
min(login_date), max(login_date),
max(login_date) -
min(login_date) + 1 AS length
FROM login_date_groups
GROUP BY grp
ORDER BY length DESC

And we’re done. The largest series of consecutive dates with no gaps has been found:

WITH
login_dates AS (
SELECT DISTINCT cast(login_time AS DATE) login_date
FROM logins WHERE user_id = :user_id
),
login_date_groups AS (
SELECT
login_date,
login_date - row_number() OVER (ORDER BY login_date) AS grp
FROM login_dates
)
SELECT
min(login_date), max(login_date),
max(login_date) - min(login_date) + 1 AS length
FROM login_date_groups
GROUP BY grp
ORDER BY length DESC

Not that hard in the end, right? Of course, having the idea makes all the difference, but the query itself is really very very simple and elegant. No way you could implement some imperative-style algorithm in a leaner way than this.

Whew.

5. Finding the Length of a Series

Previously, we had seen series of consecutive values. That’s easy to deal with as we can abuse of the consecutiveness of integers. What if the definition of a “series” is less intuitive, and in addition to that, several series contain the same values? Consider the following data, where LENGTH is the length of each series that we want to calculate:

Yes, you’ve guessed right. A “series” is defined by the fact that consecutive (ordered by ID) rows have the same SIGN(AMOUNT). Check again the data formatted as below:

In this table, we want to copy the row number value into “LO” at the “lower” end of a series, and into “HI” at the “upper” end of a series. For this we’ll be using the magical LEAD() and LAG(). LEAD() can access the n-th next row from the current row, whereas LAG() can access the n-th previous row from the current row. For example:

SELECT
lag(v) OVER (ORDER BY v),
v,
lead(v) OVER (ORDER BY v)
FROM (
VALUES (1), (2), (3), (4)
) t(v)

The above query produces:

That’s awesome! Remember, with window functions, you can perform rankings or aggregations on a subset of rows relative to the current row. In the case of LEAD() and LAG(), we simply access a single row relative to the current row, given its offset. This is useful in so many cases.

Continuing with our “LO” and “HI” example, we can simply write:

SELECT
trx.*,
CASE WHEN lag(sign)
OVER (ORDER BY id DESC) != sign
THEN rn END AS lo,
CASE WHEN lead(sign)
OVER (ORDER BY id DESC) != sign
THEN rn END AS hi,
FROM trx

… in which we compare the “previous” sign (lag(sign)) with the “current” sign (sign). If they’re different, we put the row number in “LO”, because that’s the lower bound of our series.

Then we compare the “next” sign (lead(sign)) with the “current” sign (sign). If they’re different, we put the row number in “HI”, because that’s the upper bound of our series.

Finally, a little boring NULL handling to get everything right, and we’re done:

SELECT -- With NULL handling...
trx.*,
CASE WHEN coalesce(lag(sign)
OVER (ORDER BY id DESC), 0) != sign
THEN rn END AS lo,
CASE WHEN coalesce(lead(sign)
OVER (ORDER BY id DESC), 0) != sign
THEN rn END AS hi,
FROM trx

Next step. We want “LO” and “HI” to appear in ALL rows, not just at the “lower” and “upper” bounds of a series. E.g. like this:

We’re using a feature that is available at least in Redshift, Sybase SQL Anywhere, DB2, Oracle. We’re using the “IGNORE NULLS” clause that can be passed to some window functions:

SELECT
trx.*,
last_value (lo) IGNORE NULLS OVER (
ORDER BY id DESC
ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW) AS lo,
first_value(hi) IGNORE NULLS OVER (
ORDER BY id DESC
ROWS BETWEEN CURRENT ROW
AND UNBOUNDED FOLLOWING) AS hi
FROM trx

A lot of keywords! But the essence is always the same. From any given “current” row, we look at all the “previous values” (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), but ignoring all the nulls. From those previous values, we take the last value, and that’s our new “LO” value. In other words, we take the “closest preceding” “LO” value.

The same with “HI”. From any given “current” row, we look at all the “subsequent values” (ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING), but ignoring all the nulls. From the subsequent values, we take the first value, and that’s our new “HI” value. In other words, we take the “closest following” “HI” value.

Explained in Powerpoint:

Getting it 100% correct, with a little boring NULL fiddling:

SELECT -- With NULL handling...
trx.*,
coalesce(last_value (lo) IGNORE NULLS OVER (
ORDER BY id DESC
ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW), rn) AS lo,
coalesce(first_value(hi) IGNORE NULLS OVER (
ORDER BY id DESC
ROWS BETWEEN CURRENT ROW
AND UNBOUNDED FOLLOWING), rn) AS hi
FROM trx

Finally, we’re just doing a trivial last step, keeping in mind off-by-1 errors:

WITH
trx1(id, amount, sign, rn) AS (
SELECT id, amount, sign(amount), row_number() OVER (ORDER BY id DESC)
FROM trx
),
trx2(id, amount, sign, rn, lo, hi) AS (
SELECT trx1.*,
CASE WHEN coalesce(lag(sign) OVER (ORDER BY id DESC), 0) != sign
THEN rn END,
CASE WHEN coalesce(lead(sign) OVER (ORDER BY id DESC), 0) != sign
THEN rn END
FROM trx1
)
SELECT
trx2.*, 1
- last_value (lo) IGNORE NULLS OVER (ORDER BY id DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
+ first_value(hi) IGNORE NULLS OVER (ORDER BY id DESC
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
FROM trx2

Huh. This SQL thing does start getting interesting!

Ready for more?

6. The subset sum problem with SQL

This is my favourite!

What is the subset sum problem? Find a fun explanation here: https://xkcd.com/287

How to do it with SQL? Easy. Just create a CTE that contains all the 2^{n}*possible* sums and then find the closest one for each TOTAL:

-- All the possible 2N sums
WITH sums(sum, max_id, calc) AS (...)
-- Find the best sum per “TOTAL”
SELECT
totals.total,
something_something(total - sum) AS best,
something_something(total - sum) AS calc
FROM draw_the_rest_of_the_*bleep*_owl

As you’re reading this, you might be like my friend here:

But don’t worry, the solution is – again – not all that hard (although it doesn’t perform due to the nature of the algorithm):

WITH sums(sum, id, calc) AS (
SELECT item, id, to_char(item) FROM items
UNION ALL
SELECT item + sum, items.id, calc || ' + ' || item
FROM sums JOIN items ON sums.id < items.id
)
SELECT
totals.id,
totals.total,
min (sum) KEEP (
DENSE_RANK FIRST ORDER BY abs(total - sum)
) AS best,
min (calc) KEEP (
DENSE_RANK FIRST ORDER BY abs(total - sum)
) AS calc
FROM totals
CROSS JOIN sums
GROUP BY totals.id, totals.total

In this article, I won’t explain the details of this solution, because the example has been taken from a previous article that you can find here:

Enjoy reading the details, but be sure to come back here for the remaining 4 tricks:

7. Capping a Running Total

So far, we’ve seen how to calculate an “ordinary” running total with SQL using window functions. That was easy. Now, how about if we cap the running total such that it never goes below zero? Essentially, we want to calculate this:

So, when that big negative AMOUNT-20000 was subtracted, instead of displaying the real TOTAL of -15550, we simply display 0. In other words (or data sets):

Exactly. With obscure, vendor-specific SQL. In this case, we’re using Oracle SQL

How does it work? Surprisingly easy!

Just add MODEL in your SELECT, and you’re opening up a can of awesome SQL worms!

SELECT ... FROM some_table
-- Put this after any table
MODEL ...

Once we put MODEL there, we can implement spreadsheet logic directly in our SQL statements, just as with Microsoft Excel.

The following three clauses are the most useful and widely used (i.e. 1-2 per year by anyone on this planet):

MODEL
-- The spreadsheet dimensions
DIMENSION BY ...
-- The spreadsheet cell type
MEASURES ...
-- The spreadsheet formulas
RULES ...

The meaning of each of these three additional clauses is best explained with slides again.

The DIMENSION BY clause specifies the dimensions of your spreadsheet. Unlike in MS Excel, you can have any number of dimensions in Oracle:

The MEASURES clause specifies the values that are available in each cell of your spreadsheet. Unlike in MS Excel, you can have a whole tuple in each cell in Oracle, not just a single value.

The RULES clause specifies the formulas that apply to each cell in your spreadsheet. Unlike in MS Excel, these rules / formulas are centralised at a single place, instead of being put inside of each cell:

This design makes MODEL a bit harder to use than MS Excel, but much more powerful, if you dare. The whole query will then be “trivially”:

SELECT *
FROM (
SELECT date, amount, 0 AS total
FROM amounts
)
MODEL
DIMENSION BY (row_number() OVER (ORDER BY date) AS rn)
MEASURES (date, amount, total)
RULES (
total[any] = greatest(0,
coalesce(total[cv(rn) - 1], 0) + amount[cv(rn)])
)

This whole thing is so powerful, it ships with its own whitepaper by Oracle, so rather than explaining things further here in this article, please do read the excellent whitepaper:

8. Time Series Pattern Recognition

If you’re into fraud detection or any other field that runs real time analytics on large data sets, time series pattern recognition is certainly not a new term to you.

If we review the “length of a series” data set, we might want to generate triggers on complex events over our time series as such:

Trigger on the 3^{rd} repetition of an event if the event occurs more than 3 times.

Similar to the previous MODEL clause, we can do this with an Oracle-specific clause that was added to Oracle 12c:

SELECT ... FROM some_table
-- Put this after any table to pattern-match
-- the table’s contents
MATCH_RECOGNIZE (...)

The simplest possible application of MATCH_RECOGNIZE includes the following subclauses:

SELECT *
FROM series
MATCH_RECOGNIZE (
-- Pattern matching is done in this order
ORDER BY ...
-- These are the columns produced by matches
MEASURES ...
-- A short specification of what rows are
-- returned from each match
ALL ROWS PER MATCH
-- «Regular expressions» of events to match
PATTERN (...)
-- The definitions of «what is an event»
DEFINE ...
)

That sounds crazy. Let’s look at some example clause implementations

SELECT *
FROM series
MATCH_RECOGNIZE (
ORDER BY id
MEASURES classifier() AS trg
ALL ROWS PER MATCH
PATTERN (S (R X R+)?)
DEFINE
R AS sign(R.amount) = prev(sign(R.amount)),
X AS sign(X.amount) = prev(sign(X.amount))
)

What do we do here?

We order the table by ID, which is the order in which we want to match events. Easy.

We then specify the values that we want as a result. We want the “MEASURE” trg, which is defined as the classifier, i.e. the literal that we’ll use in the PATTERN afterwards. Plus we want all the rows from a match.

We then specify a regular expression-like pattern. The pattern is an event “S” for Start, followed optionally by “R” for Repeat, “X” for our special event X, followed by one or more “R” for repeat again. If the whole pattern matches, we get SRXR or SRXRR or SRXRRR, i.e. X will be at the third position of a series of length >= 4

Finally, we define R and X as being the same thing: The event when SIGN(AMOUNT) of the current row is the same as SIGN(AMOUNT) of the previous row. We don’t have to define “S”. “S” is just any other row.

This query will magically produce the following output:

| ID | VALUE_DATE | AMOUNT | TRG |
|------|------------|---------|-----|
| 9997 | 2014-03-18 | + 99.17 | S |
| 9981 | 2014-03-16 | - 71.44 | R |
| 9979 | 2014-03-16 | - 94.60 | X |
| 9977 | 2014-03-16 | - 6.96 | R |
| 9971 | 2014-03-15 | - 65.95 | S |
| 9964 | 2014-03-15 | + 15.13 | S |
| 9962 | 2014-03-15 | + 17.47 | S |
| 9960 | 2014-03-15 | + 3.55 | S |
| 9959 | 2014-03-14 | - 32.00 | S |

We can see a single “X” in our event stream. Exactly where we had expected it. At the 3rd repetition of an event (same sign) in a series of length > 3.

Boom!

As we don’t really care about “S” and “R” events, let’s just remove them as such:

SELECT
id, value_date, amount,
CASE trg WHEN 'X' THEN 'X' END trg
FROM series
MATCH_RECOGNIZE (
ORDER BY id
MEASURES classifier() AS trg
ALL ROWS PER MATCH
PATTERN (S (R X R+)?)
DEFINE
R AS sign(R.amount) = prev(sign(R.amount)),
X AS sign(X.amount) = prev(sign(X.amount))
)

Again, don’t expect me to explain this any better than the excellent Oracle whitepaper already did, which I strongly recommend reading if you’re using Oracle 12c anyway:

9. Pivoting and Unpivoting

If you’ve read this far, the following will be almost too embarassingly simple:

This is our data, i.e. actors, film titles, and film ratings:

| NAME | TITLE | RATING |
|-----------|-----------------|--------|
| A. GRANT | ANNIE IDENTITY | G |
| A. GRANT | DISCIPLE MOTHER | PG |
| A. GRANT | GLORY TRACY | PG-13 |
| A. HUDSON | LEGEND JEDI | PG |
| A. CRONYN | IRON MOON | PG |
| A. CRONYN | LADY STAGE | PG |
| B. WALKEN | SIEGE MADRE | R |

Observe how we kinda grouped by the actors and then “pivoted” the number films per rating each actor played in. Instead of displaying this in a “relational” way, (i.e. each group is a row) we pivoted the whole thing to produce a column per group. We can do this, because we know all the possible groups in advance.

Unpivoting is the opposite, when from the above, we want to get back to the “row per group” representation:

| NAME | RATING | COUNT |
|-----------|--------|-------|
| A. GRANT | NC-17 | 3 |
| A. GRANT | PG | 6 |
| A. GRANT | G | 5 |
| A. GRANT | PG-13 | 3 |
| A. GRANT | R | 6 |
| A. HUDSON | NC-17 | 12 |
| A. HUDSON | PG | 4 |

It’s actually really easy. This is how we’d do it in PostgreSQL:

SELECT
first_name, last_name,
count(*) FILTER (WHERE rating = 'NC-17') AS "NC-17",
count(*) FILTER (WHERE rating = 'PG' ) AS "PG",
count(*) FILTER (WHERE rating = 'G' ) AS "G",
count(*) FILTER (WHERE rating = 'PG-13') AS "PG-13",
count(*) FILTER (WHERE rating = 'R' ) AS "R"
FROM actor AS a
JOIN film_actor AS fa USING (actor_id)
JOIN film AS f USING (film_id)
GROUP BY actor_id

We can append a simple FILTER clause to an aggregate function in order to count only some of the data.

In all other databases, we’d do it like this:

SELECT
first_name, last_name,
count(CASE rating WHEN 'NC-17' THEN 1 END) AS "NC-17",
count(CASE rating WHEN 'PG' THEN 1 END) AS "PG",
count(CASE rating WHEN 'G' THEN 1 END) AS "G",
count(CASE rating WHEN 'PG-13' THEN 1 END) AS "PG-13",
count(CASE rating WHEN 'R' THEN 1 END) AS "R"
FROM actor AS a
JOIN film_actor AS fa USING (actor_id)
JOIN film AS f USING (film_id)
GROUP BY actor_id

The nice thing here is that aggregate functions usually only consider non-NULL values, so if we make all the values NULL that are not interesting per aggregation, we’ll get the same result.

Now, if you’re using either SQL Server, or Oracle, you can use the built-in PIVOT or UNPIVOT clauses instead. Again, as with MATCH_RECOGNIZE, just append this new keyword after a table and get the same result:

-- PIVOTING
SELECT something, something
FROM some_table
PIVOT (
count(*) FOR rating IN (
'NC-17' AS "NC-17",
'PG' AS "PG",
'G' AS "G",
'PG-13' AS "PG-13",
'R' AS "R"
)
)
-- UNPIVOTING
SELECT something, something
FROM some_table
UNPIVOT (
count FOR rating IN (
"NC-17" AS 'NC-17',
"PG" AS 'PG',
"G" AS 'G',
"PG-13" AS 'PG-13',
"R" AS 'R'
)
)

Easy. Next.

10. Abusing XML and JSON

First off

JSON is just XML with less features and less syntax

Now, everyone knows that XML is awesome. The corollary is thus:

JSON is less awesome

Don’t use JSON.

Now that we’ve settled this, we can safely ignore the ongoing JSON-in-the-database-hype (which most of you will regret in five years anyway), and move on to the final example. How to do XML in the database.

This is what we want to do:

Given the original XML document, we want to parse that document, unnest the comma-separated list of films per actor, and produce a denormalised representation of actors/films in a single relation.

Ready. Set. Go. This is the idea. We have three CTE:

WITH RECURSIVE
x(v) AS (SELECT '...'::xml),
actors(
actor_id, first_name, last_name, films
) AS (...),
films(
actor_id, first_name, last_name,
film_id, film
) AS (...)
SELECT *
FROM films

In the first one, we simply parse the XML. Here with PostgreSQL:

WITH RECURSIVE
x(v) AS (SELECT '
<actors>
<actor>
<first-name>Bud</first-name>
<last-name>Spencer</last-name>
<films>God Forgives... I Don’t, Double Trouble, They Call Him Bulldozer</films>
</actor>
<actor>
<first-name>Terence</first-name>
<last-name>Hill</last-name>
<films>God Forgives... I Don’t, Double Trouble, Lucky Luke</films>
</actor>
</actors>'::xml),
actors(actor_id, first_name, last_name, films) AS (...),
films(actor_id, first_name, last_name, film_id, film) AS (...)
SELECT *
FROM films

Easy.

Then, we do some XPath magic to extract the individual values from the XML structure and put those into columns:

WITH RECURSIVE
x(v) AS (SELECT '...'::xml),
actors(actor_id, first_name, last_name, films) AS (
SELECT
row_number() OVER (),
(xpath('//first-name/text()', t.v))[1]::TEXT,
(xpath('//last-name/text()' , t.v))[1]::TEXT,
(xpath('//films/text()' , t.v))[1]::TEXT
FROM unnest(xpath('//actor', (SELECT v FROM x))) t(v)
),
films(actor_id, first_name, last_name, film_id, film) AS (...)
SELECT *
FROM films

Still easy.

Finally, just a bit of recursive regular expression pattern matching magic, and we’re done!

WITH RECURSIVE
x(v) AS (SELECT '...'::xml),
actors(actor_id, first_name, last_name, films) AS (...),
films(actor_id, first_name, last_name, film_id, film) AS (
SELECT actor_id, first_name, last_name, 1,
regexp_replace(films, ',.+', '')
FROM actors
UNION ALL
SELECT actor_id, a.first_name, a.last_name, f.film_id + 1,
regexp_replace(a.films, '.*' || f.film || ', ?(.*?)(,.+)?', '\1')
FROM films AS f
JOIN actors AS a USING (actor_id)
WHERE a.films NOT LIKE '%' || f.film
)
SELECT *
FROM films

Let’s conclude:

Conclusion

All of what this article has shown was declarative. And relatively easy. Of course, for the fun effect that I’m trying to achieve in this talk, some exaggerated SQL was taken and I expressly called everything “easy”. It’s not at all easy, you have to practice SQL. Like many other languages, but a bit harder because:

The syntax is a bit awkward from time to time

Declarative thinking is not easy. At least, it’s very different

But once you get a hang of it, declarative programming with SQL is totally worth it as you can express complex relationships between your data in very very little code by just describing the result you want to get from the database.

Isn’t that awesome?

And if that was a bit over the top, do note that I’m happy to visit your JUG / conference to give this talk (just contact us), or if you want to get really down into the details of these things, we also offer this talk as a public or in-house workshop. Do get in touch! We’re looking forward.

Here’s the question by John that was looking for a Java solution:

With an array of N elements which are initialized to 0. we are given a sequence of M operations of the sort (p; q; r). The operation (p; q; r) signifies that the integer r should be added to all array elements A[p];A[p + 1]; : : : ;A[q]. You are to output the maximum element in the array that would result from performing all M operations. There is a naive solution that simply performs all operations and then returns the maximum value, that takes O(MN) time. We are looking for a more efficient algorithm.

Interesting. Indeed, a naive solution would just perform all the operations as requested. Another naive but less naive solution would transform the operations into signals of the form (x; y) for all (p; r) and for all (q + 1; -r). In other words, we could implement the solution I had presented trivially as such:

// This is just a utility class to model the ops
class Operation {
final int p;
final int q;
final int r;
Operation(int p, int q, int r) {
this.p = p;
this.q = q;
this.r = r;
}
}
// These are some example ops
Operation[] ops = {
new Operation(4, 12, 2),
new Operation(2, 8, 3),
new Operation(6, 7, 1),
new Operation(3, 7, 2)
};
// Here, we're calculating the min and max
// values for the combined values of p and q
IntSummaryStatistics stats = Stream
.of(ops)
.flatMapToInt(op -> IntStream.of(op.p, op.q))
.summaryStatistics();
// Create an array for all the required elements using
// the min value as "offset"
int[] array = new int[stats.getMax() - stats.getMin()];
// Put +r and -r "signals" into the array for each op
for (Operation op : ops) {
int lo = op.p - stats.getMin();
int hi = op.q + 1 - stats.getMin();
if (lo >= 0)
array[lo] = array[lo] + op.r;
if (hi < array.length)
array[hi] = array[hi] - op.r;
}
// Now, calculate the prefix sum sequentially in a
// trivial loop
int maxIndex = Integer.MIN_VALUE;
int maxR = Integer.MIN_VALUE;
int r = 0;
for (int i = 0; i < array.length; i++) {
r = r + array[i];
System.out.println((i + stats.getMin()) + ":" + r);
if (r > maxR) {
maxIndex = i + stats.getMin();
maxR = r;
}
}
System.out.println("---");
System.out.println(maxIndex + ":" + maxR);

The above program would print out:

2:3
3:5
4:7
5:7
6:8
7:8
8:5
9:2
10:2
11:2
---
6:8

So, the maximum value is generated at position 6, and the value is 8.

Faster calculation in Java 8

This can be calculated faster using Java 8’s new Arrays.parallelPrefix() operation. Instead of the loop in the end, just write:

In SQL, the naive sequential and linear complexity solution can easily be re-implemented, and I’m showing a solution for PostgreSQL.

How can we do it? We’re using a couple of features here. First off, we’re using common table expressions (also known as the WITH clause). We’re using these to declare table variables. The first variable is the op table, which contains our operation instructions, like in Java:

WITH
op (p, q, r) AS (
VALUES
(4, 12, 2),
(2, 8, 3),
(6, 7, 1),
(3, 7, 2)
),
...

This is trivial. We’re essentially just generating a couple of example values.

The second table variable is the signal table, where we use the previously described optimisation of putting a +r signal at all p positions, and a -r signal at all q + 1 positions:

WITH
...,
signal(x, r) AS (
SELECT p, r
FROM op
UNION ALL
SELECT q + 1, -r
FROM op
)
...

SELECT x, SUM(r) OVER (ORDER BY x)
FROM signal
ORDER BY x

x r
------
2 3
3 5
4 7
6 8
8 5
8 5
9 2
13 0

Now just find the max value for r, and we’re all set. We’ll take the shortcut by using ORDER BY and LIMIT:

SELECT x, SUM(r) OVER (ORDER BY x) AS s
FROM signal
ORDER BY s DESC
LIMIT 1

And we’re back with:

x r
------
6 8

Perfect! Here’s the full query:

WITH
op (p, q, r) AS (
VALUES
(4, 12, 2),
(2, 8, 3),
(6, 7, 1),
(3, 7, 2)
),
signal(x, r) AS (
SELECT p, r
FROM op
UNION ALL
SELECT q + 1, -r
FROM op
)
SELECT x, SUM(r) OVER (ORDER BY x) AS s
FROM signal
ORDER BY s DESC
LIMIT 1

Can you beat the conciseness of this SQL solution? I bet you can’t. Challengers shall write alternatives in the comment section.

I have a text files that have a lot of string lines in there. If I want to find lines before and after a matching in grep, I will do like this:

grep -A 10 -B 10 "ABC" myfile.txt

How can I implement the equivalent in java 8 using streams?

So the question is:

How can I implement the equivalent in Java 8 using streams?

Well, the unix shell and its various “pipable” commands are about the only thing that are even more awesome (and mysterious) than window functions. Being able to grep for a certain string in a file, and then display a “window” of a couple of lines is quite useful.

But unlike SQL, composing functions in Java (or other general purpose languages) is a bit simpler as there is less syntax clutter involved. We can easily do aggregations over a window frame to collect a generic amount of lines “lagging” and “leading” a match. Consider the following alternative:

You heard right. Up until now, the awesome window functions were a feature uniquely reserved to SQL. Even sophisticated functional programming languages still seem to lack this beautiful functionality (correct me if I’m wrong, Haskell folks).

We’ve written tons of blog posts about window functions, evangelising them to our audience, in articles like:

With SQL, this is a piece of cake. Observe the usage of SUM(t.amount) OVER(...):

SELECT
t.*,
t.current_balance - NVL(
SUM(t.amount) OVER (
PARTITION BY t.account_id
ORDER BY t.value_date DESC,
t.id DESC
ROWS BETWEEN UNBOUNDED PRECEDING
AND 1 PRECEDING
),
0) AS balance
FROM v_transactions t
WHERE t.account_id = 1
ORDER BY t.value_date DESC,
t.id DESC

Despite the sometimes a bit scary syntax, window functions are really very easy to understand. Windows are “views” of the data produced in your FROM / WHERE / GROUP BY / HAVING clauses. They allow you to access all the other rows relative to the current row, while you calculate something in your SELECT clause (or rarely, in your ORDER BY clause). What the above statement really does is this:

I.e. for any given balance, subtract from the current balance the SUM() “OVER()” the window of all the rows that are in the same partition as the current row (same bank account), and that are strictly “above” the current row.

Or, in detail:

PARTITION BY specifies “OVER()” which rows the window spans

ORDER BY specifies how the window is ordered

ROWS specifies what ordered row indexes should be considered

Can we do this with Java collections?

Yes we can! If you’re using jOOλ: A completely free Open Source, Apache 2.0 licensed library that we designed because we thought that the JDK 8 Stream and Collector APIs just don’t do it.

When Java 8 was designed, a lot of focus went into supporting parallel streams. That’s nice but certainly not the only useful area where functional programming can be applied. We’ve created jOOλ to fill this gap – without implementing an all new, alternative collections API, such as vavr or functional java have.

jOOλ already provides:

Tuple types

More useful stuff for ordered, sequential-only streams

With the recently released jOOλ 0.9.9, we’ve added two main new features:

Tons of new Collectors

Window functions

The many missing collectors in the JDK

The JDK ships with a couple of collectors, but they do seem awkward and verbose, and no one really appreciates writing collectors like the ones exposed in this Stack Overflow question (and many others).

But the use case exposed in the linked question is a very valid one. You want to aggregate several things from a list of person:

public class Person {
private String firstName;
private String lastName;
private int age;
private double height;
private double weight;
// getters / setters

This is a ridiculous problem for anyone used to writing SQL:

SELECT count(*), max(age), min(height), avg(weight)
FROM person

Done. How hard can it be in Java? It turns out that a lot of glue code needs to be written with vanilla JDK 8 API. Consider the sophisticated answers given

Note that this isn’t running a query against a SQL database (that’s what jOOQ is for). We’re running this “query” against an in-memory Java collection.

OK ok, that’s already awesome. Now what about window functions?

Right, the title of this article didn’t promise trivial aggregation stuff. It promised the awesome window functions.

Yet, window functions are nothing else than aggregations (or rankings) on a subset of your data stream. Instead of aggregating all of the stream (or table) into a single record, you want to maintain the original records, and provide the aggregation on each individual record directly.

SELECT
v,
ROW_NUMBER() OVER(w),
RANK() OVER(w),
DENSE_RANK() OVER(w)
FROM (
VALUES('a'),('a'),('a'),('b'),
('c'),('c'),('d'),('e')
) t(v)
WINDOW w AS (ORDER BY v);

It yields:

| V | ROW_NUMBER | RANK | DENSE_RANK |
|---|------------|------|------------|
| a | 1 | 1 | 1 |
| a | 2 | 1 | 1 |
| a | 3 | 1 | 1 |
| b | 4 | 4 | 2 |
| c | 5 | 5 | 3 |
| c | 6 | 5 | 3 |
| d | 7 | 7 | 4 |
| e | 8 | 8 | 5 |

Again, do note that we’re not running any queries against a database. Everything is done in memory.

Notice two things:

jOOλ’s window functions return 0-based ranks, as is expected for Java APIs, as opposed to SQL, which is all 1-based.

In Java, it is not possible to construct ad-hoc records with named columns. That’s unfortunate, and I do hope a future Java will provide support for such language features.

Let’s review what happens exactly in the code:

System.out.println(
// This is just enumerating our values
Seq.of("a", "a", "a", "b", "c", "c", "d", "e")
// Here, we specify a single window to be
// ordered by the value T in the stream, in
// natural order
.window(naturalOrder())
// The above window clause produces a Window<T>
// object (the w here), which exposes...
.map(w -> tuple(
// ... the current value itself, of type String...
w.value(),
// ... or various rankings or aggregations on
// the above window.
w.rowNumber(),
w.rank(),
w.denseRank()
))
// Just some nice formatting to produce the table
.format()
);

+----+----+----+---------+---------+----------+
| v0 | v1 | v2 | v3 | v4 | v5 |
+----+----+----+---------+---------+----------+
| a | 1 | a | a | {empty} | a |
| a | 2 | a | a | a | aa |
| a | 3 | a | b | a | aaa |
| b | 4 | a | c | a | aaab |
| c | 5 | a | c | b | aaabc |
| c | 6 | a | d | c | aaabcc |
| d | 7 | b | e | c | aaabccd |
| e | 8 | b | {empty} | d | aaabccde |
+----+----+----+---------+---------+----------+

Your analytics heart should be jumping, now.

Wait a second. Can we do frames, too, as in SQL? Yes, we can. Just as in SQL, when we omit the frame clause on a window definition (but we do specify an ORDER BY clause), then the following is applied by default:

RANGE BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW

We’ve done this in the previous examples. It can be seen in column v5, where we aggregate the string from the very first value up until the current value. So, let’s specify the frame then:

+----+----+----+---------+---------+-----+
| v0 | v1 | v2 | v3 | v4 | v5 |
+----+----+----+---------+---------+-----+
| a | 2 | a | a | {empty} | aa |
| a | 3 | a | a | a | aaa |
| a | 3 | a | b | a | aab |
| b | 3 | b | c | a | abc |
| c | 3 | c | c | b | bcc |
| c | 3 | c | d | c | ccd |
| d | 3 | d | e | c | cde |
| e | 2 | d | {empty} | d | de |
+----+----+----+---------+---------+-----+

As expected, lead() and lag() are unaffected, as opposed to count(), median(), and toString()

Awesome! Now, let’s review the running total.

Often, you don’t calculate window functions on the scalar value of the stream itself, as that value is usually not a scalar value but a tuple (or a POJO in Java-speak). Instead, you extract values from the tuple (or POJO) and perform the aggregation on that. So, again, when calculating the BALANCE, we need to extract the AMOUNT first.

The comparator now takes two comparisons into account. Unforunately JEP-101 wasn’t entirely implemented, which is why we need to help the compiler with type inference here.

The Window.value() is now a tuple, not a single value. So we need to extract the interesting column from it, the AMOUNT (via t -> t.v3). On the other hand, we can simply concat() that additional value to the tuple

But that’s already it. Apart from the verbosity of the comparator (which we’ll certainly address in a future jOOλ version), writing a window function is a piece of cake.

What else can we do?

This article is not a complete description of all we can do with the new API. We’ll soon write a follow-up blog post with additional examples. For instance:

The partition by clause wasn’t described, but is available too

You can specify many more windows than the single window exposed here, each with individual PARTITION BY, ORDER BY and frame specifications

Also, the current implementation is rather canonical, i.e. it doesn’t (yet) cache aggregations:

For unordered / unframed windows (same value for all the partition)

Strictly ascendingly framed windows (aggregation can be based on previous value, for associative collectors like SUM(), or toString())

That’s it from our part. Download jOOλ, play around with it and enjoy the fact that the most awesome SQL feature is now available for all of you Java 8 developers! https://github.com/jOOQ/jOOL

Col1 Col2 Col3 Col4
----------------------
A 0 1 5
B 0 4 0
C 2 0 0
D 0 0 0
E 3 5 0
F 0 3 0
G 0 3 1
H 0 1 5
I 3 5 0

The above data set contains a couple of interesting data points that are non-zero, and some gaps modelled by the value zero. In other examples, we could replace zero by NULL, but it would still be the same problem. The desired result is the following:

Col1 Col2 Col3 Col4
----------------------
A 0 1 5
B 0 45
C 24 5
D 2 4 5
E 3 5 5
F 3 3 5
G 3 3 1
H 3 1 5
I 3 5 5

Note that all the generated values are highlighted in red, and they correspond to the most recent blue value.

How to do it with SQL? We’ll be looking at two solutions:

A solution using window functions

This is the solution you should be looking for, and there are two answers in the linked Stack Overflow question that both make use of window functions:

Both solutions are roughly equivalent. Here’s how they work (using Oracle syntax):

WITH t(col1, col2, col3, col4) AS (
SELECT 'A', 0, 1, 5 FROM DUAL UNION ALL
SELECT 'B', 0, 4, 0 FROM DUAL UNION ALL
SELECT 'C', 2, 0, 0 FROM DUAL UNION ALL
SELECT 'D', 0, 0, 0 FROM DUAL UNION ALL
SELECT 'E', 3, 5, 0 FROM DUAL UNION ALL
SELECT 'F', 0, 3, 0 FROM DUAL UNION ALL
SELECT 'G', 0, 3, 1 FROM DUAL UNION ALL
SELECT 'H', 0, 1, 5 FROM DUAL UNION ALL
SELECT 'I', 3, 5, 0 FROM DUAL
)
SELECT
col1,
nvl(last_value(nullif(col2, 0))
IGNORE NULLS OVER (ORDER BY col1), 0) col2,
nvl(last_value(nullif(col3, 0))
IGNORE NULLS OVER (ORDER BY col1), 0) col3,
nvl(last_value(nullif(col4, 0))
IGNORE NULLS OVER (ORDER BY col1), 0) col4
FROM t

Now, let’s decompose these window functions:

NULLIF(colx, 0)

This is just an easy way of producing NULL values whenever we have what is an accepted “empty” value in our data set. So, instead of zeros, we just get NULL. Applying this function to our data, we’re getting:

Col1 Col2 Col3 Col4
----------------------
A NULL 1 5
B NULL 4 NULL
C 2 NULL NULL
D NULL NULL NULL
E 3 5 NULL
F NULL 3 NULL
G NULL 3 1
H NULL 1 5
I 3 5 NULL

We’re doing this because now, we can make use of the useful IGNORE NULLS clause that is available to some ranking functions, specifically LAST_VALUE(), or LAG(). We can now write:

last_value(...) IGNORE NULLS OVER (ORDER BY col1)

Where we take the last non-NULL value that precedes the current row when ordering rows by col1:

If the current row contains a non-NULL value, we’re taking that value.

If the current row contains a NULL value, we’re going “up” until we reach a non-NULL value

If we’re going “up” and we haven’t reached any non-NULL value, well, we get NULL

This leads to the following result:

Col1 Col2 Col3 Col4
----------------------
A NULL 1 5
B NULL 4 5
C 2 4 5
D 2 4 5
E 3 5 5
F 3 3 5
G 3 3 1
H 3 1 5
I 3 5 5

Note that with most window functions, once you specify an ORDER BY clause, then the following frame clause is taken as a default:

last_value(...) IGNORE NULLS OVER (
ORDER BY col1
ROWS BETWEEN UNBOUNDED PRECEEDING AND CURRENT ROW
)

That’s a lot of keywords, but their meaning is not really that obscure once you get a hang of window functions. We suggest reading the following blog posts to learn more about them:

Finally, because we don’t want those NULL values to remain in our results, we simply remove them using NVL() (or COALESCE() in other databases):

nvl(last_value(...) IGNORE NULLS OVER (...), 0)

Easy, isn’t it? Note, that in this particular case, LAG() and LAST_VALUE() will have the same effect.

A solution using the MODEL clause

Whenever you have a problem in (Oracle) SQL, that starts getting hard to solve with window functions, the Oracle MODEL clause might offer an “easy” solution to it. I’m using quotes on “easy”, because the syntax is a bit hard to remember, but the essence of it is really not that hard.

The MODEL clause is nothing else than an Oracle-specific dialect for implementing spreadsheet-like logic in the database. I highly recommend reading the relevant Whitepaper by Oracle, which explains the functionality very well:

Here’s how you could tackle the problem with MODEL (and bear with me):

WITH t(col1, col2, col3, col4) AS (
SELECT 'A', 0, 1, 5 FROM DUAL UNION ALL
SELECT 'B', 0, 4, 0 FROM DUAL UNION ALL
SELECT 'C', 2, 0, 0 FROM DUAL UNION ALL
SELECT 'D', 0, 0, 0 FROM DUAL UNION ALL
SELECT 'E', 3, 5, 0 FROM DUAL UNION ALL
SELECT 'F', 0, 3, 0 FROM DUAL UNION ALL
SELECT 'G', 0, 3, 1 FROM DUAL UNION ALL
SELECT 'H', 0, 1, 5 FROM DUAL UNION ALL
SELECT 'I', 3, 5, 0 FROM DUAL
)
SELECT * FROM t
MODEL
DIMENSION BY (row_number() OVER (ORDER BY col1) rn)
MEASURES (col1, col2, col3, col4)
RULES (
col2[any] = DECODE(col2[cv(rn)], 0, NVL(col2[cv(rn) - 1], 0), col2[cv(rn)]),
col3[any] = DECODE(col3[cv(rn)], 0, NVL(col3[cv(rn) - 1], 0), col3[cv(rn)]),
col4[any] = DECODE(col4[cv(rn)], 0, NVL(col4[cv(rn) - 1], 0), col4[cv(rn)])
)

There are three clauses that are of interest here:

The DIMENSION BY clause

Like in a Microsoft Excel spreadsheet, the DIMENSION corresponds to the consecutive, distinct index of each spreadsheet cell, by which we want to access the cell. In Excel, there are always two dimensions (one written with letters A..Z, AA..ZZ, …) and the other one written with numbers (1..infinity).

Using MODEL, you can specify as many dimensions as you want. In our example, we’ll only use one, the row number of each row, ordered by col1 (another use case for a window function).

The MEASURES clause

The MEASURES clause specifies the individual cell values for each “cell”. In Microsoft Excel, a cell can have only one value. In Oracle’s MODEL clause, we can operate on many values at once, within a “cell”.

In this case, we’ll just make all the columns our cells.

The RULES clause

This is the really interesting part in the MODEL clause. Here, we specify by what rules we want to calculate the values of each individual cell. The syntax is simple:

RULES (
<rule 1>,
<rule 2>,
...,
<rule N>
)

Each individual rule can implement an assignment of the form:

RULES (
cell[dimension(s)] = rule
)

In our case, we’ll repeat the same rule for cells col2, col3, and col4, and for any value of the dimension rn (for row number). So, the left-hand side of the assignment is

DECODE is a simple and useful Oracle function that takes a first argument, compares it with argument 2, and if they’re the same, returns argument 3, otherwise argument 4. It works like a CASE, which is a bit more verbose:

DECODE(A, B, C, D)
-- The same as:
CASE A WHEN B THEN C ELSE D END

cv(rn)

cv() is a MODEL specific “function” that means “current value”. On the left-hand side of the assignment, we used "any" as the dimension specifier, so we’re applying this rule for “any” value of rn. In order to access a specific rn value, we’ll simply write cv(rn), or the “current value of rn”.

recursiveness

The RULES of the MODEL clause are allowed to span a recursive tree (although not a graph, so no cycles are allowed), where each cell can be defined based on a previous cell, which is again defined based on its predecessor. We’re doing this via col2[cv(rn) - 1], where cv(rn) - 1 means the “current row number minus one”.

Easy, right? Granted. The syntax isn’t straight-forward and we’re only scratching the surface of what’s possible with MODEL.

Conclusion

SQL provides cool ways to implementing data-driven, declarative specifications of what your data should be like. The MODEL clause is a bit eerie, but at the same time extremely powerful. Much easier and also a bit faster are window functions, a tool that should be in the tool chain of every developer working with SQL.

In this article, we’ve shown how to fill gaps in sparse data using window functions or MODEL. A similar use-case are running totals. If this article has triggered your interest, I suggest reading about different approaches of calculating a running total in SQL.

A very interesting problem that can be solved very easily with SQL is to find consecutive series of events in a time series. But what is a consecutive series of events in a time series?

Take Stack Overflow, for example. Stack Overflow has a cool reputation system that uses badges to reward certain behaviour. As a social website, they encourage users to visit the platform every day. As such, two distinct badges are awarded:

Informally, it is obvious what this means. You’ll have to log in on day 1. Then again on day 2. Then again (perhaps several times, it doesn’t matter) on day 3. Forgot to log in on day 4? Ooops. We’ll start counting again.

Note that we won’t query the consecutive days of visits, as this information is not made available publicly. Instead, let’s query the consecutive days of posts a user has made.

The backing database is SQL Server, so we can run the following statement:

SELECT DISTINCT CAST(CreationDate AS DATE) AS date
FROM Posts
WHERE OwnerUserId = ##UserId##
ORDER BY 1

… which, for my own UserId generates something like:

As we can see in the data, there have been gaps in the very early days:

date
--------------------------------------
2010-11-26
2010-11-27 <---- Gap here after 2 days
2010-11-29
2010-11-30
2010-12-01
2010-12-02
2010-12-03 <---- Gap here after 5 days
2010-12-05
2010-12-06
2010-12-07
2010-12-08
2010-12-09 <---- Gap here after 5 days
2010-12-13
2010-12-14
...

Visually, it is very easy to see how many days in a row there were posts without any gaps. But how to do it with SQL?

To simplify the problem, let’s “store” individual queries in common table expressions. The above query, we’ll call dates:

WITH
-- This table contains all the distinct date
-- instances in the data set
dates(date) AS (
SELECT DISTINCT CAST(CreationDate AS DATE)
FROM Posts
WHERE OwnerUserId = ##UserId##
)
...

Now, the goal of the resulting query is to put all consecutive dates in the same group, such that we can aggregate over this group. The following query is what we want to write:

SELECT
COUNT(*) AS consecutiveDates,
MIN(date) AS minDate,
MAX(date) AS maxDate
FROM groups
GROUP BY grp -- This "grp" value will be explained later
ORDER BY 1 DESC, 2 DESC

We’d like to aggregate each group “grp” and count the number of dates in the group, as well as find the lowest and the highest date within each group.

Generating groups for consecutive dates

Let’s look at the data again, and to illustrate the idea, we’ll add consecutive row numbers, regardless of the gaps in dates:

row number date
--------------------------------
1 2010-11-26
2 2010-11-27
3 2010-11-29 <-- gap before this row
4 2010-11-30
5 2010-12-01
6 2010-12-02
7 2010-12-03
8 2010-12-05 <-- gap before this row

As you can see, regardless whether there is a gap between dates (two dates are not consecutive), their row numbers will still be consecutive. We can do this with the ROW_NUMBER() window function, very easily:

SELECT
ROW_NUMBER() OVER (ORDER BY date) AS [row number],
date
FROM dates

Now, let’s check out the following, interesting query:

WITH
-- This table contains all the distinct date
-- instances in the data set
dates(date) AS (
SELECT DISTINCT CAST(CreationDate AS DATE)
FROM Posts
WHERE OwnerUserId = ##UserId##
),
-- Generate "groups" of dates by subtracting the
-- date's row number (no gaps) from the date itself
-- (with potential gaps). Whenever there is a gap,
-- there will be a new group
groups AS (
SELECT
ROW_NUMBER() OVER (ORDER BY date) AS rn,
dateadd(day, -ROW_NUMBER() OVER (ORDER BY date), date) AS grp,
date
FROM dates
)
SELECT *
FROM groups
ORDER BY rn

All we did is subtract the row number from the date to get a new date “grp“. The actual date obtained this way is irrelevant. It’s just an auxiliary value.

What we can guarantee, though, is that for consecutive dates, the value of grp will be the same because for all consecutive dates, the following two equations yield true:

date2 - date1 = 1 // difference in days between dates
rn2 - rn1 = 1 // difference in row numbers

Yet, for non-consecutive dates, while the difference in row numbers is still 1, the difference in days is no longer 1. The groups can now be seen easily:

WITH
-- This table contains all the distinct date
-- instances in the data set
dates(date) AS (
SELECT DISTINCT CAST(CreationDate AS DATE)
FROM Posts
WHERE OwnerUserId = ##UserId##
),
-- Generate "groups" of dates by subtracting the
-- date's row number (no gaps) from the date itself
-- (with potential gaps). Whenever there is a gap,
-- there will be a new group
groups AS (
SELECT
ROW_NUMBER() OVER (ORDER BY date) AS rn,
dateadd(day, -ROW_NUMBER() OVER (ORDER BY date), date) AS grp,
date
FROM dates
)
SELECT
COUNT(*) AS consecutiveDates,
MIN(date) AS minDate,
MAX(date) AS maxDate
FROM groups
GROUP BY grp
ORDER BY 1 DESC, 2 DESC

The fact that we chose the granularity of days in the above query is a random choice. We simply took the timestamp from our time series and “collapsed” it to the desired granularity using a CAST function:

SELECT DISTINCT CAST(CreationDate AS DATE)

If we want to know the consecutive weeks, we’ll simply change that function to a different expression, e.g.

This new expression takes the year and the week and generates values like 201503 for week 03 in the year 2015. The rest of the statement remains exactly the same:

WITH
weeks(week) AS (
SELECT DISTINCT datepart(year, CreationDate) * 100
+ datepart(week, CreationDate)
FROM Posts
WHERE OwnerUserId = ##UserId##
),
groups AS (
SELECT
ROW_NUMBER() OVER (ORDER BY week) AS rn,
dateadd(day, -ROW_NUMBER() OVER (ORDER BY week), week) AS grp,
week
FROM weeks
)
SELECT
COUNT(*) AS consecutiveWeeks,
MIN(week) AS minWeek,
MAX(week) AS maxWeek
FROM groups
GROUP BY grp
ORDER BY 1 DESC, 2 DESC

If we go back to our consecutive days example, we can rewrite the query to find the distinct dates AND the groups in one go, using DENSE_RANK():

WITH
groups(date, grp) AS (
SELECT DISTINCT
CAST(CreationDate AS DATE),
dateadd(day,
-DENSE_RANK() OVER (ORDER BY CAST(CreationDate AS DATE)),
CAST(CreationDate AS DATE)) AS grp
FROM Posts
WHERE OwnerUserId = ##UserId##
)
SELECT
COUNT(*) AS consecutiveDates,
MIN(date) AS minDate,
MAX(date) AS maxDate
FROM groups
GROUP BY grp
ORDER BY 1 DESC, 2 DESC

The above has been one very useful example of using window functions (ROW_NUMBER()) in SQL. Learn more about window functions in any of the following articles:

In this article, we’ll compare the user’s imperative approach to the extremely elegant (Oracle) SQL approach. We’ll be making use of any combination of these awesome SQL features:

The user alhashmiya who had asked this question, was looking for a solution to the problem of finding the “closest” sum of elements in a subset of numbers A to a set of “expected” sums B. More concretely, alhasmiya had the following two tables:

ID ASSIGN_AMT
--------------
1 25150
2 19800
3 27511

The ASSIGN_AMT value is the “expected” sum. What alhashmiya was looking for is the sum of a subset of WORK_AMT values A, such that this sum is as close as possible to any of the “expected” sums. There are two ways to understand this problem:

The possible “closest” sums are restricted to be the sums obtained in a strictly defined order, e.g. ordered by ID. An application of this understanding is to find out the exact moment when a well-defined, ordered running total (e.g. bank account balance) has exceeded a certain threshold

The possible “closest” sums are unrestricted. Any unordered subset qualifies to calculate such a sum. An application of this understanding is to find a combination of discrete values to reach a target value as closely as possible, e.g. to optimise a trade.

The second understanding is called the “subset sum problem”, for which there are only exponential algorithms if you’re looking for an exact solution. It is important to notice that this algorithm will NOT scale well at all, regardless of the solution technique!

But let’s look at the simpler understanding first:

1. Calculating a sum of an ordered subset of values

By strictly ordered we mean (in the sense of the question) that we want to order all WORK_AMT values, e.g. by ID, and allow only for sums that can appear in this particular order. I.e.

WITH
VALS (ID, WORK_AMT) AS (
SELECT 1 , 7120 FROM DUAL
UNION ALL SELECT 2 , 8150 FROM DUAL
UNION ALL SELECT 3 , 8255 FROM DUAL
UNION ALL SELECT 4 , 9051 FROM DUAL
UNION ALL SELECT 5 , 1220 FROM DUAL
UNION ALL SELECT 6 , 12515 FROM DUAL
UNION ALL SELECT 7 , 13555 FROM DUAL
UNION ALL SELECT 8 , 5221 FROM DUAL
UNION ALL SELECT 9 , 812 FROM DUAL
UNION ALL SELECT 10, 6562 FROM DUAL
)
SELECT
ID,
WORK_AMT,
SUM (WORK_AMT) OVER (ORDER BY ID) AS SUBSET_SUM
FROM
VALS
ORDER BY
ID

The above window function calculates the sum of all WORK_AMT values that are in the “window” of values where the ID is smaller or equal to the current ID.

Finding the “closest” of these sums with quantified comparison predicates

Now, the task at hand is to find for each value ASSIGN_AMT in 25150, 19800, and 27511 the closest value of SUBSET_SUM. In a way, what we are trying to do is we want to minimise the expression ABS(SUBSET_SUM - ASSIGN_AMT).

However, the MIN() aggregate function won’t help us here, because that will simply return the minimum value of this difference. We want the value of SUBSET_SUM that produces this difference in the first place.

One solution would be to use a quantified comparison predicate, a rarely used and not well-known comparison operator that works in all SQL databases:

-- The common table expressions are the same as
-- in the previous examples
WITH
ASSIGN(ID, ASSIGN_AMT) AS (
SELECT 1, 25150 FROM DUAL
UNION ALL SELECT 2, 19800 FROM DUAL
UNION ALL SELECT 3, 27511 FROM DUAL
),
VALS (ID, WORK_AMT) AS (
SELECT 1 , 7120 FROM DUAL
UNION ALL SELECT 2 , 8150 FROM DUAL
UNION ALL SELECT 3 , 8255 FROM DUAL
UNION ALL SELECT 4 , 9051 FROM DUAL
UNION ALL SELECT 5 , 1220 FROM DUAL
UNION ALL SELECT 6 , 12515 FROM DUAL
UNION ALL SELECT 7 , 13555 FROM DUAL
UNION ALL SELECT 8 , 5221 FROM DUAL
UNION ALL SELECT 9 , 812 FROM DUAL
UNION ALL SELECT 10, 6562 FROM DUAL
),
SUMS (ID, WORK_AMT, SUBSET_SUM) AS (
SELECT
VALS.*,
SUM (WORK_AMT) OVER (ORDER BY ID)
FROM
VALS
)
-- This is now the interesting part, where we
-- calculate the closest sum
SELECT
ASSIGN.ID,
ASSIGN.ASSIGN_AMT,
SUBSET_SUM
FROM
ASSIGN
JOIN
SUMS
ON
ABS (ASSIGN_AMT - SUBSET_SUM) <= ALL (
SELECT
ABS (ASSIGN_AMT - SUBSET_SUM)
FROM
SUMS
)

The quantified comparison predicate reads intuitively. We’re looking for that specific SUBSET_SUM whose difference to the “expected” ASSIGN_AMT is smaller or equal to all the other possible differences.

You may have noticed that the solution is not entirely correct in the event where WORK_AMT is allowed to be zero (let’s ignore the possibility of negative values) – in case of which we’ll produce duplicate values in the JOIN. This can be achieved when replacing:

One solution is to remove those duplicates using DISTINCT (which is an anti-pattern. See #6 in this list). A better solution is to make the JOIN predicate unambiguous by comparing ID values as well, i.e.:

SELECT
ASSIGN.ID,
ASSIGN.ASSIGN_AMT,
SUBSET_SUM
FROM
ASSIGN
JOIN
SUMS
ON
(ABS (ASSIGN_AMT - SUBSET_SUM), SUMS.ID) <= ALL (
SELECT
ABS (ASSIGN_AMT - SUBSET_SUM),
ID
FROM
SUMS
)

The above unfortunately doesn’t work in Oracle (yet), which will report an error:

ORA-01796: this operator cannot be used with lists

Finding the “closest” of these sums with Oracle’s FIRST function

Oracle has a very interesting function to keep the first (or last) values in a set of aggregate values of a group given any particular ordering, and calculating an aggregate function only on those values within the group. The following SQL statement will illustrate this:

SELECT
ASSIGN.ID,
ASSIGN.ASSIGN_AMT,
MIN (SUBSET_SUM) KEEP (
DENSE_RANK FIRST
ORDER BY ABS (ASSIGN_AMT - SUBSET_SUM)
) AS CLOSEST_SUM
FROM
ASSIGN
CROSS JOIN
SUMS
GROUP BY
ASSIGN.ID, ASSIGN.ASSIGN_AMT

Essentially, we’re grouping all values from the SUMS table for each ASSIGN_AMT. For each of those groups, we’ll look for the "FIRST" row that appears when ordering rows within the group by our criteria ABS(ASSIGN_AMT - SUBSET_SUM). We "KEEP" only those rows in the group, and retain from those rows the minimum SUBSET_SUM.

SELECT DISTINCT
ASSIGN.ID,
ASSIGN.ASSIGN_AMT,
FIRST_VALUE (SUBSET_SUM) OVER (
PARTITION BY ASSIGN.ID
ORDER BY ABS (ASSIGN_AMT - SUBSET_SUM)
) AS CLOSEST_SUM
FROM
ASSIGN
CROSS JOIN
SUMS

Unfortunately, these solutions all produce duplicates, which we have to remove either via GROUP BY (KEEP solution), or via DISTINCT (FIRST_VALUE solution).

Finding the “closest” of these sums with LATERAL JOIN

A cleaner solution that doesn’t rely on the removal of duplicates is using Oracle 12c’s new FETCH FIRST clause along with CROSS JOIN LATERAL (or CROSS APPLY, which is the same)

SELECT
ASSIGN.ID,
ASSIGN.ASSIGN_AMT,
CLOSEST_SUM
FROM
ASSIGN
CROSS JOIN LATERAL (
SELECT
SUBSET_SUM AS CLOSEST_SUM
FROM
SUMS
ORDER BY
ABS (ASSIGN.ASSIGN_AMT - SUBSET_SUM)
FETCH FIRST 1 ROW ONLY
) SUMS

What does this mean? We’re essentially joining for each value in ASSIGN only the FIRST 1 ROW in SUMS, ordered by the usual criteria. We need LATERAL (or APPLY), because this allows us to reference columns from the left side of the JOIN expression also in the right side, which wouldn’t be possible otherwise.

The same functionality is supported in SQL Server (only using CROSS APPLY), or in PostgreSQL (only using CROSS JOIN LATERAL).

Lateral can be very useful whenever the right hand side of a JOIN depends on the left hand side. Unlike ordinary joins, this means that the JOIN order will be set in stone from left to right, and the optimiser has a reduced set of join algorithm options. This is useful in examples like these (with ORDER BY and FETCH FIRST), or when joining unnested table-valued functions, which we’ll cover in a future blog post.

2. Calculating a sum of any subset of values

So far, we’ve worked on a simplified version of the problem. This is probably not what alhashmiya meant on their Stack Overflow question. They probably wanted to solve the Subset sum problem, finding the “closest” sum of any subset of WORK_AMT values.

We’ll use recursive SQL to calculate all the possible sums. First off, let’s remember how recursive SQL works:

In recursive SQL, we need a UNION ALL query in a common table expression (WITH clause in Oracle, or WITH RECURSIVE clause in PostgreSQL). The first subquery of UNION ALL generates the “seed row(s)” of the recursion, and the second subqeury of UNION ALL generates the recursion based on a SELECT that selects from the table being declared, recursively.

So, a naive solution to this subset sum problem can be seen here:

-- Repetition of the previous data
WITH
ASSIGN (ID, ASSIGN_AMT) AS (
SELECT 1, 25150 FROM DUAL
UNION ALL SELECT 2, 19800 FROM DUAL
UNION ALL SELECT 3, 27511 FROM DUAL
),
WORK (ID, WORK_AMT) AS (
SELECT 1 , 7120 FROM DUAL
UNION ALL SELECT 2 , 8150 FROM DUAL
UNION ALL SELECT 3 , 8255 FROM DUAL
UNION ALL SELECT 4 , 9051 FROM DUAL
UNION ALL SELECT 5 , 1220 FROM DUAL
UNION ALL SELECT 6 , 12515 FROM DUAL
UNION ALL SELECT 7 , 13555 FROM DUAL
UNION ALL SELECT 8 , 5221 FROM DUAL
UNION ALL SELECT 9 , 812 FROM DUAL
UNION ALL SELECT 10, 6562 FROM DUAL
),
-- A new way of calculating all possible sums by
-- Recursively adding up all the sums
SUMS (SUBSET_SUM, MAX_ID) AS (
SELECT
WORK_AMT,
ID
FROM
WORK
UNION ALL
SELECT
WORK_AMT + SUBSET_SUM,
WORK.ID
FROM
SUMS
JOIN
WORK
ON
SUMS.MAX_ID < WORK.ID
)
-- The same technique to match the "closest" sum
-- As before
SELECT
ASSIGN.ID,
ASSIGN.ASSIGN_AMT,
MIN (SUBSET_SUM) KEEP (
DENSE_RANK FIRST
ORDER BY ABS (ASSIGN_AMT - SUBSET_SUM)
) AS CLOSEST_SUM
FROM
SUMS
CROSS JOIN
ASSIGN
GROUP BY
ASSIGN.ID, ASSIGN.ASSIGN_AMT

The recursion is simple. In the first subquery of the recursion (the “seed row”), we select each individual row in WORK:

SELECT
WORK_AMT,
ID
FROM
WORK

In the second subquery of the recursion (the “recusion rows”), we join the value of the previous recursion step (SUMS) with all the remaining values (WORK), i.e. all the values that have a higher ID:

SELECT
WORK_AMT + SUBSET_SUM,
WORK.ID
FROM
SUMS
JOIN
WORK
ON
SUMS.MAX_ID < WORK.ID

In this combination, we calculate the intermediate sum (which is also a running total, by the way) and we calculate the highest summed-up ID thus far, to reduce the number of combinations. The latter, we can do because summing is commutative.

The main difference in this solution compared to previous approaches is the fact that we’re now generating a lot (a huge lot) more different values in the SUMS table.

After a still acceptable 0.112s for 10 different WORK_AMT values, the database calculated:

But beware, as soon as you start adding values to the VALS table, this algorithm starts exploding in time and space complexity. Running the same query with the following, only slightly bigger WORK table already requires 16.3 seconds to yield a result:

WORK(ID, WORK_AMT) AS (
SELECT 1 , 7120 FROM DUAL
UNION ALL SELECT 2 , 8150 FROM DUAL
UNION ALL SELECT 3 , 8255 FROM DUAL
UNION ALL SELECT 4 , 9051 FROM DUAL
UNION ALL SELECT 5 , 1220 FROM DUAL
UNION ALL SELECT 6 , 12515 FROM DUAL
UNION ALL SELECT 7 , 13555 FROM DUAL
UNION ALL SELECT 8 , 5221 FROM DUAL
UNION ALL SELECT 9 , 812 FROM DUAL
UNION ALL SELECT 10, 6562 FROM DUAL
UNION ALL SELECT 11, 1234 FROM DUAL
UNION ALL SELECT 12, 61 FROM DUAL
UNION ALL SELECT 13, 616 FROM DUAL
UNION ALL SELECT 14, 2456 FROM DUAL
UNION ALL SELECT 15, 5161 FROM DUAL
UNION ALL SELECT 16, 414 FROM DUAL
UNION ALL SELECT 17, 516 FROM DUAL
UNION ALL SELECT 18, 617 FROM DUAL
UNION ALL SELECT 19, 146 FROM DUAL
),

Want proof about the actual sum? That’s easy as well with recursive SQL:

-- Repetition of the previous data
WITH
ASSIGN (ID, ASSIGN_AMT) AS (
SELECT 1, 25150 FROM DUAL
UNION ALL SELECT 2, 19800 FROM DUAL
UNION ALL SELECT 3, 27511 FROM DUAL
),
WORK (ID, WORK_AMT) AS (
SELECT 1 , 7120 FROM DUAL
UNION ALL SELECT 2 , 8150 FROM DUAL
UNION ALL SELECT 3 , 8255 FROM DUAL
UNION ALL SELECT 4 , 9051 FROM DUAL
UNION ALL SELECT 5 , 1220 FROM DUAL
UNION ALL SELECT 6 , 12515 FROM DUAL
UNION ALL SELECT 7 , 13555 FROM DUAL
UNION ALL SELECT 8 , 5221 FROM DUAL
UNION ALL SELECT 9 , 812 FROM DUAL
UNION ALL SELECT 10, 6562 FROM DUAL
),
-- A new way of calculating all possible sums by
-- Recursively adding up all the sums
SUMS (SUBSET_SUM, MAX_ID, CALC) AS (
SELECT
WORK_AMT,
ID,
TO_CHAR(WORK_AMT)
FROM WORK
UNION ALL
SELECT
WORK_AMT + SUBSET_SUM,
WORK.ID,
CALC || '+' || WORK_AMT
FROM
SUMS
JOIN
WORK
ON
SUMS.MAX_ID < WORK.ID
)
-- The same technique to match the "closest" sum
-- As before
SELECT
ASSIGN.ID,
ASSIGN.ASSIGN_AMT,
MIN (SUBSET_SUM) KEEP (
DENSE_RANK FIRST
ORDER BY ABS (ASSIGN_AMT - SUBSET_SUM)
) AS CLOSEST_SUM,
MIN (CALC) KEEP (
DENSE_RANK FIRST
ORDER BY ABS (ASSIGN_AMT - SUBSET_SUM)
) AS CALCULATION
FROM
SUMS
CROSS JOIN
ASSIGN
GROUP BY
ASSIGN.ID, ASSIGN.ASSIGN_AMT

SQL is powerful. Extremely powerful. In this article, we’ve used various techniques to calculate the subset sum problem, or a simplification thereof. We’ve shown how to solve this problem in Oracle or PostgreSQL using a combination of these awesome SQL features: