These two SQL standard language features are among the most powerful SQL features that are available from most other databases. I frequently include them in conference talks about SQL (see my article about 10 SQL Tricks That You Didn’t Think Were Possible), and as well in the Data Geekery SQL Masterclass. With MySQL 8.0 now supporting these exciting features, the masterclass will be including MySQL as well (along with Oracle, SQL Server, PostgreSQL, and DB2). And, of course, these features are now supported in the upcoming jOOQ 3.10 as well.

Then, connect to this instance and run this nice little query in it:

WITH RECURSIVE t(a, b) AS (
SELECT 1, CAST('a' AS CHAR(15))
UNION ALL
SELECT t.a + 1, CONCAT(t.b, 'a')
FROM t
WHERE t.a < 10
)
SELECT a, SUM(a) OVER (ORDER BY a) AS ∑, b
FROM t

And get this result:

a ∑ b
--------------------------
1 1 a
2 3 aa
3 6 aaa
4 10 aaaa
5 15 aaaaa
6 21 aaaaaa
7 28 aaaaaaa
8 36 aaaaaaaa
9 45 aaaaaaaaa
10 55 aaaaaaaaaa

Would you believe this is MySQL?

Bonus

A nice “hidden” feature is the support of new pessimistic locking clauses, in particular FOR UPDATE SKIP LOCKED. This has been available in Oracle for ages and since recently in PostgreSQL as well, and now in MySQL. A very useful feature when implementing simple message queues or reservation systems. More details in this article here:

Listicles like these do work – not only do they attract attention, if the content is also valuable (and in this case it is, trust me), the article format can be extremely entertaining.

This article will bring you 10 SQL tricks that many of you might not have thought were possible. The article is a summary of my new, extremely fast-paced, ridiculously childish-humoured talk, which I’m giving at conferences (recently at JAX, and Devoxx France. You may quote me on this:

Here are 10 SQL Tricks That You Didn’t Think Were Possible:

Introduction

In order to understand the value of these 10 SQL tricks, it is first important to understand the context of the SQL language. Why do I talk about SQL at Java conferences? (and I’m usually the only one!) This is why:

From early days onwards, programming language designers had this desire to design languages in which you tell the machine WHAT you want as a result, not HOW to obtain it. For instance, in SQL, you tell the machine that you want to “connect” (JOIN) the user table and the address table and find the users that live in Switzerland. You don’t care HOW the database will retrieve this information (e.g. should the users table be loaded first, or the address table? Should the two tables be joined in a nested loop or with a hashmap? Should all data be loaded in memory first and then filtered for Swiss users, or should we only load Swiss addresses in the first place? Etc.)

As with every abstraction, you will still need to know the basics of what’s going on behind the scenes in a database to help the database make the right decisions when you query it. For instance, it makes sense to:

Establish a formal foreign key relationship between the tables (this tells the database that every address is guaranteed to have a corresponding user)

Add an index on the search field: The country (this tells the database that specific countries can be found in O(log N) instead of O(N))

But once your database and your application matures, you will have put all the important meta data in place and you can focus on your business logic only. The following 10 tricks show amazing functionality written in only a few lines of declarative SQL, producing simple and also complex output.

1. Everything is a Table

This is the most trivial of tricks, and not even really a trick, but it is fundamental to a thorough understanding of SQL: Everything is a table! When you see a SQL statement like this:

SELECT *
FROM person

… you will quickly spot the table person sitting right there in the FROM clause. That’s cool, that is a table. But did you realise that the whole statement is also a table? For instance, you can write:

SELECT *
FROM (
SELECT *
FROM person
) t

And now, you have created what is called a “derived table” – i.e. a nested SELECT statement in a FROM clause.

That’s trivial, but if you think of it, quite elegant. You can also create ad-hoc, in-memory tables with the VALUES() constructor as such, in some databases (e.g. PostgreSQL, SQL Server):

SELECT *
FROM (
VALUES(1),(2),(3)
) t(a)

Which simply yields:

a
---
1
2
3

If that clause is not supported, you can revert to derived tables, e.g. in Oracle:

SELECT *
FROM (
SELECT 1 AS a FROM DUAL UNION ALL
SELECT 2 AS a FROM DUAL UNION ALL
SELECT 3 AS a FROM DUAL
) t

Now that you’re seeing that VALUES() and derived tables are really the same thing, conceptually, let’s review the INSERT statement, which comes in two flavours:

-- SQL Server, PostgreSQL, some others:
INSERT INTO my_table(a)
VALUES(1),(2),(3);
-- Oracle, many others:
INSERT INTO my_table(a)
SELECT 1 AS a FROM DUAL UNION ALL
SELECT 2 AS a FROM DUAL UNION ALL
SELECT 3 AS a FROM DUAL

In SQL everything is a table. When you’re inserting rows into a table, you’re not really inserting individual rows. You’re really inserting entire tables. Most people just happen to insert a single-row-table most of the time, and thus don’t realise what INSERT really does.

Everything is a table. In PostgreSQL, even functions are tables:

SELECT *
FROM substring('abcde', 2, 3)

The above yields:

substring
---------
bcd

If you’re programming in Java, you can use the analogy of the Java 8 Stream API to take this one step further. Consider the following equivalent concepts:

TABLE : Stream<Tuple<..>>
SELECT : map()
DISTINCT : distinct()
JOIN : flatMap()
WHERE / HAVING : filter()
GROUP BY : collect()
ORDER BY : sorted()
UNION ALL : concat()

With Java 8, “everything is a Stream” (as soon as you start working with Streams, at least). No matter how you transform a stream, e.g. with map() or filter(), the resulting type is always a Stream again.

Common Table Expressions (also: CTE, also referred to as subquery factoring, e.g. in Oracle) are the only way to declare variables in SQL (apart from the obscure WINDOW clause that only PostgreSQL and Sybase SQL Anywhere know).

This is a powerful concept. Extremely powerful. Consider the following statement:

-- Table variables
WITH
t1(v1, v2) AS (SELECT 1, 2),
t2(w1, w2) AS (
SELECT v1 * 2, v2 * 2
FROM t1
)
SELECT *
FROM t1, t2

It yields

v1 v2 w1 w2
-----------------
1 2 2 4

Using the simple WITH clause, you can specify a list of table variables (remember: everything is a table), which may even depend on each other.

That is easy to understand. This makes CTE (Common Table Expressions) already very useful, but what’s really really awesome is that they’re allowed to be recursive! Consider the following PostgreSQL example:

WITH RECURSIVE t(v) AS (
SELECT 1 -- Seed Row
UNION ALL
SELECT v + 1 -- Recursion
FROM t
)
SELECT v
FROM t
LIMIT 5

It yields

v
---
1
2
3
4
5

How does it work? It’s relatively easy, once you see through the many keywords. You define a common table expression that has exactly two UNION ALL subqueries.

The first UNION ALL subquery is what I usually call the “seed row”. It “seeds” (initialises) the recursion. It can produce one or several rows on which we will recurse afterwards. Remember: everything is a table, so our recursion will happen on a whole table, not on an individual row/value.

The second UNION ALL subquery is where the recursion happens. If you look closely, you will observe that it selects from t. I.e. the second subquery is allowed to select from the very CTE that we’re about to declare. Recursively. It thus has also access to the column v, which is being declared by the CTE that already uses it.

WITH RECURSIVE q(r, i, rx, ix, g) AS (
SELECT r::DOUBLE PRECISION * 0.02, i::DOUBLE PRECISION * 0.02,
.0::DOUBLE PRECISION , .0::DOUBLE PRECISION, 0
FROM generate_series(-60, 20) r, generate_series(-50, 50) i
UNION ALL
SELECT r, i, CASE WHEN abs(rx * rx + ix * ix) <= 2 THEN rx * rx - ix * ix END + r,
CASE WHEN abs(rx * rx + ix * ix) <= 2 THEN 2 * rx * ix END + i, g + 1
FROM q
WHERE rx IS NOT NULL AND g < 99
)
SELECT array_to_string(array_agg(s ORDER BY r), '')
FROM (
SELECT i, r, substring(' .:-=+*#%@', max(g) / 10 + 1, 1) s
FROM q
GROUP BY i, r
) q
GROUP BY i
ORDER BY i

Run the above on PostgreSQL, and you’ll get something like

This blog is full of running total examples. They’re some of the most educational examples to learn about advanced SQL, because there are at least a dozen of ways how to implement a running total.

A running total is easy to understand, conceptually.

In Microsoft Excel, you would simply calculate a sum (or difference) of two previous (or subsequent) values, and then use the useful crosshair cursor to pull that formula through your entire spreadsheet. You “run” that total through the spreadsheet. A “running total”.

Window functions are a powerful concept – not so easy to understand at first, but in fact, they’re really really easy:

Window functions are aggregations / rankings on a subset of rows relative to the current row being transformed by SELECT

That’s it. 🙂

What it essentially means is that a window function can perform calculations on rows that are “above” or “below” the current row. Unlike ordinary aggregations and GROUP BY, however, they don’t transform the rows, which makes them very useful.

The syntax can be summarised as follows, with individual parts being optional

function(...) OVER (
PARTITION BY ...
ORDER BY ...
ROWS BETWEEN ... AND ...
)

So, we have any sort of function (we’ll see examples for such functions later), followed by this OVER() clause, which specifies the window. I.e. this OVER() clause defines:

The PARTITION: Only rows that are in the same partition as the current row will be considered for the window

The ORDER: The window can be ordered independently of what we’re selecting

The ROWS (or RANGE) frame definition: The window can be restricted to a fixed amount of rows “ahead” and “behind”

That’s all there is to window functions.

Now how does that help us calculate a running total? Consider the following data:

Let’s assume that BALANCE is what we want to calculate from AMOUNT

Intuitively, we can immediately see that the following holds true:

So, in plain English, any balance can be expressed with the following pseudo SQL:

TOP_BALANCE - SUM(AMOUNT) OVER (
"all the rows on top of the current row"
)

In real SQL, that would then be written as follows:

SUM(t.amount) OVER (
PARTITION BY t.account_id
ORDER BY t.value_date DESC,
t.id DESC
ROWS BETWEEN UNBOUNDED PRECEDING
AND 1 PRECEDING
)

Explanation:

The partition will calculate the sum for each bank account, not for the entire data set

The ordering will guarantee that transactions are ordered (within the partition) prior to summing

The rows clause will consider only preceding rows (within the partition, given the ordering) prior to summing

All of this will happen in-memory over the data set that has already been selected by you in your FROM .. WHERE etc. clauses, and is thus extremely fast.

Intermezzo

Before we move on to all the other awesome tricks, consider this: We’ve seen

(Recursive) Common Table Expressions (CTE)

Window functions

Both of these features are:

Awesome

Exremely powerful

Declarative

Part of the SQL standard

Available in most popular RDBMS (except MySQL)

Very important building blocks

If anything can be concluded from this article, it is the fact that you should absolutely know these two building blocks of modern SQL. Why? Because:

4. Finding the Largest Series with no Gaps

Stack Overflow has this very nice feature to motivate people to stay on their website for as long as possible. Badges:

For scale, you can see how many badges I have. Tons.

How do you calculate these badges? Let’s have a look at the “Enthusiast” and the “Fanatic”. These badges are awarded to anyone who spends a given amount of consecutive days on their platform. Regardless of any wedding date or wife’s birthday, you HAVE TO LOG IN, or the counter starts from zero again.

Now as we’re doing declarative programming, we don’t care about maintaining any state and in-memory counters. We want to express this in the form of online analytic SQL. I.e. consider this data:

Wow. Interesting. So, 14 - 1 = 13, 15 - 2 = 13, 16 - 3 = 13, but 18 - 4 = 14. No one can say it better than Doge:

There’s a simple example for this behaviour:

ROW_NUMBER() never has gaps. That’s how it’s defined

Our data, however, does

So when we subtract a “gapless” series of consecutive integers from a “gapful” series of non-consecutive dates, we will get the same date for each “gapless” subseries of consecutive dates, and we’ll get a new date again where the date series had gaps.

Huh.

This means we can now simply GROUP BY this arbitrary date value:

SELECT
min(login_date), max(login_date),
max(login_date) -
min(login_date) + 1 AS length
FROM login_date_groups
GROUP BY grp
ORDER BY length DESC

And we’re done. The largest series of consecutive dates with no gaps has been found:

WITH
login_dates AS (
SELECT DISTINCT cast(login_time AS DATE) login_date
FROM logins WHERE user_id = :user_id
),
login_date_groups AS (
SELECT
login_date,
login_date - row_number() OVER (ORDER BY login_date) AS grp
FROM login_dates
)
SELECT
min(login_date), max(login_date),
max(login_date) - min(login_date) + 1 AS length
FROM login_date_groups
GROUP BY grp
ORDER BY length DESC

Not that hard in the end, right? Of course, having the idea makes all the difference, but the query itself is really very very simple and elegant. No way you could implement some imperative-style algorithm in a leaner way than this.

Whew.

5. Finding the Length of a Series

Previously, we had seen series of consecutive values. That’s easy to deal with as we can abuse of the consecutiveness of integers. What if the definition of a “series” is less intuitive, and in addition to that, several series contain the same values? Consider the following data, where LENGTH is the length of each series that we want to calculate:

Yes, you’ve guessed right. A “series” is defined by the fact that consecutive (ordered by ID) rows have the same SIGN(AMOUNT). Check again the data formatted as below:

In this table, we want to copy the row number value into “LO” at the “lower” end of a series, and into “HI” at the “upper” end of a series. For this we’ll be using the magical LEAD() and LAG(). LEAD() can access the n-th next row from the current row, whereas LAG() can access the n-th previous row from the current row. For example:

SELECT
lag(v) OVER (ORDER BY v),
v,
lead(v) OVER (ORDER BY v)
FROM (
VALUES (1), (2), (3), (4)
) t(v)

The above query produces:

That’s awesome! Remember, with window functions, you can perform rankings or aggregations on a subset of rows relative to the current row. In the case of LEAD() and LAG(), we simply access a single row relative to the current row, given its offset. This is useful in so many cases.

Continuing with our “LO” and “HI” example, we can simply write:

SELECT
trx.*,
CASE WHEN lag(sign)
OVER (ORDER BY id DESC) != sign
THEN rn END AS lo,
CASE WHEN lead(sign)
OVER (ORDER BY id DESC) != sign
THEN rn END AS hi,
FROM trx

… in which we compare the “previous” sign (lag(sign)) with the “current” sign (sign). If they’re different, we put the row number in “LO”, because that’s the lower bound of our series.

Then we compare the “next” sign (lead(sign)) with the “current” sign (sign). If they’re different, we put the row number in “HI”, because that’s the upper bound of our series.

Finally, a little boring NULL handling to get everything right, and we’re done:

SELECT -- With NULL handling...
trx.*,
CASE WHEN coalesce(lag(sign)
OVER (ORDER BY id DESC), 0) != sign
THEN rn END AS lo,
CASE WHEN coalesce(lead(sign)
OVER (ORDER BY id DESC), 0) != sign
THEN rn END AS hi,
FROM trx

Next step. We want “LO” and “HI” to appear in ALL rows, not just at the “lower” and “upper” bounds of a series. E.g. like this:

We’re using a feature that is available at least in Redshift, Sybase SQL Anywhere, DB2, Oracle. We’re using the “IGNORE NULLS” clause that can be passed to some window functions:

SELECT
trx.*,
last_value (lo) IGNORE NULLS OVER (
ORDER BY id DESC
ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW) AS lo,
first_value(hi) IGNORE NULLS OVER (
ORDER BY id DESC
ROWS BETWEEN CURRENT ROW
AND UNBOUNDED FOLLOWING) AS hi
FROM trx

A lot of keywords! But the essence is always the same. From any given “current” row, we look at all the “previous values” (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), but ignoring all the nulls. From those previous values, we take the last value, and that’s our new “LO” value. In other words, we take the “closest preceding” “LO” value.

The same with “HI”. From any given “current” row, we look at all the “subsequent values” (ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING), but ignoring all the nulls. From the subsequent values, we take the first value, and that’s our new “HI” value. In other words, we take the “closest following” “HI” value.

Explained in Powerpoint:

Getting it 100% correct, with a little boring NULL fiddling:

SELECT -- With NULL handling...
trx.*,
coalesce(last_value (lo) IGNORE NULLS OVER (
ORDER BY id DESC
ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW), rn) AS lo,
coalesce(first_value(hi) IGNORE NULLS OVER (
ORDER BY id DESC
ROWS BETWEEN CURRENT ROW
AND UNBOUNDED FOLLOWING), rn) AS hi
FROM trx

Finally, we’re just doing a trivial last step, keeping in mind off-by-1 errors:

WITH
trx1(id, amount, sign, rn) AS (
SELECT id, amount, sign(amount), row_number() OVER (ORDER BY id DESC)
FROM trx
),
trx2(id, amount, sign, rn, lo, hi) AS (
SELECT trx1.*,
CASE WHEN coalesce(lag(sign) OVER (ORDER BY id DESC), 0) != sign
THEN rn END,
CASE WHEN coalesce(lead(sign) OVER (ORDER BY id DESC), 0) != sign
THEN rn END
FROM trx1
)
SELECT
trx2.*, 1
- last_value (lo) IGNORE NULLS OVER (ORDER BY id DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
+ first_value(hi) IGNORE NULLS OVER (ORDER BY id DESC
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
FROM trx2

Huh. This SQL thing does start getting interesting!

Ready for more?

6. The subset sum problem with SQL

This is my favourite!

What is the subset sum problem? Find a fun explanation here: https://xkcd.com/287

How to do it with SQL? Easy. Just create a CTE that contains all the 2^{n}*possible* sums and then find the closest one for each TOTAL:

-- All the possible 2N sums
WITH sums(sum, max_id, calc) AS (...)
-- Find the best sum per “TOTAL”
SELECT
totals.total,
something_something(total - sum) AS best,
something_something(total - sum) AS calc
FROM draw_the_rest_of_the_*bleep*_owl

As you’re reading this, you might be like my friend here:

But don’t worry, the solution is – again – not all that hard (although it doesn’t perform due to the nature of the algorithm):

WITH sums(sum, id, calc) AS (
SELECT item, id, to_char(item) FROM items
UNION ALL
SELECT item + sum, items.id, calc || ' + ' || item
FROM sums JOIN items ON sums.id < items.id
)
SELECT
totals.id,
totals.total,
min (sum) KEEP (
DENSE_RANK FIRST ORDER BY abs(total - sum)
) AS best,
min (calc) KEEP (
DENSE_RANK FIRST ORDER BY abs(total - sum)
) AS calc,
FROM totals
CROSS JOIN sums
GROUP BY totals.id, totals.total

In this article, I won’t explain the details of this solution, because the example has been taken from a previous article that you can find here:

Enjoy reading the details, but be sure to come back here for the remaining 4 tricks:

7. Capping a Running Total

So far, we’ve seen how to calculate an “ordinary” running total with SQL using window functions. That was easy. Now, how about if we cap the running total such that it never goes below zero? Essentially, we want to calculate this:

So, when that big negative AMOUNT-20000 was subtracted, instead of displaying the real TOTAL of -15550, we simply display 0. In other words (or data sets):

Exactly. With obscure, vendor-specific SQL. In this case, we’re using Oracle SQL

How does it work? Surprisingly easy!

Just add MODEL in your SELECT, and you’re opening up a can of awesome SQL worms!

SELECT ... FROM some_table
-- Put this after any table
MODEL ...

Once we put MODEL there, we can implement spreadsheet logic directly in our SQL statements, just as with Microsoft Excel.

The following three clauses are the most useful and widely used (i.e. 1-2 per year by anyone on this planet):

MODEL
-- The spreadsheet dimensions
DIMENSION BY ...
-- The spreadsheet cell type
MEASURES ...
-- The spreadsheet formulas
RULES ...

The meaning of each of these three additional clauses is best explained with slides again.

The DIMENSION BY clause specifies the dimensions of your spreadsheet. Unlike in MS Excel, you can have any number of dimensions in Oracle:

The MEASURES clause specifies the values that are available in each cell of your spreadsheet. Unlike in MS Excel, you can have a whole tuple in each cell in Oracle, not just a single value.

The RULES clause specifies the formulas that apply to each cell in your spreadsheet. Unlike in MS Excel, these rules / formulas are centralised at a single place, instead of being put inside of each cell:

This design makes MODEL a bit harder to use than MS Excel, but much more powerful, if you dare. The whole query will then be “trivially”:

SELECT *
FROM (
SELECT date, amount, 0 AS total
FROM amounts
)
MODEL
DIMENSION BY (row_number() OVER (ORDER BY date) AS rn)
MEASURES (date, amount, total)
RULES (
total[any] = greatest(0,
coalesce(total[cv(rn) - 1], 0) + amount[cv(rn)])
)

This whole thing is so powerful, it ships with its own whitepaper by Oracle, so rather than explaining things further here in this article, please do read the excellent whitepaper:

If you’re into fraud detection or any other field that runs real time analytics on large data sets, time series pattern recognition is certainly not a new term to you.

If we review the “length of a series” data set, we might want to generate triggers on complex events over our time series as such:

Trigger on the 3^{rd} repetition of an event if the event occurs more than 3 times.

Similar to the previous MODEL clause, we can do this with an Oracle-specific clause that was added to Oracle 12c:

SELECT ... FROM some_table
-- Put this after any table to pattern-match
-- the table’s contents
MATCH_RECOGNIZE (...)

The simplest possible application of MATCH_RECOGNIZE includes the following subclauses:

SELECT *
FROM series
MATCH_RECOGNIZE (
-- Pattern matching is done in this order
ORDER BY ...
-- These are the columns produced by matches
MEASURES ...
-- A short specification of what rows are
-- returned from each match
ALL ROWS PER MATCH
-- «Regular expressions» of events to match
PATTERN (...)
-- The definitions of «what is an event»
DEFINE ...
)

That sounds crazy. Let’s look at some example clause implementations

SELECT *
FROM series
MATCH_RECOGNIZE (
ORDER BY id
MEASURES classifier() AS trg
ALL ROWS PER MATCH
PATTERN (S (R X R+)?)
DEFINE
R AS sign(R.amount) = prev(sign(R.amount)),
X AS sign(X.amount) = prev(sign(X.amount))
)

What do we do here?

We order the table by ID, which is the order in which we want to match events. Easy.

We then specify the values that we want as a result. We want the “MEASURE” trg, which is defined as the classifier, i.e. the literal that we’ll use in the PATTERN afterwards. Plus we want all the rows from a match.

We then specify a regular expression-like pattern. The pattern is an event “S” for Start, followed optionally by “R” for Repeat, “X” for our special event X, followed by one or more “R” for repeat again. If the whole pattern matches, we get SRXR or SRXRR or SRXRRR, i.e. X will be at the third position of a series of length >= 4

Finally, we define R and X as being the same thing: The event when SIGN(AMOUNT) of the current row is the same as SIGN(AMOUNT) of the previous row. We don’t have to define “S”. “S” is just any other row.

This query will magically produce the following output:

| ID | VALUE_DATE | AMOUNT | TRG |
|------|------------|---------|-----|
| 9997 | 2014-03-18 | + 99.17 | S |
| 9981 | 2014-03-16 | - 71.44 | R |
| 9979 | 2014-03-16 | - 94.60 | X |
| 9977 | 2014-03-16 | - 6.96 | R |
| 9971 | 2014-03-15 | - 65.95 | S |
| 9964 | 2014-03-15 | + 15.13 | S |
| 9962 | 2014-03-15 | + 17.47 | S |
| 9960 | 2014-03-15 | + 3.55 | S |
| 9959 | 2014-03-14 | - 32.00 | S |

We can see a single “X” in our event stream. Exactly where we had expected it. At the 3rd repetition of an event (same sign) in a series of length > 3.

Boom!

As we don’t really care about “S” and “R” events, let’s just remove them as such:

SELECT
id, value_date, amount,
CASE trg WHEN 'X' THEN 'X' END trg
FROM series
MATCH_RECOGNIZE (
ORDER BY id
MEASURES classifier() AS trg
ALL ROWS PER MATCH
PATTERN (S (R X R+)?)
DEFINE
R AS sign(R.amount) = prev(sign(R.amount)),
X AS sign(X.amount) = prev(sign(X.amount))
)

Again, don’t expect me to explain this any better than the excellent Oracle whitepaper already did, which I strongly recommend reading if you’re using Oracle 12c anyway:

If you’ve read this far, the following will be almost too embarassingly simple:

This is our data, i.e. actors, film titles, and film ratings:

| NAME | TITLE | RATING |
|-----------|-----------------|--------|
| A. GRANT | ANNIE IDENTITY | G |
| A. GRANT | DISCIPLE MOTHER | PG |
| A. GRANT | GLORY TRACY | PG-13 |
| A. HUDSON | LEGEND JEDI | PG |
| A. CRONYN | IRON MOON | PG |
| A. CRONYN | LADY STAGE | PG |
| B. WALKEN | SIEGE MADRE | R |

Observe how we kinda grouped by the actors and then “pivoted” the number films per rating each actor played in. Instead of displaying this in a “relational” way, (i.e. each group is a row) we pivoted the whole thing to produce a column per group. We can do this, because we know all the possible groups in advance.

Unpivoting is the opposite, when from the above, we want to get back to the “row per group” representation:

| NAME | RATING | COUNT |
|-----------|--------|-------|
| A. GRANT | NC-17 | 3 |
| A. GRANT | PG | 6 |
| A. GRANT | G | 5 |
| A. GRANT | PG-13 | 3 |
| A. GRANT | R | 6 |
| A. HUDSON | NC-17 | 12 |
| A. HUDSON | PG | 4 |

It’s actually really easy. This is how we’d do it in PostgreSQL:

SELECT
first_name, last_name,
count(*) FILTER (WHERE rating = 'NC-17') AS "NC-17",
count(*) FILTER (WHERE rating = 'PG' ) AS "PG",
count(*) FILTER (WHERE rating = 'G' ) AS "G",
count(*) FILTER (WHERE rating = 'PG-13') AS "PG-13",
count(*) FILTER (WHERE rating = 'R' ) AS "R"
FROM actor AS a
JOIN film_actor AS fa USING (actor_id)
JOIN film AS f USING (film_id)
GROUP BY actor_id

We can append a simple FILTER clause to an aggregate function in order to count only some of the data.

In all other databases, we’d do it like this:

SELECT
first_name, last_name,
count(CASE rating WHEN 'NC-17' THEN 1 END) AS "NC-17",
count(CASE rating WHEN 'PG' THEN 1 END) AS "PG",
count(CASE rating WHEN 'G' THEN 1 END) AS "G",
count(CASE rating WHEN 'PG-13' THEN 1 END) AS "PG-13",
count(CASE rating WHEN 'R' THEN 1 END) AS "R"
FROM actor AS a
JOIN film_actor AS fa USING (actor_id)
JOIN film AS f USING (film_id)
GROUP BY actor_id

The nice thing here is that aggregate functions usually only consider non-NULL values, so if we make all the values NULL that are not interesting per aggregation, we’ll get the same result.

Now, if you’re using either SQL Server, or Oracle, you can use the built-in PIVOT or UNPIVOT clauses instead. Again, as with MATCH_RECOGNIZE, just append this new keyword after a table and get the same result:

-- PIVOTING
SELECT something, something
FROM some_table
PIVOT (
count(*) FOR rating IN (
'NC-17' AS "NC-17",
'PG' AS "PG",
'G' AS "G",
'PG-13' AS "PG-13",
'R' AS "R"
)
)
-- UNPIVOTING
SELECT something, something
FROM some_table
UNPIVOT (
count FOR rating IN (
"NC-17" AS 'NC-17',
"PG" AS 'PG',
"G" AS 'G',
"PG-13" AS 'PG-13',
"R" AS 'R'
)
)

Easy. Next.

10. Abusing XML and JSON

First off

JSON is just XML with less features and less syntax

Now, everyone knows that XML is awesome. The corollary is thus:

JSON is less awesome

Don’t use JSON.

Now that we’ve settled this, we can safely ignore the ongoing JSON-in-the-database-hype (which most of you will regret in five years anyway), and move on to the final example. How to do XML in the database.

This is what we want to do:

Given the original XML document, we want to parse that document, unnest the comma-separated list of films per actor, and produce a denormalised representation of actors/films in a single relation.

Ready. Set. Go. This is the idea. We have three CTE:

WITH RECURSIVE
x(v) AS (SELECT '...'::xml),
actors(
actor_id, first_name, last_name, films
) AS (...),
films(
actor_id, first_name, last_name,
film_id, film
) AS (...)
SELECT *
FROM films

In the first one, we simply parse the XML. Here with PostgreSQL:

WITH RECURSIVE
x(v) AS (SELECT '
<actors>
<actor>
<first-name>Bud</first-name>
<last-name>Spencer</last-name>
<films>God Forgives... I Don’t, Double Trouble, They Call Him Bulldozer</films>
</actor>
<actor>
<first-name>Terence</first-name>
<last-name>Hill</last-name>
<films>God Forgives... I Don’t, Double Trouble, Lucky Luke</films>
</actor>
</actors>'::xml),
actors(actor_id, first_name, last_name, films) AS (...),
films(actor_id, first_name, last_name, film_id, film) AS (...)
SELECT *
FROM films

Easy.

Then, we do some XPath magic to extract the individual values from the XML structure and put those into columns:

WITH RECURSIVE
x(v) AS (SELECT '...'::xml),
actors(actor_id, first_name, last_name, films) AS (
SELECT
row_number() OVER (),
(xpath('//first-name/text()', t.v))[1]::TEXT,
(xpath('//last-name/text()' , t.v))[1]::TEXT,
(xpath('//films/text()' , t.v))[1]::TEXT
FROM unnest(xpath('//actor', (SELECT v FROM x))) t(v)
),
films(actor_id, first_name, last_name, film_id, film) AS (...)
SELECT *
FROM films

Still easy.

Finally, just a bit of recursive regular expression pattern matching magic, and we’re done!

WITH RECURSIVE
x(v) AS (SELECT '...'::xml),
actors(actor_id, first_name, last_name, films) AS (...),
films(actor_id, first_name, last_name, film_id, film) AS (
SELECT actor_id, first_name, last_name, 1,
regexp_replace(films, ',.+', '')
FROM actors
UNION ALL
SELECT actor_id, a.first_name, a.last_name, f.film_id + 1,
regexp_replace(a.films, '.*' || f.film || ', ?(.*?)(,.+)?', '\1')
FROM films AS f
JOIN actors AS a USING (actor_id)
WHERE a.films NOT LIKE '%' || f.film
)
SELECT *
FROM films

Let’s conclude:

Conclusion

All of what this article has shown was declarative. And relatively easy. Of course, for the fun effect that I’m trying to achieve in this talk, some exaggerated SQL was taken and I expressly called everything “easy”. It’s not at all easy, you have to practice SQL. Like many other languages, but a bit harder because:

The syntax is a bit awkward from time to time

Declarative thinking is not easy. At least, it’s very different

But once you get a hang of it, declarative programming with SQL is totally worth it as you can express complex relationships between your data in very very little code by just describing the result you want to get from the database.

Isn’t that awesome?

And if that was a bit over the top, do note that I’m happy to visit your JUG / conference to give this talk (just contact us), or if you want to get really down into the details of these things, we also offer this talk as a public or in-house workshop. Do get in touch! We’re looking forward.

In this article, we’ll compare the user’s imperative approach to the extremely elegant (Oracle) SQL approach. We’ll be making use of any combination of these awesome SQL features:

The user alhashmiya who had asked this question, was looking for a solution to the problem of finding the “closest” sum of elements in a subset of numbers A to a set of “expected” sums B. More concretely, alhasmiya had the following two tables:

ID ASSIGN_AMT
--------------
1 25150
2 19800
3 27511

The ASSIGN_AMT value is the “expected” sum. What alhashmiya was looking for is the sum of a subset of WORK_AMT values A, such that this sum is as close as possible to any of the “expected” sums. There are two ways to understand this problem:

The possible “closest” sums are restricted to be the sums obtained in a strictly defined order, e.g. ordered by ID. An application of this understanding is to find out the exact moment when a well-defined, ordered running total (e.g. bank account balance) has exceeded a certain threshold

The possible “closest” sums are unrestricted. Any unordered subset qualifies to calculate such a sum. An application of this understanding is to find a combination of discrete values to reach a target value as closely as possible, e.g. to optimise a trade.

The second understanding is called the “subset sum problem”, for which there are only exponential algorithms if you’re looking for an exact solution. It is important to notice that this algorithm will NOT scale well at all, regardless of the solution technique!

But let’s look at the simpler understanding first:

1. Calculating a sum of an ordered subset of values

By strictly ordered we mean (in the sense of the question) that we want to order all WORK_AMT values, e.g. by ID, and allow only for sums that can appear in this particular order. I.e.

WITH
VALS (ID, WORK_AMT) AS (
SELECT 1 , 7120 FROM DUAL
UNION ALL SELECT 2 , 8150 FROM DUAL
UNION ALL SELECT 3 , 8255 FROM DUAL
UNION ALL SELECT 4 , 9051 FROM DUAL
UNION ALL SELECT 5 , 1220 FROM DUAL
UNION ALL SELECT 6 , 12515 FROM DUAL
UNION ALL SELECT 7 , 13555 FROM DUAL
UNION ALL SELECT 8 , 5221 FROM DUAL
UNION ALL SELECT 9 , 812 FROM DUAL
UNION ALL SELECT 10, 6562 FROM DUAL
)
SELECT
ID,
WORK_AMT,
SUM (WORK_AMT) OVER (ORDER BY ID) AS SUBSET_SUM
FROM
VALS
ORDER BY
ID

The above window function calculates the sum of all WORK_AMT values that are in the “window” of values where the ID is smaller or equal to the current ID.

Finding the “closest” of these sums with quantified comparison predicates

Now, the task at hand is to find for each value ASSIGN_AMT in 25150, 19800, and 27511 the closest value of SUBSET_SUM. In a way, what we are trying to do is we want to minimise the expression ABS(SUBSET_SUM - ASSIGN_AMT).

However, the MIN() aggregate function won’t help us here, because that will simply return the minimum value of this difference. We want the value of SUBSET_SUM that produces this difference in the first place.

One solution would be to use a quantified comparison predicate, a rarely used and not well-known comparison operator that works in all SQL databases:

-- The common table expressions are the same as
-- in the previous examples
WITH
ASSIGN(ID, ASSIGN_AMT) AS (
SELECT 1, 25150 FROM DUAL
UNION ALL SELECT 2, 19800 FROM DUAL
UNION ALL SELECT 3, 27511 FROM DUAL
),
VALS (ID, WORK_AMT) AS (
SELECT 1 , 7120 FROM DUAL
UNION ALL SELECT 2 , 8150 FROM DUAL
UNION ALL SELECT 3 , 8255 FROM DUAL
UNION ALL SELECT 4 , 9051 FROM DUAL
UNION ALL SELECT 5 , 1220 FROM DUAL
UNION ALL SELECT 6 , 12515 FROM DUAL
UNION ALL SELECT 7 , 13555 FROM DUAL
UNION ALL SELECT 8 , 5221 FROM DUAL
UNION ALL SELECT 9 , 812 FROM DUAL
UNION ALL SELECT 10, 6562 FROM DUAL
),
SUMS (ID, WORK_AMT, SUBSET_SUM) AS (
SELECT
VALS.*,
SUM (WORK_AMT) OVER (ORDER BY ID)
FROM
VALS
)
-- This is now the interesting part, where we
-- calculate the closest sum
SELECT
ASSIGN.ID,
ASSIGN.ASSIGN_AMT,
SUBSET_SUM
FROM
ASSIGN
JOIN
SUMS
ON
ABS (ASSIGN_AMT - SUBSET_SUM) <= ALL (
SELECT
ABS (ASSIGN_AMT - SUBSET_SUM)
FROM
SUMS
)

The quantified comparison predicate reads intuitively. We’re looking for that specific SUBSET_SUM whose difference to the “expected” ASSIGN_AMT is smaller or equal to all the other possible differences.

You may have noticed that the solution is not entirely correct in the event where ASSIGN_AMT is allowed to be zero (let’s ignore the possibility of negative values) – in case of which we’ll produce duplicate values in the JOIN. This can be achieved when replacing:

One solution is to remove those duplicates using DISTINCT (which is an anti-pattern. See #6 in this list). A better solution is to make the JOIN predicate unambiguous by comparing ID values as well, i.e.:

SELECT
ASSIGN.ID,
ASSIGN.ASSIGN_AMT,
SUBSET_SUM
FROM
ASSIGN
JOIN
SUMS
ON
(ABS (ASSIGN_AMT - SUBSET_SUM), SUMS.ID) <= ALL (
SELECT
ABS (ASSIGN_AMT - SUBSET_SUM),
ID
FROM
SUMS
)

The above unfortunately doesn’t work in Oracle (yet), which will report an error:

ORA-01796: this operator cannot be used with lists

Finding the “closest” of these sums with Oracle’s FIRST function

Oracle has a very interesting function to keep the first (or last) values in a set of aggregate values of a group given any particular ordering, and calculating an aggregate function only on those values within the group. The following SQL statement will illustrate this:

SELECT
ASSIGN.ID,
ASSIGN.ASSIGN_AMT,
MIN (SUBSET_SUM) KEEP (
DENSE_RANK FIRST
ORDER BY ABS (ASSIGN_AMT - SUBSET_SUM)
) AS CLOSEST_SUM
FROM
ASSIGN
CROSS JOIN
SUMS
GROUP BY
ASSIGN.ID, ASSIGN.ASSIGN_AMT

Essentially, we’re grouping all values from the SUMS table for each ASSIGN_AMT. For each of those groups, we’ll look for the "FIRST" row that appears when ordering rows within the group by our criteria ABS(ASSIGN_AMT - SUBSET_SUM). We "KEEP" only those rows in the group, and retain from those rows the minimum SUBSET_SUM.

SELECT DISTINCT
ASSIGN.ID,
ASSIGN.ASSIGN_AMT,
FIRST_VALUE (SUBSET_SUM) OVER (
PARTITION BY ASSIGN.ID
ORDER BY ABS (ASSIGN_AMT - SUBSET_SUM)
) AS CLOSEST_SUM
FROM
ASSIGN
CROSS JOIN
SUMS

Unfortunately, these solutions all produce duplicates, which we have to remove either via GROUP BY (KEEP solution), or via DISTINCT (FIRST_VALUE solution).

Finding the “closest” of these sums with LATERAL JOIN

A cleaner solution that doesn’t rely on the removal of duplicates is using Oracle 12c’s new FETCH FIRST clause along with CROSS JOIN LATERAL (or CROSS APPLY, which is the same)

SELECT
ASSIGN.ID,
ASSIGN.ASSIGN_AMT,
CLOSEST_SUM
FROM
ASSIGN
CROSS JOIN LATERAL (
SELECT
SUBSET_SUM AS CLOSEST_SUM
FROM
SUMS
ORDER BY
ABS (ASSIGN.ASSIGN_AMT - SUBSET_SUM)
FETCH FIRST 1 ROW ONLY
) SUMS

What does this mean? We’re essentially joining for each value in ASSIGN only the FIRST 1 ROW in SUMS, ordered by the usual criteria. We need LATERAL (or APPLY), because this allows us to reference columns from the left side of the JOIN expression also in the right side, which wouldn’t be possible otherwise.

The same functionality is supported in SQL Server (only using CROSS APPLY), or in PostgreSQL (only using CROSS JOIN LATERAL).

Lateral can be very useful whenever the right hand side of a JOIN depends on the left hand side. Unlike ordinary joins, this means that the JOIN order will be set in stone from left to right, and the optimiser has a reduced set of join algorithm options. This is useful in examples like these (with ORDER BY and FETCH FIRST), or when joining unnested table-valued functions, which we’ll cover in a future blog post.

2. Calculating a sum of any subset of values

So far, we’ve worked on a simplified version of the problem. This is probably not what alhashmiya meant on their Stack Overflow question. They probably wanted to solve the Subset sum problem, finding the “closest” sum of any subset of WORK_AMT values.

We’ll use recursive SQL to calculate all the possible sums. First off, let’s remember how recursive SQL works:

In recursive SQL, we need a UNION ALL query in a common table expression (WITH clause in Oracle, or WITH RECURSIVE clause in PostgreSQL). The first subquery of UNION ALL generates the “seed row(s)” of the recursion, and the second subqeury of UNION ALL generates the recursion based on a SELECT that selects from the table being declared, recursively.

So, a naive solution to this subset sum problem can be seen here:

-- Repetition of the previous data
WITH
ASSIGN (ID, ASSIGN_AMT) AS (
SELECT 1, 25150 FROM DUAL
UNION ALL SELECT 2, 19800 FROM DUAL
UNION ALL SELECT 3, 27511 FROM DUAL
),
WORK (ID, WORK_AMT) AS (
SELECT 1 , 7120 FROM DUAL
UNION ALL SELECT 2 , 8150 FROM DUAL
UNION ALL SELECT 3 , 8255 FROM DUAL
UNION ALL SELECT 4 , 9051 FROM DUAL
UNION ALL SELECT 5 , 1220 FROM DUAL
UNION ALL SELECT 6 , 12515 FROM DUAL
UNION ALL SELECT 7 , 13555 FROM DUAL
UNION ALL SELECT 8 , 5221 FROM DUAL
UNION ALL SELECT 9 , 812 FROM DUAL
UNION ALL SELECT 10, 6562 FROM DUAL
),
-- A new way of calculating all possible sums by
-- Recursively adding up all the sums
SUMS (SUBSET_SUM, MAX_ID) AS (
SELECT
WORK_AMT,
ID
FROM
WORK
UNION ALL
SELECT
WORK_AMT + SUBSET_SUM,
WORK.ID
FROM
SUMS
JOIN
WORK
ON
SUMS.MAX_ID < WORK.ID
)
-- The same technique to match the "closest" sum
-- As before
SELECT
ASSIGN.ID,
ASSIGN.ASSIGN_AMT,
MIN (SUBSET_SUM) KEEP (
DENSE_RANK FIRST
ORDER BY ABS (ASSIGN_AMT - SUBSET_SUM)
) AS CLOSEST_SUM
FROM
SUMS
CROSS JOIN
ASSIGN
GROUP BY
ASSIGN.ID, ASSIGN.ASSIGN_AMT

The recursion is simple. In the first subquery of the recursion (the “seed row”), we select each individual row in WORK:

SELECT
WORK_AMT,
ID
FROM
WORK

In the second subquery of the recursion (the “recusion rows”), we join the value of the previous recursion step (SUMS) with all the remaining values (WORK), i.e. all the values that have a higher ID:

SELECT
WORK_AMT + SUBSET_SUM,
WORK.ID
FROM
SUMS
JOIN
WORK
ON
SUMS.MAX_ID < WORK.ID

In this combination, we calculate the intermediate sum (which is also a running total, by the way) and we calculate the highest summed-up ID thus far, to reduce the number of combinations. The latter, we can do because summing is commutative.

The main difference in this solution compared to previous approaches is the fact that we’re now generating a lot (a huge lot) more different values in the SUMS table.

After a still acceptable 0.112s for 10 different WORK_AMT values, the database calculated:

But beware, as soon as you start adding values to the VALS table, this algorithm starts exploding in time and space complexity. Running the same query with the following, only slightly bigger WORK table already requires 16.3 seconds to yield a result:

WORK(ID, WORK_AMT) AS (
SELECT 1 , 7120 FROM DUAL
UNION ALL SELECT 2 , 8150 FROM DUAL
UNION ALL SELECT 3 , 8255 FROM DUAL
UNION ALL SELECT 4 , 9051 FROM DUAL
UNION ALL SELECT 5 , 1220 FROM DUAL
UNION ALL SELECT 6 , 12515 FROM DUAL
UNION ALL SELECT 7 , 13555 FROM DUAL
UNION ALL SELECT 8 , 5221 FROM DUAL
UNION ALL SELECT 9 , 812 FROM DUAL
UNION ALL SELECT 10, 6562 FROM DUAL
UNION ALL SELECT 11, 1234 FROM DUAL
UNION ALL SELECT 12, 61 FROM DUAL
UNION ALL SELECT 13, 616 FROM DUAL
UNION ALL SELECT 14, 2456 FROM DUAL
UNION ALL SELECT 15, 5161 FROM DUAL
UNION ALL SELECT 16, 414 FROM DUAL
UNION ALL SELECT 17, 516 FROM DUAL
UNION ALL SELECT 18, 617 FROM DUAL
UNION ALL SELECT 19, 146 FROM DUAL
),

Want proof about the actual sum? That’s easy as well with recursive SQL:

-- Repetition of the previous data
WITH
ASSIGN (ID, ASSIGN_AMT) AS (
SELECT 1, 25150 FROM DUAL
UNION ALL SELECT 2, 19800 FROM DUAL
UNION ALL SELECT 3, 27511 FROM DUAL
),
WORK (ID, WORK_AMT) AS (
SELECT 1 , 7120 FROM DUAL
UNION ALL SELECT 2 , 8150 FROM DUAL
UNION ALL SELECT 3 , 8255 FROM DUAL
UNION ALL SELECT 4 , 9051 FROM DUAL
UNION ALL SELECT 5 , 1220 FROM DUAL
UNION ALL SELECT 6 , 12515 FROM DUAL
UNION ALL SELECT 7 , 13555 FROM DUAL
UNION ALL SELECT 8 , 5221 FROM DUAL
UNION ALL SELECT 9 , 812 FROM DUAL
UNION ALL SELECT 10, 6562 FROM DUAL
),
-- A new way of calculating all possible sums by
-- Recursively adding up all the sums
SUMS (SUBSET_SUM, MAX_ID, CALC) AS (
SELECT
WORK_AMT,
ID,
TO_CHAR(WORK_AMT)
FROM WORK
UNION ALL
SELECT
WORK_AMT + SUBSET_SUM,
WORK.ID,
CALC || '+' || WORK_AMT
FROM
SUMS
JOIN
WORK
ON
SUMS.MAX_ID < WORK.ID
)
-- The same technique to match the "closest" sum
-- As before
SELECT
ASSIGN.ID,
ASSIGN.ASSIGN_AMT,
MIN (SUBSET_SUM) KEEP (
DENSE_RANK FIRST
ORDER BY ABS (ASSIGN_AMT - SUBSET_SUM)
) AS CLOSEST_SUM,
MIN (CALC) KEEP (
DENSE_RANK FIRST
ORDER BY ABS (ASSIGN_AMT - SUBSET_SUM)
) AS CALCULATION
FROM
SUMS
CROSS JOIN
ASSIGN
GROUP BY
ASSIGN.ID, ASSIGN.ASSIGN_AMT

SQL is powerful. Extremely powerful. In this article, we’ve used various techniques to calculate the subset sum problem, or a simplification thereof. We’ve shown how to solve this problem in Oracle or PostgreSQL using a combination of these awesome SQL features:

Oracle SYNONYMs are a great feature. You can implement all sorts of backwards-compatibility tweaks simply by creating SYNONYMs in your database. Consider the following schema:

CREATE TABLE my_table (col NUMBER(7));
CREATE SYNONYM my_table_old FOR my_table;
CREATE SYNONYM my_table_bak FOR my_table_old;

Now you can query your same old table through three different names, it’ll all result in the same output:

SELECT * FROM my_table;
-- Same thing:
SELECT * FROM my_table_old;
SELECT * FROM my_table_bak;

The trouble is, when you see my_table_bak in code (or some even more obfuscated name), do you immediately know what it really is?

Use this query to find out

We can use the ALL_SYNONYMS table to figure this one out. This query will already give a simple overview:

SELECT *
FROM ALL_SYNONYMS
WHERE TABLE_OWNER = 'PLAYGROUND'

But as you can see, this is boring, because we have transitive synonyms in there and I don’t want to go through the complete table to figure out that MY_TABLE_BAK -> MY_TABLE_OLD -> MY_TABLE.

But let’s see how we can transitively resolve our tables. Here’s how:

SELECT
s.OWNER,
s.SYNONYM_NAME,
-- Get to the root of the hierarchy
CONNECT_BY_ROOT s.TABLE_OWNER TABLE_OWNER,
CONNECT_BY_ROOT s.TABLE_NAME TABLE_NAME
FROM ALL_SYNONYMS s
WHERE s.TABLE_OWNER = 'PLAYGROUND'
-- The magic CONNECT BY clause!
CONNECT BY s.TABLE_OWNER = PRIOR s.OWNER
AND s.TABLE_NAME = PRIOR s.SYNONYM_NAME

First off, there is CONNECT BY, which allows to “connect” hierarchies by their hierarchical predecessors. On each level of the hierarchy, we’ll connect the TABLE_NAME with its previous (“PRIOR”) SYNONYM_NAME. This will recurse as long as the chain doesn’t end (or if it runs into a cycle).

What’s also interesting is the CONNECT_BY_ROOT keyword, which, for each path through the hierarchy, displays the root of the path. In our case, that’s the target TABLE_NAME.

As you can see, some of the results are now synonyms pointing directly to the target table, whereas the last record still points to an intermediate element from the synonym path. This is because we’re recursing into the path hierarchies from every record in the table, also from the “intermediate” synonym references, whose TABLE_NAME is yet another synonym.

Let’s get rid of those as well, using the optional START WITH clause, which allows to limit tree traversals to those trees whose roots fulfil a given predicate:

SELECT
s.OWNER,
s.SYNONYM_NAME,
CONNECT_BY_ROOT s.TABLE_OWNER TABLE_OWNER,
CONNECT_BY_ROOT s.TABLE_NAME TABLE_NAME
FROM ALL_SYNONYMS s
WHERE s.TABLE_OWNER = 'PLAYGROUND'
CONNECT BY s.TABLE_OWNER = PRIOR s.OWNER
AND s.TABLE_NAME = PRIOR s.SYNONYM_NAME
-- Start recursing only from non-synonym objects
START WITH EXISTS (
SELECT 1
FROM ALL_OBJECTS
WHERE s.TABLE_OWNER = ALL_OBJECTS.OWNER
AND s.TABLE_NAME = ALL_OBJECTS.OBJECT_NAME
AND ALL_OBJECTS.OWNER = 'PLAYGROUND'
AND ALL_OBJECTS.OBJECT_TYPE <> 'SYNONYM'
)

So, essentially, we’re requiring the TABLE_NAME to be any object from ALL_OBJECTS that is in our schema, but not a SYNONYM. (yes, synonyms work for all objects, including procedures, packages, types, etc.)

Running the above query gets us the desired result:

Most often, you will not use local synonyms, though, but PUBLIC ones. Oracle has this quirky PUBLIC pseudo-schema, in which you cannot create objects, but in which you can create synonyms. So, let’s create some more synonyms for backwards-compatibility purposes:

CREATE PUBLIC SYNONYM my_table_bak2 FOR my_table_bak;
CREATE SYNONYM bak_backup_old FOR my_table_bak2;

Unfortunately, this will break our chain, because for some reason only Oracle and the Oracle of Delphi knows, PUBLIC is well reported as a OWNER of the synonym, but not as the TABLE_OWNER. Let’s see some raw data with:

SELECT *
FROM ALL_SYNONYMS
WHERE TABLE_OWNER = 'PLAYGROUND'

… and thus:

OWNER SYNONYM_NAME TABLE_OWNER TABLE_NAME
------------------------------------------------------
PLAYGROUND MY_TABLE_OLD PLAYGROUND MY_TABLE
PLAYGROUND MY_TABLE_BAK PLAYGROUND MY_TABLE_OLD
PUBLIC MY_TABLE_BAK2 PLAYGROUND MY_TABLE_BAK
PLAYGROUND BAK_BACKUP_OLD PLAYGROUND MY_TABLE_BAK2 <-- Not PUBLIC

As you can see, the PUBLIC SYNONYM MY_TABLE_BAK2 is reported to be in the PLAYGROUND schema! This breaks recursion, of course. We’re missing a record:

In order to work around this issue, we’ll have to tweak our original data set. Any object reported as (TABLE_OWNER, TABLE_NAME) might in fact be a synonym called ('PUBLIC', TABLE_NAME). The trick is thus to simply duplicate all input data as such:

SELECT
s.OWNER,
s.SYNONYM_NAME,
CONNECT_BY_ROOT s.TABLE_OWNER TABLE_OWNER,
CONNECT_BY_ROOT s.TABLE_NAME TABLE_NAME
-- Tweaked data set
FROM (
SELECT OWNER, SYNONYM_NAME, TABLE_OWNER, TABLE_NAME
FROM ALL_SYNONYMS
UNION ALL
SELECT OWNER, SYNONYM_NAME, 'PUBLIC', TABLE_NAME
FROM ALL_SYNONYMS
) s
-- Add the synthetic PUBLIC TABLE_OWNER as well
WHERE s.TABLE_OWNER IN (
'PLAYGROUND', 'PUBLIC'
)
CONNECT BY s.TABLE_OWNER = PRIOR s.OWNER
AND s.TABLE_NAME = PRIOR s.SYNONYM_NAME
START WITH EXISTS (
SELECT 1
FROM ALL_OBJECTS
WHERE s.TABLE_OWNER = ALL_OBJECTS.OWNER
AND s.TABLE_NAME = ALL_OBJECTS.OBJECT_NAME
AND ALL_OBJECTS.OWNER = 'PLAYGROUND'
AND ALL_OBJECTS.OBJECT_TYPE <> 'SYNONYM'
)

There is also a quirky function called SYS_CONNECT_BY_PATH, which can be used to actually display the whole hierarchy in a string form (VARCHAR2, with max 4000 characters!). Here’s how:

SELECT
-- Magic function
SUBSTR(
sys_connect_by_path(
s.TABLE_OWNER
|| '.'
|| s.TABLE_NAME, ' <- '
) || ' <- '
|| s.OWNER
|| '.'
|| s.SYNONYM_NAME, 5
)
FROM (
SELECT OWNER, SYNONYM_NAME, TABLE_OWNER, TABLE_NAME
FROM ALL_SYNONYMS
UNION ALL
SELECT OWNER, SYNONYM_NAME, 'PUBLIC', TABLE_NAME
FROM ALL_SYNONYMS
) s
WHERE s.TABLE_OWNER IN (
'PLAYGROUND', 'PUBLIC'
)
CONNECT BY s.TABLE_OWNER = PRIOR s.OWNER
AND s.TABLE_NAME = PRIOR s.SYNONYM_NAME
START WITH EXISTS (
SELECT 1
FROM ALL_OBJECTS
WHERE s.TABLE_OWNER = ALL_OBJECTS.OWNER
AND s.TABLE_NAME = ALL_OBJECTS.OBJECT_NAME
AND ALL_OBJECTS.OWNER = 'PLAYGROUND'
AND ALL_OBJECTS.OBJECT_TYPE <> 'SYNONYM'
)

The above query will now output the following records:

If you have “stale” synonyms, i.e. synonyms that point to nowhere, Oracle may report them to be pointing to themselves. That’s unfortunate and creates a CYCLE in CONNECT BY. To prevent this from happening, simply add another predicate like so:

SELECT
SUBSTR(
sys_connect_by_path(
s.TABLE_OWNER
|| '.'
|| s.TABLE_NAME, ' <- '
) || ' <- '
|| s.OWNER
|| '.'
|| s.SYNONYM_NAME, 5
)
FROM (
SELECT * FROM (
SELECT OWNER, SYNONYM_NAME, TABLE_OWNER, TABLE_NAME
FROM ALL_SYNONYMS
UNION ALL
SELECT OWNER, SYNONYM_NAME, 'PUBLIC', TABLE_NAME
FROM ALL_SYNONYMS
) s
-- Add this predicate to prevent cycles
WHERE (s.OWNER , s.SYNONYM_NAME)
!= ((s.TABLE_OWNER , s.TABLE_NAME))
) s
CONNECT BY s.TABLE_OWNER = PRIOR s.OWNER
AND s.TABLE_NAME = PRIOR s.SYNONYM_NAME
START WITH EXISTS (
SELECT 1
FROM ALL_OBJECTS
WHERE s.TABLE_OWNER = ALL_OBJECTS.OWNER
AND s.TABLE_NAME = ALL_OBJECTS.OBJECT_NAME
AND ALL_OBJECTS.OWNER = 'PLAYGROUND'
AND ALL_OBJECTS.OBJECT_TYPE <> 'SYNONYM'
)

Can the above query be written in jOOQ?

Yes of course. In jOOQ, pretty much everything is possible, if you can write it in SQL. Here’s how we use a query similar to the above to resolve Oracle Synonmys in the jOOQ code generator:

If you have an intrinsically hierarchical data set, then you will be very unhappy with these simplistic hierarchical SQL features (also with commont table expressions). They don’t perform very well, and they’re very hard to express if hierarchies get more complex. So you may as well consider using an actual graph database like Neo4j

But every now and then, a little hierarchy may sneak into your otherwise “standard” relational data model. When it does, be sure to have this useful CONNECT BY clause ready for action.

CONNECT BY is supported by (at least):

CUBRID

Informix

Oracle

Recursive common table expressions (the SQL standard’s counterpart for CONNECT BY are supported by (at least):

DB2

Firebird

HSQLDB

Oracle

PostgreSQL

SQL Server

Sybase SQL Anywhere

and…

H2 has some experimental support

In a future post, we’re going to be looking into how to do the same thing with recursive CTE.

Recursive SQL can be awesome, although a bit hard to read in its SQL standard beauty. Let’s assume you have some aggregated data with dates and a number of events per date:

As you may have noticed, there are no records for those dates with zero events (October 06). With recursive SQL, this is rather simple to achieve.

with recursive
-- Data could also be a regular table containing
-- the actual data
data(date, count) as (
select date '2013-10-01', 2 union all
select date '2013-10-02', 1 union all
select date '2013-10-03', 3 union all
select date '2013-10-04', 4 union all
select date '2013-10-05', 2 union all
select date '2013-10-06', 0 union all
select date '2013-10-07', 2
),
-- This is the recursive common table expression
-- It starts with all data where count > 0
-- ... and then recurses by subtracting one
recurse(date, count) as (
select date, count
from data
where count > 0
union all
select date, count - 1
from recurse
where count > 1
)
select date, count event_number from recurse
order by date asc, event_number asc;

See also this SQLFiddle to see the above CTE in action.

Incredibly, Oracle’s CONNECT BY clause doesn’t seem to be an option here. I challenge you to find a better solution, though! For instance, this beautiful solution that works with PostgreSQL:

with recursive
data(date, count) as (
select date '2013-10-01', 2 union all
select date '2013-10-02', 1 union all
select date '2013-10-03', 3 union all
select date '2013-10-04', 4 union all
select date '2013-10-05', 2 union all
select date '2013-10-06', 0 union all
select date '2013-10-07', 2
)
select date, generate_series(1, count) event_number
from data
where count > 0
order by date asc, event_number asc;