## Recursive SQL for Data Normalisation

Recursive SQL can be awesome, although a bit hard to read in its SQL standard beauty. Let’s assume you have some aggregated data with dates and a number of events per date:

|                           DATE | COUNT |
|--------------------------------|-------|
| October, 01 2013 00:00:00+0000 |     2 |
| October, 02 2013 00:00:00+0000 |     1 |
| October, 03 2013 00:00:00+0000 |     3 |
| October, 04 2013 00:00:00+0000 |     4 |
| October, 05 2013 00:00:00+0000 |     2 |
| October, 06 2013 00:00:00+0000 |     0 |
| October, 07 2013 00:00:00+0000 |     2 |

Now let’s assume you want to normalise or “unaggregate” this data, generating “COUNT” records per date. The desired output is this:

|                           DATE | EVENT_NUMBER |
|--------------------------------|--------------|
| October, 01 2013 00:00:00+0000 |            1 |
| October, 01 2013 00:00:00+0000 |            2 |
| October, 02 2013 00:00:00+0000 |            1 |
| October, 03 2013 00:00:00+0000 |            1 |
| October, 03 2013 00:00:00+0000 |            2 |
| October, 03 2013 00:00:00+0000 |            3 |
| October, 04 2013 00:00:00+0000 |            1 |
| October, 04 2013 00:00:00+0000 |            2 |
| October, 04 2013 00:00:00+0000 |            3 |
| October, 04 2013 00:00:00+0000 |            4 |
| October, 05 2013 00:00:00+0000 |            1 |
| October, 05 2013 00:00:00+0000 |            2 |
| October, 07 2013 00:00:00+0000 |            1 |
| October, 07 2013 00:00:00+0000 |            2 |

As you may have noticed, there are no records for those dates with zero events (October 06). With recursive SQL, this is rather simple to achieve.

with recursive

-- Data could also be a regular table containing
-- the actual data
data(date, count) as (
select date '2013-10-01', 2 union all
select date '2013-10-02', 1 union all
select date '2013-10-03', 3 union all
select date '2013-10-04', 4 union all
select date '2013-10-05', 2 union all
select date '2013-10-06', 0 union all
select date '2013-10-07', 2
),

-- This is the recursive common table expression
-- It starts with all data where count > 0
-- ... and then recurses by subtracting one
recurse(date, count) as (
select date, count
from data
where count > 0
union all
select date, count - 1
from recurse
where count > 1
)
select date, count event_number from recurse
order by date asc, event_number asc;

Incredibly, Oracle’s CONNECT BY clause doesn’t seem to be an option here. I challenge you to find a better solution, though! For instance, this beautiful solution that works with PostgreSQL:

with recursive
data(date, count) as (
select date '2013-10-01', 2 union all
select date '2013-10-02', 1 union all
select date '2013-10-03', 3 union all
select date '2013-10-04', 4 union all
select date '2013-10-05', 2 union all
select date '2013-10-06', 0 union all
select date '2013-10-07', 2
)
select date, generate_series(1, count) event_number
from data
where count > 0
order by date asc, event_number asc;

## Emulating the SQL standard derived column list

Derived column lists are a fine feature, if your database supports them. The SQL:2008 standard specifies

7.6 <table reference>

<table reference> ::=
<table factor>
| <joined table>

<table factor> ::=
<table primary> [ <sample clause> ]

<table primary> ::=
<table or query name> [ [ AS ] <correlation name>
[ <left paren> <derived column list> <right paren> ] ]
| <derived table> [ AS ] <correlation name>
[ <left paren> <derived column list> <right paren> ]
| [...]

When you put this in action in an actual query, the above becomes (in Postgres SQL):

SELECT t.a, t.b
FROM (
SELECT 1, 2
) t(a, b)

In words: You can rename a derived table AND its columns in a single step, by supplying a <derived column list> to your <correlation name> (also known as table alias). The main advantage of this syntax is the fact that you only need to know the degree of your derived table, not the concrete (possibly auto-generated) names of its columns. This is particularly useful when renaming columns from an unnested table or from a table or array function:

SELECT t.a, t.b
FROM unnest(my_table_function()) t(a, b)

### Emulating derived column lists

Not all databases support <derived column lists> along with a <correlation name>. But you can always emulate them. The simplest way is to use common table expressions. In Oracle, for instance, you could rewrite the above Postgres SQL statement to this:

-- Postgres
SELECT t.a, t.b
FROM (
SELECT 1, 2
) t(a, b)

-- Equivalent Oracle query
WITH t(a, b) AS (
SELECT 1, 2 FROM DUAL
)
SELECT * FROM t

-- Or a bit more verbose, if you really want to hide the
-- common table expression within the derived table
SELECT t.a, t.b
FROM (
WITH t(a, b) AS (
SELECT 1, 2 FROM DUAL
)
SELECT * FROM t
) t

Note that the CTE solution was given on this Stack Overflow question:
http://stackoverflow.com/q/14127707/521799

If your database supports neither derived column lists, nor common table expressions, you will have to push down the derived column list into the derived table and transform your nested SQL. In MySQL or H2, the above Postgres query would look like this:

-- MySQL, H2, and others
SELECT t.a, t.b
FROM (
SELECT 1 a, 2 b
) t

It looks simple, but of course you’ll have to be able to actually transform your nested select and other sorts of table references

### The solution to rule them all

If the latter isn’t an option for you, here’s one that will always work:

-- All databases
SELECT t.a, t.b
FROM (
SELECT null a, null b FROM DUAL WHERE 1 = 0
UNION ALL
SELECT 1, 2 FROM DUAL
) t

Just concatenate an empty record in front of your actual query, which imposes column names upon your UNION ALL query. Thanks go to Bill’s great answer on Stack Overflow

## Serious SQL: A “convex hull” of “correlated tables”

Now THIS is an interesting, and challenging question on the jOOQ user group:

Say you have a big database with lots of tables and foreign key references. Now you would like to know all tables that are somehow inter-connected by their respective foreign key relationship “paths”. You could call this a “convex hull” around all of your “correlated tables”. Here’s a pseudo-algorithm to achieve this:

// Initialise the hull with an "origin" table
Set tables = {"any table"};
int size = 0;

// Grow the "tables" result until no new tables are added
while (size < tables.size) {
size = tables.size;

for (table in tables) {
}
}

At the end of this algorithm, you would have all tables in the “tables” set, that are somehow connected with the original “any table”.

### Calculate this with jOOQ

With jOOQ’s generated classes, you can easily implement the above algorithm in Java. This would be an example implementation

public class Hull {
public static Set<Table<?>> hull(Table<?>... tables) {
Set<Table<?>> result =
new HashSet<Table<?>>(Arrays.asList(tables));

// Loop as long as there are no new result tables
int size = 0;
while (result.size() > size) {
size = result.size();

for (Table<?> table : new ArrayList<Table<?>>(result)) {

// Follow all outbound foreign keys
for (ForeignKey<?, ?> fk : table.getReferences()) {
}

// Follow all inbound foreign keys from tables
// within the same schema
for (Table<?> other : table.getSchema().getTables()) {
if (other.getReferencesTo(table).size() > 0) {
}
}
}
}

return result;
}

public static void main(String[] args) {
// Calculate the "convex hull" for the T_AUTHOR table
System.out.println(hull(T_AUTHOR));
}
}

### Do it with SQL

Now this still looks straightforward. But we’re SQL pro’s and we love weird queries, so let’s give Oracle SQL a shot at resolving this problem in a single SQL statement. Here goes (warning, some serious SQL ahead)!

-- "graph" denotes an undirected foreign key reference graph
-- for schema "TEST"
with graph as (
select c1.table_name t1, c2.table_name t2
from all_constraints c1
join all_constraints c2
on c1.owner = c2.r_owner
and c1.constraint_name = c2.r_constraint_name
where c1.owner = 'TEST'
union all
select c2.table_name t1, c1.table_name t2
from all_constraints c1
join all_constraints c2
on c1.owner = c2.r_owner
and c1.constraint_name = c2.r_constraint_name
where c1.owner = 'TEST'
),
-- "paths" are all directed paths within that schema
-- as a #-delimited string
paths as (
select sys_connect_by_path(t1, '#') || '#' path
from graph
connect by nocycle prior t1 = t2
),
-- "subgraph" are all those directed paths that go trough
-- a given table T_AUTHOR
subgraph as (
select distinct t.table_name,
regexp_replace(p.path, '^#(.*)#\$', '\1') path
from paths p
cross join all_tables t
where t.owner = 'TEST'
and p.path like '%#' || t.table_name || '#%'
),
-- This XML-trick splits paths and generates rows for every distinct
-- table name
split_paths as (
select distinct table_name origin,
cast(t.column_value.extract('//text()') as varchar2(4000)) table_names
from
subgraph,
table(xmlsequence(xmltype(
'<x><x>' || replace(path, '#', '</x><x>') ||
'</x></x>').extract('//x/*'))) t
),
-- "table_graphs" lists every table and its associated graph
table_graphs as (
select
origin,
count(*) graph_size,
listagg(table_names, ', ') within group (order by 1) table_names
from split_paths
group by origin
)
select
origin,
graph_size "SIZE",
dense_rank() over (order by table_names) id,
table_names
from table_graphs
order by origin

When run against the jOOQ integration test database, this beautiful query will return:

+----------------------+------+----+-----------------------------------------+
| ORIGIN               | SIZE | ID | TABLE_NAMES                             |
+----------------------+------+----+-----------------------------------------+
| T_658_11             |    7 |  3 | T_658_11, T_658_12, T_658_21, T_658_22, |
|                      |      |    | T_658_31, T_658_32, T_658_REF           |
| T_658_12             |    7 |  3 | T_658_11, T_658_12, T_658_21, T_658_22, |
|                      |      |    | T_658_31, T_658_32, T_658_REF           |
| T_658_21             |    7 |  3 | T_658_11, T_658_12, T_658_21, T_658_22, |
|                      |      |    | T_658_31, T_658_32, T_658_REF           |
| T_658_22             |    7 |  3 | T_658_11, T_658_12, T_658_21, T_658_22, |
|                      |      |    | T_658_31, T_658_32, T_658_REF           |
| T_658_31             |    7 |  3 | T_658_11, T_658_12, T_658_21, T_658_22, |
|                      |      |    | T_658_31, T_658_32, T_658_REF           |
| T_658_32             |    7 |  3 | T_658_11, T_658_12, T_658_21, T_658_22, |
|                      |      |    | T_658_31, T_658_32, T_658_REF           |
| T_658_REF            |    7 |  3 | T_658_11, T_658_12, T_658_21, T_658_22, |
|                      |      |    | T_658_31, T_658_32, T_658_REF           |
| T_AUTHOR             |    7 |  1 | T_AUTHOR, T_BOOK, T_BOOK_DETAILS,       |
|                      |      |    | T_BOOK_SALE, T_BOOK_STORE,              |
|                      |      |    | T_BOOK_TO_BOOK_STORE, T_LANGUAGE        |
| T_BOOK               |    7 |  1 | T_AUTHOR, T_BOOK, T_BOOK_DETAILS,       |
|                      |      |    | T_BOOK_SALE, T_BOOK_STORE,              |
|                      |      |    | T_BOOK_TO_BOOK_STORE, T_LANGUAGE        |
| T_BOOK_DETAILS       |    7 |  1 | T_AUTHOR, T_BOOK, T_BOOK_DETAILS,       |
|                      |      |    | T_BOOK_SALE, T_BOOK_STORE,              |
|                      |      |    | T_BOOK_TO_BOOK_STORE, T_LANGUAGE        |
| T_BOOK_STORE         |    7 |  1 | T_AUTHOR, T_BOOK, T_BOOK_DETAILS,       |
|                      |      |    | T_BOOK_SALE, T_BOOK_STORE,              |
|                      |      |    | T_BOOK_TO_BOOK_STORE, T_LANGUAGE        |
| T_BOOK_TO_BOOK_STORE |    7 |  1 | T_AUTHOR, T_BOOK, T_BOOK_DETAILS,       |
|                      |      |    | T_BOOK_SALE, T_BOOK_STORE,              |
|                      |      |    | T_BOOK_TO_BOOK_STORE, T_LANGUAGE        |
| T_DIRECTORY          |    1 |  2 | T_DIRECTORY                             |
| T_LANGUAGE           |    7 |  1 | T_AUTHOR, T_BOOK, T_BOOK_DETAILS,       |
|                      |      |    | T_BOOK_SALE, T_BOOK_STORE,              |
|                      |      |    | T_BOOK_TO_BOOK_STORE, T_LANGUAGE        |
| X_TEST_CASE_64_69    |    4 |  4 | X_TEST_CASE_64_69, X_TEST_CASE_71,      |
|                      |      |    | X_TEST_CASE_85, X_UNUSED                |
| X_TEST_CASE_71       |    4 |  4 | X_TEST_CASE_64_69, X_TEST_CASE_71,      |
|                      |      |    | X_TEST_CASE_85, X_UNUSED                |
| X_TEST_CASE_85       |    4 |  4 | X_TEST_CASE_64_69, X_TEST_CASE_71,      |
|                      |      |    | X_TEST_CASE_85, X_UNUSED                |
| X_UNUSED             |    4 |  4 | X_TEST_CASE_64_69, X_TEST_CASE_71,      |
|                      |      |    | X_TEST_CASE_85, X_UNUSED                |
+----------------------+------+----+-----------------------------------------+

### Can you beat this? :-)

I challenge you to write a shorter query and to achieve the same result! Here’s the integration test database:
https://github.com/jOOQ/jOOQ/blob/master/jOOQ-test/src/org/jooq/test/oracle/create.sql

Note that the above query is horribly inefficient. There’s a lot of potential in beating that, too!

## Recursive queries with Oracle’s CONNECT BY clause

Recursive or hierarchical queries are an awkward thing in SQL. Some RDBMS allow for recursiveness in Common Table Expressions (CTE’s), but those queries tend to be quite unreadable. That’s not the case for Oracle, which, in addition to recursive CTE’s also supports a dedicated CONNECT BY clause. The general syntax for this clause looks something like this:

--   SELECT ..
--     FROM ..
--    WHERE ..
CONNECT BY [NOCYCLE] condition [AND condition, ...]
-- GROUP BY ..

Iterative queries are very simple to express with jOOQ as well. Just use the connectBy() method as such:

// Some Oracle-specific features are only available
// from the OracleFactory
OracleFactory create = new OracleFactory(connection);

// Get a table with elements 1, 2, 3, 4, 5
create.select(create.rownum())
.connectBy(create.level().lessOrEqual(5))
.fetch();

Recursive queries are simple to express as well. Let’s say, you have a DIRECTORY table with ID, PARENT_ID, NAME columns. In order to recursively fetch all directories and calculate their absolute paths, you could issue a query like this in jOOQ. Note the usage of Oracle’s CONNECT BY, START WITH, and PRIOR keywords, as well as the SYS_CONNECT_BY_PATH function:

OracleFactory ora = new OracleFactory(connection);

List<?> paths =
ora.select(ora.sysConnectByPath(Directory.NAME, "/").substring(2))
.from(Directory)
.connectBy(ora.prior(Directory.ID).equal(Directory.PARENT_ID))
.startWith(Directory.PARENT_ID.isNull())
.orderBy(ora.literal(1))
.fetch(0);

jOOQ’s output could then look like this:

+------------------------------------------------+
|substring                                       |
+------------------------------------------------+
|C:                                              |
|C:/eclipse                                      |
|C:/eclipse/configuration                        |
|C:/eclipse/dropins                              |
|C:/eclipse/eclipse.exe                          |
+------------------------------------------------+
|...21 record(s) truncated...

In the near future, jOOQ is going to project the CONNECT BY syntax and API to other RDBMS’s Common Table Expression syntax. That way, you can express hierarchical queries in any database supporting CTE’s using Oracle’s CONNECT BY syntax.