Beautiful SQL: Lateral Unnesting of Array Columns


Sometimes, SQL can just be so beautiful. One of the less mainstream features in SQL is the array type (or nested collections). In fact, it’s so not mainstream that only 2 major databases actually support it: Oracle and PostgreSQL (and HSQLDB and H2 in the Java ecosystem).

In PostgreSQL, you can write:

CREATE TABLE blogs (
  id    SERIAL NOT NULL PRIMARY KEY,
  title text   NOT NULL,
  tags  text[]
)

Or in Oracle:

-- Oracle only knows nominal array types, so we have to declare
-- them in advance
CREATE TYPE tag_t AS VARRAY(100) OF VARCHAR2(100 CHAR);

CREATE TABLE blogs (
  id    NUMBER(18) GENERATED BY DEFAULT AS IDENTITY 
                   NOT NULL PRIMARY KEY,
  title VARCHAR2(100 CHAR) NOT NULL,
  tags  tag_t
)

So, roughly the same thing. Now, let’s insert some data. How about the 3 most recent posts on the jOOQ blog, prior to this one:

In PostgreSQL:

INSERT INTO blogs (title, tags)
VALUES (
  'How to Fetch Oracle 12c Implicit Cursors with JDBC and jOOQ',
  ARRAY[
    'implicit cursor',
    'batch',
    'oracle',
    'jooq',
    'jdbc',
    'resultset'
  ]
), (
  'How to Execute SQL Batches With JDBC and jOOQ',
  ARRAY[
    'batch',
    'batch statement',
    'mysql',
    'jooq',
    'jdbc',
    'sql server',
    'sql'
  ]
), (
  'How to Emulate Partial Indexes in Oracle',
  ARRAY[
    'optimisation',
    'index',
    'partial index',
    'oracle',
    'sql',
    'postgresql',
    't-sql',
    'sql server'
  ]
)

Or in Oracle:

INSERT INTO blogs (title, tags)
VALUES (
  'How to Fetch Oracle 12c Implicit Cursors with JDBC and jOOQ',
  tag_t(
    'implicit cursor',
    'batch',
    'oracle',
    'jooq',
    'jdbc',
    'resultset'
  ));
INSERT INTO blogs (title, tags)
VALUES (
  'How to Execute SQL Batches With JDBC and jOOQ',
  tag_t(
    'batch',
    'batch statement',
    'mysql',
    'jooq',
    'jdbc',
    'sql server',
    'sql'
  ));
INSERT INTO blogs (title, tags)
VALUES (
  'How to Emulate Partial Indexes in Oracle',
  tag_t(
    'optimisation',
    'index',
    'partial index',
    'oracle',
    'sql',
    'postgresql',
    't-sql',
    'sql server'
  ));

Now, the array type by itself is not very useful. Where it gets really interesting is when we unnest it again into a table. For instance, in PostgreSQL:

SELECT title, tag
FROM blogs, LATERAL unnest(tags) AS tags(tag);

Or in Oracle:

-- Classic style
SELECT title, tags.*
FROM blogs, TABLE(tags) tags;

-- Since Oracle 12c
SELECT title, tags.*
FROM blogs, LATERAL (SELECT * FROM TABLE(tags)) tags;

Note that we’re using the keyword LATERAL in some of the above queries. For those of you who are used to T-SQL syntax, it’s almost the same thing as APPLY. Both LATERAL and APPLY are also very useful with table-valued functions (stay tuned for a blog post on those).
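For illustration, here’s roughly what the APPLY version could look like in T-SQL. This is a sketch, not a direct equivalent: SQL Server has no array type, so we’re assuming the tags are stored in a comma-separated string column and split with SQL Server 2016’s STRING_SPLIT table-valued function:

-- Hypothetical T-SQL counterpart, assuming tags is a comma-separated string
SELECT b.title, t.value AS tag
FROM blogs b
CROSS APPLY STRING_SPLIT(b.tags, ',') t;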

The idea behind LATERAL is that the table (derived table, subquery, function call, array unnesting) on the right side of LATERAL can “laterally” access stuff from the left side of LATERAL in order to produce new tables. In the above query, we’re producing a new table of tags for each blog post, and then we cross join the two tables.
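By the way, the comma between blogs and the unnested array is just plain old cross join syntax. If you prefer explicit join syntax, the PostgreSQL query can equivalently be written as:

SELECT title, tag
FROM blogs
CROSS JOIN LATERAL unnest(tags) AS tags(tag);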

Here’s what the above queries result in:

title                                                         tag
-----------------------------------------------------------------------------
How to Fetch Oracle 12c Implicit Cursors with JDBC and jOOQ   implicit cursor
How to Fetch Oracle 12c Implicit Cursors with JDBC and jOOQ   batch
How to Fetch Oracle 12c Implicit Cursors with JDBC and jOOQ   oracle
How to Fetch Oracle 12c Implicit Cursors with JDBC and jOOQ   jooq
How to Fetch Oracle 12c Implicit Cursors with JDBC and jOOQ   jdbc
How to Fetch Oracle 12c Implicit Cursors with JDBC and jOOQ   resultset
How to Execute SQL Batches With JDBC and jOOQ                 batch
How to Execute SQL Batches With JDBC and jOOQ                 batch statement
How to Execute SQL Batches With JDBC and jOOQ                 mysql
How to Execute SQL Batches With JDBC and jOOQ                 jooq
How to Execute SQL Batches With JDBC and jOOQ                 jdbc
How to Execute SQL Batches With JDBC and jOOQ                 sql server
How to Execute SQL Batches With JDBC and jOOQ                 sql
How to Emulate Partial Indexes in Oracle                      optimisation
How to Emulate Partial Indexes in Oracle                      index
How to Emulate Partial Indexes in Oracle                      partial index
How to Emulate Partial Indexes in Oracle                      oracle
How to Emulate Partial Indexes in Oracle                      sql
How to Emulate Partial Indexes in Oracle                      postgresql
How to Emulate Partial Indexes in Oracle                      t-sql
How to Emulate Partial Indexes in Oracle                      sql server

You can immediately see the cross join semantics here, as we’re combining each tag (per post) with its post.

Looking for ordinals (i.e. the tag number inside of the array) along with the array? Easy!

Just add the powerful WITH ORDINALITY clause after the UNNEST() call in PostgreSQL:

SELECT title, tag, ordinal
FROM blogs, LATERAL unnest(tags) WITH ORDINALITY AS tags(tag, ordinal);

A bit more complicated to emulate in Oracle:

-- Fancy, with a window function
SELECT title, tags.*
FROM blogs, LATERAL (
  SELECT tags.*, ROW_NUMBER() OVER (ORDER BY NULL) ordinal
  FROM TABLE(tags) tags
) tags;

-- Classic, with ROWNUM
SELECT title, tags.*
FROM blogs, LATERAL (
  SELECT tags.*, ROWNUM ordinal
  FROM TABLE(tags) tags
) tags;

The result now contains the tag “ID”, i.e. the ordinal of the tag inside of the array:

title                                           tag               ordinal
-------------------------------------------------------------------------
How to Fetch ... Cursors with JDBC and jOOQ     implicit cursor   1
How to Fetch ... Cursors with JDBC and jOOQ     batch             2
How to Fetch ... Cursors with JDBC and jOOQ     oracle            3
How to Fetch ... Cursors with JDBC and jOOQ     jooq              4
How to Fetch ... Cursors with JDBC and jOOQ     jdbc              5
How to Fetch ... Cursors with JDBC and jOOQ     resultset         6
How to Execute SQL Batches With JDBC and jOOQ   batch             1
How to Execute SQL Batches With JDBC and jOOQ   batch statement   2
How to Execute SQL Batches With JDBC and jOOQ   mysql             3
How to Execute SQL Batches With JDBC and jOOQ   jooq              4
How to Execute SQL Batches With JDBC and jOOQ   jdbc              5
How to Execute SQL Batches With JDBC and jOOQ   sql server        6
How to Execute SQL Batches With JDBC and jOOQ   sql               7
How to Emulate Partial Indexes in Oracle        optimisation      1
How to Emulate Partial Indexes in Oracle        index             2
How to Emulate Partial Indexes in Oracle        partial index     3
How to Emulate Partial Indexes in Oracle        oracle            4
How to Emulate Partial Indexes in Oracle        sql               5
How to Emulate Partial Indexes in Oracle        postgresql        6
How to Emulate Partial Indexes in Oracle        t-sql             7
How to Emulate Partial Indexes in Oracle        sql server        8

Now, imagine looking for those blog posts that are tagged “jooq”. Easy!

PostgreSQL:

SELECT title
FROM blogs
WHERE 'jooq' = ANY(tags);
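If you prefer operator syntax, the same check can also be expressed with PostgreSQL’s array “contains” operator, which (as a bonus) can make use of a GIN index on the tags column:

SELECT title
FROM blogs
WHERE tags @> ARRAY['jooq'];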

Oracle:

SELECT title
FROM blogs
WHERE 'jooq' IN (SELECT * FROM TABLE(tags));

Yielding:

title
-----------------------------------------------------------
How to Fetch Oracle 12c Implicit Cursors with JDBC and jOOQ
How to Execute SQL Batches With JDBC and jOOQ

Conclusion

These are just a few nice things we can do when we denormalise our data into nested collections / arrays, and then use features like UNNEST to bring them back to the table level. Both Oracle and PostgreSQL support a variety of really nice features building on top of arrays, so do check them out!

How to Emulate Partial Indexes in Oracle


A very interesting feature of the SQL Server and PostgreSQL databases (and some others, including SQLite) is the partial index (sometimes also called “filtered index”). That’s an index that contains only “parts” of the table data. For instance, we can write the following index in SQL Server and PostgreSQL:

CREATE INDEX i ON message (deleted) WHERE deleted = 1;

Let’s imagine you have a housekeeping job that periodically removes deleted messages. Now, let’s assume you have discovered that only 0.1% of all messages are really deleted, so an index on the DELETED column is very selective if you’re looking for deleted messages.

On the other hand, it’s not selective at all if you’re looking for non-deleted messages, as such a query would return 99.9% of all messages, in case of which a full table scan is more efficient.

So, since the index is never useful for non-deleted messages, why index those messages at all? If we can avoid indexing non-deleted messages, then we can:

  • Save a lot of disk space, as the index will be much smaller
  • Save a lot of time inserting into the message table, since we don’t have to update the index all the time
  • Save a lot of time scanning the index, since it will contain a lot less blocks

Unfortunately, Oracle doesn’t support this feature

… but “luckily”, Oracle has another controversial “feature”. In Oracle, all indexes are partial indexes, because they don’t contain NULL values. This is probably due to an ancient optimisation (remember, partial indexes are smaller), which occasionally gets in your way, performance-wise, if you do want to query for NULL values.
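(As an aside: if you ever do need to find NULLs through an index in Oracle, a common workaround is to append a non-NULL constant to the index, because an index entry is omitted only if the entire key is NULL. A sketch:)

-- Not needed for this article, just for completeness
CREATE INDEX i_nulls ON message (deleted, 0);

-- This query can now use an index range scan:
SELECT * FROM message WHERE deleted IS NULL;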

But in this case, it’s useful. Check this out:

CREATE TABLE message(deleted number(1));

CREATE INDEX i ON message (
  CASE WHEN deleted > 0 THEN deleted END
);

The above index is a function-based index, i.e. an index that contains not the value of the deleted column itself, but an expression based on it. Concretely, it contains only deleted values that are strictly greater than zero, because if the value is zero, then it is turned to NULL by the CASE expression, and Oracle doesn’t index NULL values. Check this out:

INSERT INTO message
SELECT DECODE(LEVEL, 1, 1, 0)
FROM dual
CONNECT BY LEVEL < 100000;

The above query is inserting a single row containing a deleted value of 1, and almost 100k rows containing a value of 0. The insert is very quick, because only one row has to be added to the index. The other almost 100k rows are skipped:

EXEC dbms_stats.gather_table_stats('SANDBOX', 'MESSAGE');

SELECT NUM_ROWS, BLOCKS
FROM SYS.ALL_TAB_STATISTICS
WHERE TABLE_NAME = 'MESSAGE';

SELECT NUM_ROWS, LEAF_BLOCKS
FROM SYS.ALL_IND_STATISTICS
WHERE TABLE_NAME = 'MESSAGE';

The result is:

NUM_ROWS       BLOCKS
---------------------
   99999          152 <-- table size

NUM_ROWS  LEAF_BLOCKS
---------------------
       1            1 <-- index size

The “trouble” with this kind of emulation is: It’s a function-based index. We can use this index only if we really reproduce the same “function” (or in this case, expression) as in the index itself. So, in order to fetch all the deleted messages, we must not write the following query:

SELECT *
FROM message
WHERE deleted = 1;

But this one, instead:

SELECT *
FROM message
WHERE CASE WHEN deleted > 0 THEN deleted END = 1;

Check out execution plans:

EXPLAIN PLAN FOR
SELECT *
FROM message
WHERE deleted = 1;

SELECT * FROM TABLE(dbms_xplan.display);

EXPLAIN PLAN FOR
SELECT *
FROM message
WHERE CASE WHEN deleted > 0 THEN deleted END = 1;

SELECT * FROM TABLE(dbms_xplan.display);

The output being:

------------------------------------------------------------------
| Id  | Operation         | Name    | Rows  | Bytes | Cost (%CPU)|
------------------------------------------------------------------
|   0 | SELECT STATEMENT  |         | 50000 |   146K|    44   (3)|
|*  1 |  TABLE ACCESS FULL| MESSAGE | 50000 |   146K|    44   (3)|
------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - filter("DELETED"=1)

And

----------------------------------------------------------------------------
| Id  | Operation                   | Name    | Rows  | Bytes | Cost (%CPU)|
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |         |     1 |     3 |     2   (0)|
|   1 |  TABLE ACCESS BY INDEX ROWID| MESSAGE |     1 |     3 |     2   (0)|
|*  2 |   INDEX RANGE SCAN          | I       |     1 |       |     1   (0)|
----------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access(CASE  WHEN "DELETED">0 THEN "DELETED" END =1)

As you can see, the first query runs a full table scan, estimating to retrieve 50% of all the rows, when the actual result is only 1 row as can be seen in the second execution plan!

Insertion speed

What’s even more impressive is the difference in insertion speed. Consider the following code block, which measures the time it takes to insert one million rows with the value 0 and one million rows with the value 1:

SET SERVEROUTPUT ON
DECLARE
  ts TIMESTAMP;
BEGIN
  ts := SYSTIMESTAMP;
  INSERT INTO message 
  SELECT 0 FROM dual CONNECT BY level <= 1000000;
  dbms_output.put_line(SYSTIMESTAMP - ts);
  
  ts := SYSTIMESTAMP;
  INSERT INTO message 
  SELECT 1 FROM dual CONNECT BY level <= 1000000;
  dbms_output.put_line(SYSTIMESTAMP - ts);
END;
/

The result being:

+000000000 00:00:01.501000000
+000000000 00:00:08.162000000

The insertion is much faster if we don’t have to modify the index!

Conclusion

Partial indexes are a very neat trick in cases where your data is highly skewed and some values in a column are extremely rare and very frequently queried. They may drastically reduce the index size, which greatly improves performance in some situations, including inserting into the table, and querying the index.

In Oracle, they can be emulated using function-based indexes, which means you have to use the exact function expression from the index also in queries, in order to profit. But it may well be worth the trouble!

Avoid Using COUNT() in SQL When You Could Use EXISTS()


A while ago, I blogged about the importance of avoiding unnecessary COUNT(*) queries:
https://blog.jooq.org/2014/08/08/sql-tip-of-the-day-be-wary-of-select-count

… and how to replace them with equivalent EXISTS queries.


As I’m updating the jOOQ SQL Masterclass to show also PostgreSQL performance characteristics in addition to Oracle, I really have to reiterate this topic. Please repeat after me:

Thou shalt not use COUNT(*) when EXISTS sufficeth thy needs

The rationale is simple

COUNT(*) needs to return the exact number of rows. EXISTS only needs to answer a question like:

  “Are there any rows at all?”

In other words, EXISTS can short-circuit after having found the first matching row. If your client code (e.g. written in Java or in PL/SQL, or any other client language) needs to know something like:

  “Did actors called “Wahlberg” play in any films at all?”

Then you have two options to write that query:

Very very bad: Use COUNT(*)

Using PostgreSQL syntax:

SELECT count(*)
FROM actor a
JOIN film_actor fa USING (actor_id)
WHERE a.last_name = 'WAHLBERG'

The above query will return a number > 0 if any Wahlberg played in a film, or 0 if not. Notice that we don’t care how many films all the Wahlbergs played in, yet we ask the database to calculate the precise number.

Let’s run the above query against the Sakila database. The execution plans for the above query in Oracle:

Oracle Execution Plan for a Query with COUNT(*)

And in PostgreSQL:

PostgreSQL Execution Plan for a Query with COUNT(*)

Much much better: Use EXISTS()

Using PostgreSQL syntax:

SELECT EXISTS (
  SELECT * FROM actor a
  JOIN film_actor fa USING (actor_id)
  WHERE a.last_name = 'WAHLBERG'
)

The execution plans for the above query in Oracle:

Oracle Execution Plan for a Query with EXISTS()

And in PostgreSQL:

PostgreSQL Execution Plan for a Query with EXISTS()

How to read this?

As you can see from the above execution plans, the cost in Oracle is slightly better (going from 3 to 2) when using EXISTS than when using COUNT(*), because of a much better cardinality estimate in the middle of the execution plan. In other words, Oracle “knows” that we’re looking for only one record and as soon as that record has been found, we can stop looking.

In PostgreSQL, things are more drastic (going from 123 to 3.4). The EXISTS version has an associated cost that is roughly 36x lower than the version that uses COUNT(*) for the same result.
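If you want to reproduce this yourself (assuming you have the Sakila database loaded in PostgreSQL), just prefix both queries with EXPLAIN:

EXPLAIN
SELECT count(*)
FROM actor a
JOIN film_actor fa USING (actor_id)
WHERE a.last_name = 'WAHLBERG';

EXPLAIN
SELECT EXISTS (
  SELECT * FROM actor a
  JOIN film_actor fa USING (actor_id)
  WHERE a.last_name = 'WAHLBERG'
);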

You can gaze at the plan for a while and figure out what the exact difference is, or you can believe me that this is true:

It is obviously much faster to check for existence rather than to count all results, if what you’re really trying to do is check for existence

Duh.

Does this apply to me?

Yes. I’m taking bets. Many many code bases out there get this wrong all the time. Checking for sizes to be zero is just too convenient. Not only in SQL, but also in Java. Consider this. Which one is better?

Collection<?> collection = ...

// EXISTS
if (!collection.isEmpty())
    doSomething();

// COUNT(*)
if (collection.size() == 0)
    doSomething();

Sometimes, this doesn’t really matter, e.g. in ArrayList, whose isEmpty() method reads:

public boolean isEmpty() {
    return size == 0;
}

But what if your collection is a lazy loaded Hibernate collection? Not all collections cache this size value, and even if they do, they may still produce overhead in the source system in order to calculate the exact size. In fact, they might even run a completely unnecessary query fetching all the child entities from the database just to check for existence.
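In SQL terms, what you’d want your ORM (or your own code) to emit for such a check is an existence query, not a full fetch. A sketch, using a hypothetical parent/child schema:

-- Enough to answer "are there any children?":
SELECT EXISTS (
  SELECT 1 FROM child WHERE parent_id = ?
);

-- Wasteful, if existence is all you wanted to know:
SELECT * FROM child WHERE parent_id = ?;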

Bonus exercise for my Hibernate-aficionado readers out there: Do the exercise with Hibernate. Because at this point, I for one would say: Just use SQL™

OK, costs. But what does it mean?

Let’s benchmark these two statements in Oracle and PostgreSQL.

Oracle

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP WITH TIME ZONE;
  v_repeat CONSTANT NUMBER := 10000;
BEGIN
  v_ts := SYSTIMESTAMP;
    
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT CASE WHEN EXISTS (
        SELECT * FROM actor a
        JOIN film_actor fa USING (actor_id)
        WHERE a.last_name = 'WAHLBERG'
      ) THEN 1 ELSE 0 END
      FROM dual
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
    
  dbms_output.put_line('Statement 1 : ' || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;
    
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT count(*)
      FROM actor a
      JOIN film_actor fa USING (actor_id)
      WHERE a.last_name = 'WAHLBERG'
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
    
  dbms_output.put_line('Statement 2 : ' || (SYSTIMESTAMP - v_ts));
END;
/

We get a slight but significant performance improvement of roughly 1.3x:

Statement 1 : 3
Statement 2 : 4

(not actual times, because thank you Oracle legal for prohibiting all sorts of stuff). But you can check out the Sakila database yourself and run the above benchmark on your machine.

PostgreSQL

DO $$
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT INT := 1000;
  rec RECORD;
BEGIN
  v_ts := clock_timestamp();

  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT CASE WHEN EXISTS (
        SELECT * FROM actor a
        JOIN film_actor fa USING (actor_id)
        WHERE a.last_name = 'WAHLBERG'
      ) THEN 1 ELSE 0 END
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;

  RAISE INFO 'Statement 1: %', (clock_timestamp() - v_ts); 
  v_ts := clock_timestamp();

  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT count(*)
      FROM actor a
      JOIN film_actor fa USING (actor_id)
      WHERE a.last_name = 'WAHLBERG'
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;

  RAISE INFO 'Statement 2: %', (clock_timestamp() - v_ts); 
END$$;

A whopping factor of more than 30x in terms of wall-clock time gain!

INFO:  Statement 1: 00:00:00.023656
INFO:  Statement 2: 00:00:00.7944

Let me repeat this:

A factor of more than 30x on PostgreSQL

That’s something! It looks as though COUNT(*) is much better optimised on Oracle (e.g. by counting leaf nodes in an index) than on PostgreSQL, but in any case, the amount of extra work is prohibitive in both databases.

Conclusion

I’m repeating myself, but this is important. Print it out and put it on your office wall:

Thou shalt not use COUNT(*) when EXISTS sufficeth thy needs

Thank you.

The 10 Most Popular DB Engines (SQL and NoSQL) in 2015


About two years ago, we published this post about the 10 most popular DB engines, in which we analyzed the data published by Solid IT on their DB Ranking website.

In the meantime, the Solid IT measurement system has proven to be a credible source, and the website has been cited by Gartner, InfoWorld, and many other outlets. For more details about how this is measured, check out the relevant page on db-engines.com:
http://db-engines.com/en/ranking_definition

Comparing the top 10 list, we can see that the players have shifted, heavily:

Reproduced with permission of DB-Engines.com

The top 3 elephants are still Oracle, MySQL and Microsoft SQL Server, but the runners-up have changed. While PostgreSQL is still gaining traction, it has lost ground compared to MongoDB.

Also, Cassandra and Redis have pushed out Sybase and Teradata from the top 10!

Where two years ago there was only a single non-RDBMS in this top 10 list, there are now three, all of them “schema-less”, and they’re gaining additional momentum.

Clearly, vendors of RDBMS will need to move quickly to accommodate the needs of document storage and key-value storage the way their new competitors do.

For us on the jOOQ blog, a much more interesting perspective is to see where our supported databases currently are in this ranking:

Reproduced with permission of DB-Engines.com

With the recent release of jOOQ 3.7, we’ve added another three databases to our list of now 21 supported RDBMS. Compared to last year’s ranking, almost all of these RDBMS are gaining traction as well, apart from SQL Server.

One thing is certain: We still have very exciting times ahead. Stay tuned for more updates, and check out the current ranking here:

http://db-engines.com/en/ranking

Implementing Client-Side Row-Level Security with jOOQ


Some time ago, we promised to follow up on our Constraints on Views article with a sequel showing how to implement client-side row-level security with jOOQ.

What is row-level security?

Some databases like Oracle or the upcoming PostgreSQL 9.5 provide native support for row-level security, which is awesome – but not every database has this feature.

Row-level security essentially means that a given database session can access only some rows in the database, but not others. For instance, this is what John can see in his session:

id  account_owner  account_name             amount
--------------------------------------------------
1   John           savings                  500.00
2   Jane           savings                 1300.00
3   John           secret poker stash  85193065.00

In the above example, assume that John and Jane are a married couple with access to each others’ bank accounts, except that John wants to hide his secret poker stash from Jane because he’s planning on running away with Jill to the Bahamas living the good life. So, the above would model John’s view on the data set, whereas the below would model Jane’s:

id  account_owner  account_name             amount
--------------------------------------------------
1   John           savings                  500.00
2   Jane           savings                 1300.00

The nice thing about row-level security is the fact that the data is completely hidden from a database session, as if it weren’t even there. This includes aggregations or filtering, as can be seen in the below examples of John’s view:

id  account_owner  account_name             amount
--------------------------------------------------
1   John           savings                  500.00
2   Jane           savings                 1300.00
3   John           secret poker stash  85193065.00
--------------------------------------------------
Total John+Jane                        85194865.00

vs. Jane’s view:

id  account_owner  account_name             amount
--------------------------------------------------
1   John           savings                  500.00
2   Jane           savings                 1300.00
--------------------------------------------------
Total John+Jane                            1800.00

If your database doesn’t support row-level security, you will need to emulate it somehow. One way to do that is by using views, and possibly, temporary tables or some context variables:

CREATE VIEW my_accounts AS
SELECT
  a.id,
  a.account_owner,
  a.account_name,
  a.amount
FROM
  accounts a
JOIN
  account_privileges p
ON
  a.id = p.a_id
WHERE
  p.privilege_owner = SYS_CONTEXT('banking', 'user');

In the above example, I’m using Oracle’s SYS_CONTEXT function, which allows accessing global context from the current session. This context could be initialised every time a JDBC Connection is fetched from the connection pool, for instance.
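For those unfamiliar with Oracle application contexts, here’s a minimal sketch of how such a “banking” context could be declared and initialised, assuming a banking_ctx package acting as the context’s trusted package:

CREATE OR REPLACE CONTEXT banking USING banking_ctx;

CREATE OR REPLACE PACKAGE banking_ctx AS
  PROCEDURE set_user(p_user VARCHAR2);
END banking_ctx;
/

CREATE OR REPLACE PACKAGE BODY banking_ctx AS
  PROCEDURE set_user(p_user VARCHAR2) IS
  BEGIN
    -- Only this package may write to the 'banking' namespace
    dbms_session.set_context('banking', 'user', p_user);
  END set_user;
END banking_ctx;
/

-- E.g. run once per session, right after fetching the connection:
BEGIN
  banking_ctx.set_user('John');
END;
/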

What if your database doesn’t support these things?

jOOQ to the rescue! Since jOOQ 3.2, a VisitListener Service Provider Interface (SPI) has been introduced for precisely this reason, which allows jOOQ users to perform SQL AST transformations while the SQL statement is being generated.

We’re assuming:

CREATE TABLE accounts (
  id BIGINT NOT NULL PRIMARY KEY,
  account_owner VARCHAR(20) NOT NULL,
  account_name VARCHAR(20) NOT NULL,
  amount DECIMAL(18, 2) NOT NULL
);

CREATE TABLE transactions (
  id BIGINT NOT NULL PRIMARY KEY,
  account_id BIGINT NOT NULL,
  amount DECIMAL(18, 2) NOT NULL
);

INSERT INTO accounts (
  id, account_owner, account_name, amount
)
VALUES (1, 'John', 'savings', 500.0),
       (2, 'Jane', 'savings', 1300.0),
       (3, 'John', 'secret poker stash', 85193065.00);

INSERT INTO transactions (
  id, account_id, amount
)
VALUES (1, 1, 200.0),
       (2, 1, 300.0),
       (3, 2, 300.0),
       (4, 2, 800.0),
       (5, 2, 200.0),
       (6, 3, 85193065.00);

Adding a simple WHERE clause

This is the simplest transformation use-case:

-- turn this...
SELECT 
  accounts.account_name, 
  accounts.amount
FROM 
  accounts

-- ... into this
SELECT 
  accounts.account_name, 
  accounts.amount
FROM 
  accounts
WHERE 
  accounts.id IN (?, ?) -- Predicate, for simplicity

Adding an AND clause to an existing WHERE clause

The transformation should still work, if there is already an existing predicate

-- turn this...
SELECT 
  accounts.account_name, 
  accounts.amount
FROM 
  accounts
WHERE 
  accounts.account_owner = 'John'

-- ... into this
SELECT 
  accounts.account_name, 
  accounts.amount
FROM 
  accounts
WHERE 
  accounts.account_owner = 'John'
AND 
  accounts.id IN (?, ?)

Adding the predicate also for aliased tables

When doing self-joins or for whatever other reason you might have applied an alias

-- turn this...
SELECT 
  a.account_name, 
  a.amount
FROM 
  accounts a

-- ... into this
SELECT 
  a.account_name, 
  a.amount
FROM 
  accounts a
WHERE 
  a.id IN (?, ?)

Adding the predicate also in subqueries / joins / semi-joins

Of course, we shouldn’t restrict ourselves to patching top-level SELECT statements. The following transformation must be applied as well:

-- turn this...
SELECT 
  t.id,
  t.amount
FROM 
  transactions t
WHERE
  t.account_id IN (
    SELECT 
      a.id 
    FROM 
      accounts a
    WHERE 
      a.account_owner = 'John'
  )

-- ... into this
SELECT 
  t.id,
  t.amount
FROM 
  transactions t
WHERE
  t.account_id IN (
    SELECT 
      a.id 
    FROM 
      accounts a
    WHERE 
      a.account_owner = 'John'
    AND
      a.id IN (?, ?)
  )

Adding the predicate to foreign key references

This might be easily forgotten, but in fact, we also want to add an additional predicate to all foreign key references of accounts.id, namely transactions.account_id, especially when the transactions table is not joined to the accounts table:

-- turn this...
SELECT 
  t.id,
  t.amount
FROM 
  transactions t

-- ... into this
SELECT 
  t.id,
  t.amount
FROM 
  transactions t
WHERE
  t.account_id IN (?, ?)

Long story short, we would like to find all queries that refer to the accounts table in some form, and add a simple predicate to it, implementing access control.

DISCLAIMER: As always with security, you should implement security on several layers. SQL AST transformation is not trivial, and the above scenarios are incomplete. Besides, they work only for queries that are built using the jOOQ AST, not for plain SQL queries, or for queries that are run via JDBC directly, via Hibernate, or via stored procedures, views (which in turn refer to the accounts table), or simple table synonyms.

So, read this post as a tutorial showing how to perform AST transformation, not as a complete solution to row-level security.

How to do it with jOOQ?

Now comes the interesting part. We’re going to do the whole thing with jOOQ. First off, remember that in order to implement backwards-compatible SPI evolution, we always provide a default implementation for our SPIs. In this case, we’re going to extend DefaultVisitListener instead of implementing VisitListener directly.

The base of our VisitListener will be the following:

public class AccountIDFilter extends DefaultVisitListener {

    final Integer[] ids;

    public AccountIDFilter(Integer... ids) {
        this.ids = ids;
    }
}

In other words, a filtering listener that filters for a given set of IDs to be put in an IN predicate.

Now, first off, we’ll need a bit of utility methods. The following two utilities push or pop some objects on top of a stack:

void push(VisitContext context) {
    conditionStack(context).push(new ArrayList<>());
    whereStack(context).push(false);
}

void pop(VisitContext context) {
    whereStack(context).pop();
    conditionStack(context).pop();
}

… and the stacks can be seen here:

Deque<List<Condition>> conditionStack(
        VisitContext context) {
    Deque<List<Condition>> data = (Deque<List<Condition>>) 
        context.data("conditions");

    if (data == null) {
        data = new ArrayDeque<>();
        context.data("conditions", data);
    }

    return data;
}

Deque<Boolean> whereStack(VisitContext context) {
    Deque<Boolean> data = (Deque<Boolean>) 
        context.data("predicates");

    if (data == null) {
        data = new ArrayDeque<>();
        context.data("predicates", data);
    }

    return data;
}

In prose, the conditionStack maintains a stack of conditions for each subquery, whereas the whereStack maintains a stack of flags for each subquery. The conditions are the conditions that should be generated in the WHERE clause of the given subquery, whereas the flags indicate whether a WHERE clause is already present (i.e. whether new conditions should be appended using AND, rather than WHERE).

For convenience, we’ll also add the following utilities:

List<Condition> conditions(VisitContext context) {
    return conditionStack(context).peek();
}

boolean where(VisitContext context) {
    return whereStack(context).peek();
}

void where(VisitContext context, boolean value) {
    whereStack(context).pop();
    whereStack(context).push(value);
}

These utilities allow for accessing both the conditions and flags at the top of the stack (in the current subquery), as well as for replacing the flag at the top of the stack.

Why do we need a stack?

It’s simple. We want to apply the predicate only locally for the current subquery, while jOOQ transforms and generates your SQL statement. Remember when we were transforming the following:

SELECT 
  t.id,
  t.amount
FROM 
  transactions t
WHERE
  t.account_id IN (
    SELECT 
      a.id 
    FROM 
      accounts a
    WHERE 
      a.account_owner = 'John'
    AND
      a.id IN (?, ?)
  )
AND
  t.account_id IN (?, ?)

… in the end, we want two additional predicates generated in the above query. One in the subquery selecting from accounts, and another one in the top-level query selecting from transactions, but the two predicates shouldn’t interfere with each other, i.e. when jOOQ generates the subquery, we only want to see objects that are relevant to the subquery (top of the stack).

So, let’s see how and when we push stuff on the stack. First off, we need to listen for start and end events of SQL clauses:

Starting a SQL clause

This is straightforward. Every time we enter a new SQL statement, we want to push a new set of data (conditions, flags) on the stack. In a way, we’re creating a local scope for the subquery:

@Override
public void clauseStart(VisitContext context) {

    // Enter a new SELECT clause / nested select
    // or DML statement
    if (context.clause() == SELECT ||
        context.clause() == UPDATE ||
        context.clause() == DELETE ||
        context.clause() == INSERT) {
        push(context);
    }
}

Of course, this scope has to be cleaned up at the end of the same clause:

@Override
public void clauseEnd(VisitContext context) {
    // [ ... more code will follow ... ]

    // Leave a SELECT clause / nested select
    // or DML statement
    if (context.clause() == SELECT ||
        context.clause() == UPDATE ||
        context.clause() == DELETE ||
        context.clause() == INSERT) {
        pop(context);
    }
}

This was the easy part. Now, it gets a bit more interesting. When we end a clause, and that clause is a WHERE clause of a SELECT, UPDATE, or DELETE statement, then we want to render an additional keyword and predicate:

@Override
public void clauseEnd(VisitContext context) {

    // Append all collected predicates to the WHERE
    // clause if any
    if (context.clause() == SELECT_WHERE ||
        context.clause() == UPDATE_WHERE ||
        context.clause() == DELETE_WHERE) {
        List<Condition> conditions = 
            conditions(context);

        if (conditions.size() > 0) {
            context.context()
                   .formatSeparator()
                   .keyword(where(context) 
                   ? "and" 
                   : "where"
                   )
                   .sql(' ');

            context.context().visit(
                DSL.condition(Operator.AND, conditions)
            );
        }
    }

    // [ ... code from previous snippet ... ]
}

So, the only thing we still need to do is assemble that List<Condition> on the stack of the current query, such that we can append it to the WHERE clause, as well as the flag that decides between "and" and "where". This can be done by overriding visitEnd(), a method that is invoked at the end of visiting a QueryPart (as opposed to a Clause):

@Override
public void visitEnd(VisitContext context) {

    // We'll see what this means in a bit...
    pushConditions(context, ACCOUNTS, 
        ACCOUNTS.ID, ids);
    pushConditions(context, TRANSACTIONS, 
        TRANSACTIONS.ACCOUNT_ID, ids);

    // Check if we're rendering any condition within
    // the WHERE clause. In this case, we can be sure
    // that jOOQ will render a WHERE keyword
    if (context.queryPart() instanceof Condition) {
        List<Clause> clauses = clauses(context);

        if (clauses.contains(SELECT_WHERE) ||
            clauses.contains(UPDATE_WHERE) ||
            clauses.contains(DELETE_WHERE)) {
            where(context, true);
        }
    }
}

List<Clause> clauses(VisitContext context) {
    List<Clause> result = asList(context.clauses());
    int index = result.lastIndexOf(SELECT);

    if (index > 0)
        return result.subList(index, result.size() - 1);
    else
        return result;
}

At the end of each query part…

  • we’re trying to push the relevant conditions on the stack if applicable
  • we’re checking to see if a WHERE clause is present and set the relevant flag

So, finally, the core logic contained in pushConditions() is the only thing missing, and that’s:

<E> void pushConditions(
        VisitContext context, 
        Table<?> table, 
        Field<E> field, 
        E... values) {

    // Check if we're visiting the given table
    if (context.queryPart() == table) {
        List<Clause> clauses = clauses(context);

        // ... and if we're in the context of the current
        //  subselect's FROM clause
        if (clauses.contains(SELECT_FROM) ||
            clauses.contains(UPDATE_UPDATE) ||
            clauses.contains(DELETE_DELETE)) {

            // If we're declaring a TABLE_ALIAS...
            // (e.g. "ACCOUNTS" as "a")
            if (clauses.contains(TABLE_ALIAS)) {
                QueryPart[] parts = context.queryParts();

                // ... move up the QueryPart visit path to find the
                // defining aliased table, and extract the aliased
                // field from it. (i.e. the "a" reference)
                for (int i = parts.length - 2; i >= 0; i--) {
                    if (parts[i] instanceof Table) {
                        field = ((Table<?>) parts[i]).field(field);
                        break;
                    }
                }
            }

            // Push a condition for the field of the
            // (potentially aliased) table
            conditions(context).add(field.in(values));
        }
    }
}

And we’re done! Whew – well, that certainly wasn’t too easy. AST transformation never is. But the above algorithm is robust and can be used to run all of the aforementioned queries.

Testing the above

Configuration configuration = create().configuration();

// This configuration has full access to all rows
DSLContext fullaccess = DSL.using(configuration);

// This configuration has restricted access to IDs 1 and 2
DSLContext restricted = DSL.using(
    configuration.derive(
        DefaultVisitListenerProvider.providers(
            new AccountIDFilter(1, 2)
        )
    )
);

// Fetching accounts
assertEquals(3, fullaccess.fetch(ACCOUNTS).size());
assertEquals(2, restricted.fetch(ACCOUNTS).size());

Generating the following SQL:

select 
  "PUBLIC"."ACCOUNTS"."ID", 
  "PUBLIC"."ACCOUNTS"."ACCOUNT_OWNER", 
  "PUBLIC"."ACCOUNTS"."ACCOUNT_NAME", 
  "PUBLIC"."ACCOUNTS"."AMOUNT"
from "PUBLIC"."ACCOUNTS"
---------------------------------------
select 
  "PUBLIC"."ACCOUNTS"."ID", 
  "PUBLIC"."ACCOUNTS"."ACCOUNT_OWNER", 
  "PUBLIC"."ACCOUNTS"."ACCOUNT_NAME", 
  "PUBLIC"."ACCOUNTS"."AMOUNT"
from "PUBLIC"."ACCOUNTS"
where "PUBLIC"."ACCOUNTS"."ID" in (
  1, 2
)
// Fetching transactions
assertEquals(6, fullaccess.fetch(TRANSACTIONS).size());
assertEquals(5, restricted.fetch(TRANSACTIONS).size());

Generating the following SQL:

select 
  "PUBLIC"."TRANSACTIONS"."ID", 
  "PUBLIC"."TRANSACTIONS"."ACCOUNT_ID", 
  "PUBLIC"."TRANSACTIONS"."AMOUNT"
from "PUBLIC"."TRANSACTIONS"
---------------------------------------
select 
  "PUBLIC"."TRANSACTIONS"."ID", 
  "PUBLIC"."TRANSACTIONS"."ACCOUNT_ID", 
  "PUBLIC"."TRANSACTIONS"."AMOUNT"
from "PUBLIC"."TRANSACTIONS"
where "PUBLIC"."TRANSACTIONS"."ACCOUNT_ID" in (
  1, 2
)
// Fetching John's accounts
assertEquals(asList(1, 3), fullaccess.fetchValues(
    select(ACCOUNTS.ID)
    .from(ACCOUNTS)
    .where(ACCOUNTS.ACCOUNT_OWNER.eq("John"))
    .orderBy(1)
));
assertEquals(asList(1   ), restricted.fetchValues(
    select(ACCOUNTS.ID)
    .from(ACCOUNTS)
    .where(ACCOUNTS.ACCOUNT_OWNER.eq("John"))
    .orderBy(1)
));

Generating the following SQL:

select "PUBLIC"."ACCOUNTS"."ID"
from "PUBLIC"."ACCOUNTS"
where "PUBLIC"."ACCOUNTS"."ACCOUNT_OWNER" = 'John'
order by 1 asc
---------------------------------------
select "PUBLIC"."ACCOUNTS"."ID"
from "PUBLIC"."ACCOUNTS"
where "PUBLIC"."ACCOUNTS"."ACCOUNT_OWNER" = 'John'
and "PUBLIC"."ACCOUNTS"."ID" in (
  1, 2
)
order by 1 asc
// Fetching John's accounts via an aliased table
Accounts a = ACCOUNTS.as("a");
assertEquals(asList(1, 3), fullaccess.fetchValues(
    select(a.ID)
    .from(a)
    .where(a.ACCOUNT_OWNER.eq("John"))
    .orderBy(1)
));
assertEquals(asList(1   ), restricted.fetchValues(
    select(a.ID)
    .from(a)
    .where(a.ACCOUNT_OWNER.eq("John"))
    .orderBy(1)
));

Generating the following SQL:

select "a"."ID"
from "PUBLIC"."ACCOUNTS" "a"
where "a"."ACCOUNT_OWNER" = 'John'
order by 1 asc
---------------------------------------
select "a"."ID"
from "PUBLIC"."ACCOUNTS" "a"
where "a"."ACCOUNT_OWNER" = 'John'
and "a"."ID" in (
  1, 2
)
order by 1 asc
// Fetching John's transactions
Transactions t = TRANSACTIONS.as("t");
assertEquals(asList(1, 2, 6), fullaccess.fetchValues(
    select(t.ID)
    .from(t)
    .where(t.ACCOUNT_ID.in(
        select(a.ID)
        .from(a)
        .where(a.ACCOUNT_OWNER.eq("John"))
    ))
    .orderBy(1)
));
assertEquals(asList(1, 2   ), restricted.fetchValues(
    select(t.ID)
    .from(t)
    .where(t.ACCOUNT_ID.in(
        select(a.ID)
        .from(a)
        .where(a.ACCOUNT_OWNER.eq("John"))
    ))
    .orderBy(1)
));

Generating the following SQL:

select "t"."ID"
from "PUBLIC"."TRANSACTIONS" "t"
where "t"."ACCOUNT_ID" in (
  select "a"."ID"
  from "PUBLIC"."ACCOUNTS" "a"
  where "a"."ACCOUNT_OWNER" = 'John'
)
order by 1 asc
---------------------------------------
select "t"."ID"
from "PUBLIC"."TRANSACTIONS" "t"
where "t"."ACCOUNT_ID" in (
  select "a"."ID"
  from "PUBLIC"."ACCOUNTS" "a"
  where "a"."ACCOUNT_OWNER" = 'John'
  and "a"."ID" in (
    1, 2
  )
)
and "t"."ACCOUNT_ID" in (
  1, 2
)
order by 1 asc

Conclusion

The above examples have shown how row-level security can be implemented relatively easily using jOOQ’s VisitListener, a very powerful Service Provider Interface for client-side SQL AST transformation.

The applications don’t stop there. It is easy to see how you can implement a VisitListener that throws an exception every time you run a DML query that does not have a WHERE clause.


Another application would be to replace a table by a similar table, whenever a certain condition is true.

And the best thing is: With jOOQ, you don’t need to parse SQL in order to transform it (which is extremely hard, depending on the SQL dialect). You already manually build an Abstract Syntax Tree using jOOQ’s fluent API, so you get all these features for free.

So: Happy SQL transformation!

jOOQ: The best way to write SQL in Java

Appendix: Full code

import static java.util.Arrays.asList;
import static org.jooq.Clause.DELETE;
import static org.jooq.Clause.DELETE_DELETE;
import static org.jooq.Clause.DELETE_WHERE;
import static org.jooq.Clause.INSERT;
import static org.jooq.Clause.SELECT;
import static org.jooq.Clause.SELECT_FROM;
import static org.jooq.Clause.SELECT_WHERE;
import static org.jooq.Clause.TABLE_ALIAS;
import static org.jooq.Clause.UPDATE;
import static org.jooq.Clause.UPDATE_UPDATE;
import static org.jooq.Clause.UPDATE_WHERE;
import static org.jooq.test.h2.generatedclasses.Tables.ACCOUNTS;
import static org.jooq.test.h2.generatedclasses.Tables.TRANSACTIONS;

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

import org.jooq.Clause;
import org.jooq.Condition;
import org.jooq.Field;
import org.jooq.Operator;
import org.jooq.QueryPart;
import org.jooq.Table;
import org.jooq.VisitContext;
import org.jooq.impl.DSL;
import org.jooq.impl.DefaultVisitListener;

@SuppressWarnings("unchecked")
public class AccountIDFilter extends DefaultVisitListener {

    final Integer[] ids;

    public AccountIDFilter(Integer... ids) {
        this.ids = ids;
    }

    void push(VisitContext context) {
        conditionStack(context).push(new ArrayList<>());
        whereStack(context).push(false);
    }

    void pop(VisitContext context) {
        whereStack(context).pop();
        conditionStack(context).pop();
    }

    Deque<List<Condition>> conditionStack(VisitContext context) {
        Deque<List<Condition>> data = (Deque<List<Condition>>) context.data("conditions");

        if (data == null) {
            data = new ArrayDeque<>();
            context.data("conditions", data);
        }

        return data;
    }

    Deque<Boolean> whereStack(VisitContext context) {
        Deque<Boolean> data = (Deque<Boolean>) context.data("predicates");

        if (data == null) {
            data = new ArrayDeque<>();
            context.data("predicates", data);
        }

        return data;
    }

    List<Condition> conditions(VisitContext context) {
        return conditionStack(context).peek();
    }

    boolean where(VisitContext context) {
        return whereStack(context).peek();
    }

    void where(VisitContext context, boolean value) {
        whereStack(context).pop();
        whereStack(context).push(value);
    }

    <E> void pushConditions(VisitContext context, Table<?> table, Field<E> field, E... values) {

        // Check if we're visiting the given table
        if (context.queryPart() == table) {
            List<Clause> clauses = clauses(context);

            // ... and if we're in the context of the current subselect's
            // FROM clause
            if (clauses.contains(SELECT_FROM) ||
                clauses.contains(UPDATE_UPDATE) ||
                clauses.contains(DELETE_DELETE)) {

                // If we're declaring a TABLE_ALIAS... (e.g. "T_BOOK" as "b")
                if (clauses.contains(TABLE_ALIAS)) {
                    QueryPart[] parts = context.queryParts();

                    // ... move up the QueryPart visit path to find the
                    // defining aliased table, and extract the aliased
                    // field from it. (i.e. the "b" reference)
                    for (int i = parts.length - 2; i >= 0; i--) {
                        if (parts[i] instanceof Table) {
                            field = ((Table<?>) parts[i]).field(field);
                            break;
                        }
                    }
                }

                // Push a condition for the field of the (potentially aliased) table
                conditions(context).add(field.in(values));
            }
        }
    }

    /**
     * Retrieve all clauses for the current subselect level, starting with
     * the last {@link Clause#SELECT}.
     */
    List<Clause> clauses(VisitContext context) {
        List<Clause> result = asList(context.clauses());
        int index = result.lastIndexOf(SELECT);

        if (index > 0)
            return result.subList(index, result.size() - 1);
        else
            return result;
    }

    @Override
    public void clauseStart(VisitContext context) {

        // Enter a new SELECT clause / nested select, or DML statement
        if (context.clause() == SELECT ||
            context.clause() == UPDATE ||
            context.clause() == DELETE ||
            context.clause() == INSERT) {
            push(context);
        }
    }

    @Override
    public void clauseEnd(VisitContext context) {

        // Append all collected predicates to the WHERE clause if any
        if (context.clause() == SELECT_WHERE ||
            context.clause() == UPDATE_WHERE ||
            context.clause() == DELETE_WHERE) {
            List<Condition> conditions = conditions(context);

            if (conditions.size() > 0) {
                context.context()
                       .formatSeparator()
                       .keyword(where(context) ? "and" : "where")
                       .sql(' ');

                context.context().visit(DSL.condition(Operator.AND, conditions));
            }
        }

        // Leave a SELECT clause / nested select, or DML statement
        if (context.clause() == SELECT ||
            context.clause() == UPDATE ||
            context.clause() == DELETE ||
            context.clause() == INSERT) {
            pop(context);
        }
    }

    @Override
    public void visitEnd(VisitContext context) {
        pushConditions(context, ACCOUNTS, ACCOUNTS.ID, ids);
        pushConditions(context, TRANSACTIONS, TRANSACTIONS.ACCOUNT_ID, ids);

        // Check if we're rendering any condition within the WHERE clause
        // In this case, we can be sure that jOOQ will render a WHERE keyword
        if (context.queryPart() instanceof Condition) {
            List<Clause> clauses = clauses(context);

            if (clauses.contains(SELECT_WHERE) ||
                clauses.contains(UPDATE_WHERE) ||
                clauses.contains(DELETE_WHERE)) {
                where(context, true);
            }
        }
    }
}

Use this Neat Window Function Trick to Calculate Time Differences in a Time Series


Whenever you feel that itch…

Can’t I calculate this with SQL?

The answer is: Yes you can! And you should! Let’s see how…

Calculating time differences between rows

Let’s consider the following table containing timestamps (e.g. in a log database). We’re using PostgreSQL syntax for this:

CREATE TABLE timestamps (
  ts timestamp
);

INSERT INTO timestamps VALUES
  ('2015-05-01 12:15:23.0'),
  ('2015-05-01 12:15:24.0'),
  ('2015-05-01 12:15:27.0'),
  ('2015-05-01 12:15:31.0'),
  ('2015-05-01 12:15:40.0'),
  ('2015-05-01 12:15:55.0'),
  ('2015-05-01 12:16:01.0'),
  ('2015-05-01 12:16:03.0'),
  ('2015-05-01 12:16:04.0'),
  ('2015-05-01 12:16:04.0');

Obviously, you’ll be adding constraints and indexes, etc. Now, let’s assume that each individual timestamp represents an event in your system, and you’d like to keep track of how long ago the previous event happened. I.e. you’d like the following result:

ts                   delta
-------------------------------
2015-05-01 12:15:23
2015-05-01 12:15:24  00:00:01
2015-05-01 12:15:27  00:00:03
2015-05-01 12:15:31  00:00:04
2015-05-01 12:15:40  00:00:09
2015-05-01 12:15:55  00:00:15
2015-05-01 12:16:01  00:00:06
2015-05-01 12:16:03  00:00:02
2015-05-01 12:16:04  00:00:01
2015-05-01 12:16:04  00:00:00

In other words

  • ts1 (12:15:23) + delta (00:00:01) = ts2 (12:15:24)
  • ts2 (12:15:24) + delta (00:00:03) = ts3 (12:15:27)

This can be achieved very easily with the LAG() window function:

SELECT 
  ts, 
  ts - lag(ts, 1) OVER (ORDER BY ts) delta
FROM timestamps
ORDER BY ts;

The above reads simply:

Give me the difference between the ts value of the current row and the ts value of the row that “lags” behind this row by one, with rows ordered by ts.

Easy, right? With LAG() you can actually access any row from another row within a “sliding window” by simply specifying the lag index.
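For instance, to compare each timestamp with the one two rows back, simply pass 2 as the lag index:

SELECT 
  ts, 
  ts - lag(ts, 2) OVER (ORDER BY ts) delta2
FROM timestamps
ORDER BY ts;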

We’ve already described this wonderful window function in a previous blog post.

Bonus: A running total interval

In addition to the difference between this timestamp and the previous one, we might be interested in the total difference between this timestamp and the first timestamp. This may sound like a running total (see our previous article about running totals using SQL), but it can be calculated much more easily using FIRST_VALUE() – a “cousin” of LAG():

SELECT 
  ts, 
  ts - lag(ts, 1) OVER w delta,
  ts - first_value(ts) OVER w total
FROM timestamps
WINDOW w AS (ORDER BY ts)
ORDER BY ts;

… the above query then yields

ts                   delta     total
---------------------------------------
2015-05-01 12:15:23            00:00:00
2015-05-01 12:15:24  00:00:01  00:00:01
2015-05-01 12:15:27  00:00:03  00:00:04
2015-05-01 12:15:31  00:00:04  00:00:08
2015-05-01 12:15:40  00:00:09  00:00:17
2015-05-01 12:15:55  00:00:15  00:00:32
2015-05-01 12:16:01  00:00:06  00:00:38
2015-05-01 12:16:03  00:00:02  00:00:40
2015-05-01 12:16:04  00:00:01  00:00:41
2015-05-01 12:16:04  00:00:00  00:00:41

Extra bonus: The total since a “reset” event

We can take this as far as we want. Let’s assume that we want to reset the total from time to time:

CREATE TABLE timestamps (
  ts timestamp,
  event varchar(50)
);

INSERT INTO timestamps VALUES
  ('2015-05-01 12:15:23.0', null),
  ('2015-05-01 12:15:24.0', null),
  ('2015-05-01 12:15:27.0', 'reset'),
  ('2015-05-01 12:15:31.0', null),
  ('2015-05-01 12:15:40.0', null),
  ('2015-05-01 12:15:55.0', 'reset'),
  ('2015-05-01 12:16:01.0', null),
  ('2015-05-01 12:16:03.0', null),
  ('2015-05-01 12:16:04.0', null),
  ('2015-05-01 12:16:04.0', null);

We can now run the following query:

SELECT
  ts, 
  ts - lag(ts, 1) 
       OVER (ORDER BY ts) delta,
  ts - first_value(ts) 
       OVER (PARTITION BY c ORDER BY ts) total
FROM (
  SELECT 
    COUNT(*) FILTER (WHERE EVENT = 'reset') 
             OVER (ORDER BY ts) c,
    ts
  FROM timestamps
) timestamps
ORDER BY ts;

… to produce

ts                   delta     total
---------------------------------------
2015-05-01 12:15:23            00:00:00
2015-05-01 12:15:24  00:00:01  00:00:01
2015-05-01 12:15:27  00:00:03  00:00:00 <-- reset
2015-05-01 12:15:31  00:00:04  00:00:04
2015-05-01 12:15:40  00:00:09  00:00:13
2015-05-01 12:15:55  00:00:15  00:00:00 <-- reset
2015-05-01 12:16:01  00:00:06  00:00:06
2015-05-01 12:16:03  00:00:02  00:00:08
2015-05-01 12:16:04  00:00:01  00:00:09
2015-05-01 12:16:04  00:00:00  00:00:09

The beautiful part is in the derived table:

  SELECT 
    COUNT(*) FILTER (WHERE EVENT = 'reset') 
             OVER (ORDER BY ts) c,
    ts
  FROM timestamps

This derived table just adds the “partition” to each set of timestamps given the most recent “reset” event. The result of the above subquery is:

c  ts
----------------------
0  2015-05-01 12:15:23
0  2015-05-01 12:15:24
1  2015-05-01 12:15:27 <-- reset
1  2015-05-01 12:15:31
1  2015-05-01 12:15:40
2  2015-05-01 12:15:55 <-- reset
2  2015-05-01 12:16:01
2  2015-05-01 12:16:03
2  2015-05-01 12:16:04
2  2015-05-01 12:16:04

As you can see, the COUNT(*) window function counts all the previous “reset” events, ordered by timestamp. This information can then be used as the PARTITION for the FIRST_VALUE() window function in order to find the first timestamp in each partition, i.e. at the time of the most recent “reset” event:

  ts - first_value(ts) 
       OVER (PARTITION BY c ORDER BY ts) total

Conclusion

It’s almost a running gag on this blog to say that…

There was SQL before window functions and SQL after window functions

Window functions are extremely powerful and they’re a part of the SQL standard, supported in most commercial databases, in PostgreSQL, in Firebird 3.0, and in CUBRID. If you aren’t using them already, start using them today!

If you’ve liked this article, find out more about window functions in our many other articles on this blog.

PostgreSQL’s Best-Kept Secret, and how to Use it with jOOQ


PostgreSQL has a lot of secret data types. In recent times, PostgreSQL’s JSON and JSONB support was hyped as being the NoSQL on SQL secret (e.g. as advertised by ToroDB) that allows you to get the best out of both worlds. But there are many other useful data types, among them the range type.

How does the range type work?

Ranges are very useful for things like age, date, price intervals etc. Let’s assume the following table:

CREATE TABLE age_categories (
  name VARCHAR(50),
  ages INT4RANGE
);

We can now fill the above table as such:

INSERT INTO age_categories
VALUES ('Foetus',   int4range(-1, 0)),
       ('Newborn',  int4range(0, 1)),
       ('Infant',   int4range(1, 2)),
       ('Kid',      int4range(2, 12)),
       ('Teenager', int4range(12, 20)),
       ('Tween',    int4range(20, 30)),
       ('Adult',    int4range(30, 60)),
       ('Senior',   int4range(60, 90)),
       ('Ancient',  int4range(90, 999));

And query it, e.g. by taking my age:

SELECT name
FROM age_categories
WHERE range_contains_elem(ages, 33);

… yielding

name
-----
Adult

There are a lot of other useful functions involved with range calculations. For instance, we can get the age span of a set of categories. Let’s assume we want to have the age span of Kid, Teenager, Senior. Let’s query:

SELECT int4range(min(lower(ages)), max(upper(ages)))
FROM age_categories
WHERE name IN ('Kid', 'Teenager', 'Senior');

… yielding

int4range
---------
[2,90)

And now, let’s go back to fetching all the categories that are within that range:

SELECT name
FROM age_categories
WHERE range_overlaps(ages, (
  SELECT int4range(min(lower(ages)), max(upper(ages)))
  FROM age_categories
  WHERE name IN ('Kid', 'Teenager', 'Senior')
));

… yielding

name
--------
Kid
Teenager
Tween
Adult
Senior

All of this could have been implemented with values contained in two separate columns, but using ranges is just much more expressive for range arithmetic.
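Just for comparison, here’s what the simple containment query from the beginning might look like against a hypothetical two-column design (with an inclusive age_from and an exclusive age_to column):

SELECT name
FROM age_categories
WHERE age_from <= 33
AND 33 < age_to;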

How to use these types with jOOQ?

jOOQ doesn’t include built-in support for these advanced data types, but it allows you to bind these data types to your own, custom representation.

A good representation of PostgreSQL’s range types in Java would be jOOλ’s org.jooq.lambda.tuple.Range type, but you could also simply use int[] or Map.Entry for ranges. When we’re using jOOλ’s Range type, the idea is to be able to run the following statements using jOOQ:

// Assuming this static import:
import static org.jooq.lambda.tuple.Tuple.*;

DSL.using(configuration)
   .insertInto(AGE_CATEGORIES)
   .columns(AGE_CATEGORIES.NAME, AGE_CATEGORIES.AGES)
   .values("Foetus",   range(-1, 0))
   .values("Newborn",  range(0, 1))
   .values("Infant",   range(1, 2))
   .values("Kid",      range(2, 12))
   .values("Teenager", range(12, 20))
   .values("Tween",    range(20, 30))
   .values("Adult",    range(30, 60))
   .values("Senior",   range(60, 90))
   .values("Ancient",  range(90, 999))
   .execute();

And querying…

DSL.using(configuration)
   .select(AGE_CATEGORIES.NAME)
   .from(AGE_CATEGORIES)
   .where(rangeContainsElem(AGE_CATEGORIES.AGES, 33))
   .fetch();

DSL.using(configuration)
   .select(AGE_CATEGORIES.NAME)
   .from(AGE_CATEGORIES)
   .where(rangeOverlaps(AGE_CATEGORIES.AGES,
      select(int4range(min(lower(AGE_CATEGORIES.AGES)),
                       max(upper(AGE_CATEGORIES.AGES))))
     .from(AGE_CATEGORIES)
     .where(AGE_CATEGORIES.NAME.in(
       "Kid", "Teenager", "Senior"
     ))
   ))
   .fetch();

As always, the idea with jOOQ is that the SQL you want to get as output is the SQL you want to write in Java, type safely. In order to be able to write the above, we will need to implement 1-2 missing pieces. First off, we’ll need to create a data type binding (org.jooq.Binding) as described in this section of the manual. The binding can be written as such, using the following Converter:

public class Int4RangeConverter 
implements Converter<Object, Range<Integer>> {
    private static final Pattern PATTERN = 
        Pattern.compile("\\[(.*?),(.*?)\\)");

    @Override
    public Range<Integer> from(Object t) {
        if (t == null)
            return null;

        Matcher m = PATTERN.matcher("" + t);
        if (m.find())
            return Tuple.range(
                Integer.valueOf(m.group(1)), 
                Integer.valueOf(m.group(2)));

        throw new IllegalArgumentException(
            "Unsupported range : " + t);
    }

    @Override
    public Object to(Range<Integer> u) {
        return u == null 
            ? null 
            : "[" + u.v1 + "," + u.v2 + ")";
    }

    @Override
    public Class<Object> fromType() {
        return Object.class;
    }

    @SuppressWarnings({ "unchecked", "rawtypes" })
    @Override
    public Class<Range<Integer>> toType() {
        return (Class) Range.class;
    }
}

… and the Converter can then be re-used in a Binding:

public class PostgresInt4RangeBinding 
implements Binding<Object, Range<Integer>> {

    @Override
    public Converter<Object, Range<Integer>> converter() {
        return new Int4RangeConverter();
    }

    @Override
    public void sql(BindingSQLContext<Range<Integer>> ctx) throws SQLException {
        ctx.render()
           .visit(DSL.val(ctx.convert(converter()).value()))
           .sql("::int4range");
    }

    // ...
}

The important thing in the binding is just that every bind variable needs to be encoded in PostgreSQL’s string format for range types (i.e. [lower, upper)) and cast explicitly to ?::int4range, that’s all.
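In other words, the SQL rendered for one of the above inserts would look something like this:

INSERT INTO age_categories (name, ages)
VALUES (?, ?::int4range)
-- ... where the second bind variable is set to the string '[30,60)'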

You can then configure your code generator to use those types, e.g. on all columns that are called [xxx]_RANGE:

<customType>
    <name>com.example.PostgresInt4RangeBinding</name>
    <type>org.jooq.lambda.tuple.Range&lt;Integer></type>
    <binding>com.example.PostgresInt4RangeBinding</binding>
</customType>
<forcedType>
    <name>com.example.PostgresInt4RangeBinding</name>
    <expression>.*?_RANGE</expression>
</forcedType>

The last missing pieces now are the functions that you need to compare ranges, i.e.:

  • rangeContainsElem()
  • rangeOverlaps()
  • int4range()
  • lower()
  • upper()

These are written quickly:

static <T extends Comparable<T>> Condition 
    rangeContainsElem(Field<Range<T>> f1, T e) {
    return DSL.condition("range_contains_elem({0}, {1})", f1, val(e));
}

static <T extends Comparable<T>> Condition 
    rangeOverlaps(Field<Range<T>> f1, Range<T> f2) {
    return DSL.condition("range_overlaps({0}, {1})", f1, val(f2, f1.getDataType()));
}

Conclusion

Writing an extension for jOOQ data types takes a bit of time and effort, but it is really easy to achieve and will allow you to write very powerful queries in a type safe way. Once you’ve set up all the data types and the bindings, your generated source code will reflect the actual data type from your database and you will be able to write powerful queries in a type safe way directly in Java. Let’s consider again the plain SQL and the jOOQ version:

SQL

SELECT name
FROM age_categories
WHERE range_overlaps(ages, (
  SELECT int4range(min(lower(ages)), max(upper(ages)))
  FROM age_categories
  WHERE name IN ('Kid', 'Teenager', 'Senior')
));

jOOQ

DSL.using(configuration)
   .select(AGE_CATEGORIES.NAME)
   .from(AGE_CATEGORIES)
   .where(rangeOverlaps(AGE_CATEGORIES.AGES,
      select(int4range(min(lower(AGE_CATEGORIES.AGES)),
                       max(upper(AGE_CATEGORIES.AGES))))
     .from(AGE_CATEGORIES)
     .where(AGE_CATEGORIES.NAME.in(
       "Kid", "Teenager", "Senior"
     ))
   ))
   .fetch();

More information about data type bindings can be found in the jOOQ manual.