PostgreSQL’s Best-Kept Secret, and how to Use it with jOOQ

PostgreSQL has a lot of secret data types. In recent times, PostgreSQL’s JSON and JSONB support was hyped as being the NoSQL on SQL secret (e.g. as advertised by ToroDB) that allows you to get the best out of both worlds. But there are many other useful data types, among which the range type.

How does the range type work?

Ranges are very useful for things like age, date, price intervals etc. Let’s assume the following table:

CREATE TABLE age_categories (
  name VARCHAR(50),
  ages INT4RANGE
);

We can now fill the above table as such:

INSERT INTO age_categories
VALUES ('Foetus',   int4range(-1, 0)),
       ('Newborn',  int4range(0, 1)),
       ('Infant',   int4range(1, 2)),
       ('Kid',      int4range(2, 12)),
       ('Teenager', int4range(12, 20)),
       ('Tween',    int4range(20, 30)),
       ('Adult',    int4range(30, 60)),
       ('Senior',   int4range(60, 90)),
       ('Ancient',  int4range(90, 999));

And query it, e.g. by taking my age:

SELECT name
FROM age_categories
WHERE range_contains_elem(ages, 33);

… yielding

name
-----
Adult

There are a lot of other useful functions involved with range calculations. For instance, we can get the age span of a set of categories. Let’s assume we want to have the age span of Kid, Teenager, Senior. Let’s query:

SELECT int4range(min(lower(ages)), max(upper(ages)))
FROM age_categories
WHERE name IN ('Kid', 'Teenager', 'Senior');

… yielding

int4range
---------
[2,90]

And now, let’s go back to fetching all the categories that are within that range:

SELECT name
FROM age_categories
WHERE range_overlaps(ages, (
  SELECT int4range(min(lower(ages)), max(upper(ages)))
  FROM age_categories
  WHERE name IN ('Kid', 'Teenager', 'Senior')
));

… yielding

name
--------
Kid
Teenager
Tween
Adult
Senior

All of this could have been implemented with values contained in two separate columns, but using ranges is just much more expressive for range arithmetic.

How to use these types with jOOQ?

jOOQ doesn’t include built-in support for these advanced data types, but it allows you to bind these data types to your own, custom representation.

A good representation of PostgreSQL’s range types in Java would be jOOλ’s org.jooq.lambda.tuple.Range type, but you could also simply use int[] or Map.Entry for ranges. When we’re using jOOλ’s Range type, the idea is to be able to run the following statements using jOOQ:

// Assuming this static import:
import static org.jooq.lambda.tuple.Tuple.*;

DSL.using(configuration)
   .insertInto(AGE_CATEGORIES)
   .columns(AGE_CATEGORIES.NAME, AGE_CATEGORIES.AGES)
   .values("Foetus",   range(-1, 0))
   .values("Newborn",  range(0, 1))
   .values("Infant",   range(1, 2))
   .values("Kid",      range(2, 12))
   .values("Teenager", range(12, 20))
   .values("Tween",    range(20, 30))
   .values("Adult",    range(30, 60))
   .values("Senior",   range(60, 90))
   .values("Ancient",  range(90, 999))
   .execute();

And querying…

DSL.using(configuration)
   .select(AGE_CATEGORIES.NAME)
   .from(AGE_CATEGORIES)
   .where(rangeContainsElem(AGE_CATEGORIES.AGES, 33))
   .fetch();

DSL.using(configuration)
   .select(AGE_CATEGORIES.NAME)
   .where(rangeOverlaps(AGE_CATEGORIES.AGES,
      select(int4range(min(lower(AGE_CATEGORIES.AGES)),
                       max(upper(AGE_CATEGORIES.AGES))))
     .from(AGE_CATEGORIES)
     .where(AGE_CATEGORIES.NAME.in(
       "Kid", "Teenager", "Senior"
     ))
   ))
   .fetch();

As always, the idea with jOOQ is that the SQL you want to get as output is the SQL you want to write in Java, type safely. In order to be able to write the above, we will need to implement 1-2 missing pieces. First off, we’ll need to create a data type binding (org.jooq.Binding) as described in this section of the manual. The binding can be written as such, using the following Converter:

public class Int4RangeConverter 
implements Converter<Object, Range<Integer>> {
    private static final Pattern PATTERN = 
        Pattern.compile("\\[(.*?),(.*?)\\)");

    @Override
    public Range<Integer> from(Object t) {
        if (t == null)
            return null;

        Matcher m = PATTERN.matcher("" + t);
        if (m.find())
            return Tuple.range(
                Integer.valueOf(m.group(1)), 
                Integer.valueOf(m.group(2)));

        throw new IllegalArgumentException(
            "Unsupported range : " + t);
    }

    @Override
    public Object to(Range<Integer> u) {
        return u == null 
            ? null 
            : "[" + u.v1 + "," + u.v2 + ")";
    }

    @Override
    public Class<Object> fromType() {
        return Object.class;
    }

    @SuppressWarnings({ "unchecked", "rawtypes" })
    @Override
    public Class<Range<Integer>> toType() {
        return (Class) Range.class;
    }
}

… and the Converter can then be re-used in a Binding:

public class PostgresInt4RangeBinding 
implements Binding<Object, Range<Integer>> {

    @Override
    public Converter<Object, Range<Integer>> converter() {
        return new Int4RangeConverter();
    }

    @Override
    public void sql(BindingSQLContext<Range<Integer>> ctx) throws SQLException {
        ctx.render()
           .visit(DSL.val(ctx.convert(converter()).value()))
           .sql("::int4range");
    }

    // ...
}

The important thing in the binding is just that every bind variable needs to be encoded in PostgreSQL’s string format for range types (i.e. [lower, upper)) and cast explicitly to ?::int4range, that’s all.

You can then configure your code generator to use those types, e.g. on all columns that are called [xxx]_RANGE

<customType>
    <name>com.example.PostgresInt4RangeBinding</name>
    <type>org.jooq.lambda.tuple.Range&lt;Integer></type>
    <binding>com.example.PostgresInt4RangeBinding</binding>
</customType>
<forcedType>
    <name>com.example.PostgresInt4RangeBinding</name>
    <expression>.*?_RANGE</expression>
</forcedType>

The last missing thing now are the functions that you need to compare ranges, i.e.:

  • rangeContainsElem()
  • rangeOverlaps()
  • int4range()
  • lower()
  • upper()

These are written quickly:

static <T extends Comparable<T>> Condition 
    rangeContainsElem(Field<Range<T>> f1, T e) {
    return DSL.condition("range_contains_elem({0}, {1})", f1, val(e));
}

static <T extends Comparable<T>> Condition 
    rangeOverlaps(Field<Range<T>> f1, Range<T> f2) {
    return DSL.condition("range_overlaps({0}, {1})", f1, val(f2, f1.getDataType()));
}

Conclusion

Writing an extension for jOOQ data types takes a bit of time and effort, but it is really easy to achieve and will allow you to write very powerful queries in a type safe way. Once you’ve set up all the data types and the bindings, your generated source code will reflect the actual data type from your database and you will be able to write powerful queries in a type safe way directly in Java. Let’s consider again the plain SQL and the jOOQ version:

SQL

SELECT name
FROM age_categories
WHERE range_overlaps(ages, (
  SELECT int4range(min(lower(ages)), max(upper(ages)))
  FROM age_categories
  WHERE name IN ('Kid', 'Teenager', 'Senior')
));

jOOQ

DSL.using(configuration)
   .select(AGE_CATEGORIES.NAME)
   .where(rangeOverlaps(AGE_CATEGORIES.AGES,
      select(int4range(min(lower(AGE_CATEGORIES.AGES)),
                       max(upper(AGE_CATEGORIES.AGES))))
     .from(AGE_CATEGORIES)
     .where(AGE_CATEGORIES.NAME.in(
       "Kid", "Teenager", "Senior"
     ))
   ))
   .fetch();

More information about data type bindings can be found in the jOOQ manual.

How SQL GROUP BY Should Have Been Designed – Like Neo4j’s Implicit GROUP BY

In the recent past, we’ve explained the syntactic implications of the SQL GROUP BY clause. If you haven’t already, you should read our article “Do You Really Understand SQL’s GROUP BY and HAVING clauses?“.

In essence, adding a GROUP BY clause to your query transforms your query on very implicit levels. The following reminder summarises the previous article:

  • Only column expressions referenced in the GROUP BY clause, or aggregations of other column expressions may appear in the SELECT clause
  • Aggregations without explicit GROUP BY clause imply the “grand total” GROUP BY () clause
  • Some databases (e.g. MySQL, and to some extent: the SQL standard) don’t follow these rules and allow for arbitrary column expressions (or at least functionally dependent column expressions) in the SELECT clause

How SQL GROUP BY should have been designed

There is another way of looking at GROUP BY, and it has been implemented in the equally fascinating, beautiful, and weird Cypher query language (those are good attributes) as supported by the Neo4j graph database. This alternative (yet SQL inspired) query language probably deserves a whole blog post series on its own, but let’s focus on aggregation. Because aggregation is the primary use case for grouping.

(for the record, check out the Neo4j docs about aggregation for details)

A quick wrap-up to understand Cypher:

Consider this simple Cypher query:

MATCH (me:Person)-->(friend:Person)
                 -->(friend_of_friend:Person)
WHERE me.name = 'A'
RETURN count(DISTINCT friend_of_friend), 
       count(friend_of_friend)

Furthermore

  • Cypher
    (me:Person)-->(friend:Person)
               -->(friend_of_friend:Person)
    

    corresponds roughly to SQL

         Person AS me 
    JOIN Person AS friend 
      ON [ implicit equi-join predicate ]
    JOIN Person as friend_of_friend
      ON [ implicit equi-join predicate ]
    

Cypher’s way of writing JOIN is actually extremely useful and could also be applied to SQL. It is only a matter of time until someone will write a Cypher-to-SQL transformer that implements the syntax, at least as syntactic sugar for the equivalent ANSI equi-join notation.

Let’s investigate aggregation in Cypher

Here’s the query again:

MATCH (me:Person)-->(friend:Person)
                 -->(friend_of_friend:Person)
WHERE me.name = 'A'
RETURN count(DISTINCT friend_of_friend), 
       count(friend_of_friend)

So, in SQL terms, this is exactly the same as:

SELECT count(DISTINCT friend_of_friend), 
       count(friend_of_friend)
FROM   [ Persons ... ]
WHERE  me.name = 'A'

In other words, the same implicit grand total GROUP BY () is implied and all values are aggregated into a single row.

The next example from the Neo4j docs is more intriguing. This will count the number of nodes connected to a node n with name = 'A':

MATCH (n { name: 'A' })-->(x)
RETURN n, count(*)

Which is a shorter form for writing:

MATCH (n)-->(x)
WHERE n.name = 'A'
RETURN n, count(*)

This example will also perform aggregation, but this time with an implicit GROUP BY n clause. In SQL, you’d write something like:

SELECT   n.id, count(*)
FROM     n
JOIN     x
  ON     [ implicit equi-join predicate ]
WHERE    n.name = 'A'
GROUP BY n.id

The nice thing in Cypher is that the obvious GROUP BY clause (it can only be GROUP BY n.id) is implied. It doesn’t have to be written explicitly.

Takeaway for SQL

We’ve seen a couple of nice Cypher language features, especially the incredibly nice way to write “JOIN” (or rather graph traversal in Neo4j). But a much more obvious, low-hanging fruit with actual chances to make it into the SQL standard would be to make the SQL GROUP BY clause optional, and dependent on the SELECT clause using the following rules:

  • If SELECT contains no aggregation functions, there shall be no implied GROUP BY clause
  • If SELECT contains 1-N aggregation functions, there shall be an implied GROUP BY clause formed from the remaining columns
  • If SELECT contains only aggregation functions, the “grand total” GROUP BY () shall apply
  • An explicit GROUP BY clause will always be preferred to any implied GROUP BY clause

If any of you ISO / IEC committee members are reading this, this is on my wish list for a future SQL standard. And please, PostgreSQL. Implement this right away.

Liked this article?

Here’s some further reading about the SQL GROUP BY clause and aggregation: