Prevent SQL Injection with SQL Builders Like jOOQ

As long as we allow ourselves to write string-based dynamic SQL embedded in other programming languages like Java, we will have a certain risk of being vulnerable to SQL injection. That’s a fact. Don’t believe it? Check out this website exposing all vulnerabilities on Stack Overflow for PHP questions:

https://laurent22.github.io/so-injections

In a previous blog post, I’ve shown how fatal such a single vulnerability can be, if discovered. A lot of blog posts out there warn about the potential of attackers injecting a DROP DATABASE statement.

Also, everyone knows this famous xkcd:
https://xkcd.com/327

But in my opinion, a much more important threat is not immediate damage to your system, but data leakage. Using tools like sqlmap, every script kiddie can download your credit card information and other sensitive data from your database.

The best remedy: Avoid string-based embedded SQL

The best remedy is to always avoid string-based embedded SQL, whenever you can. I.e. try not to do things like this too often:

try (PreparedStatement s = c.prepareStatement(
    "SELECT first_name, last_name "
  + "FROM users "
  + "WHERE user_id = ? ")) {

    s.setInt(1, userId);
    try (ResultSet rs = s.executeQuery()) {
        ...
    }
}

Sure, there is currently nothing wrong with the above Java code. There is no vulnerability as we’re using a bind variable. But the risk of some developer not paying attention and accidentally adding a vulnerability when adding another predicate is too high.
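To make that risk concrete, here is a minimal sketch of how such an accidental vulnerability might creep in. The class name, the method names and the extra last_name predicate are hypothetical, and the code only builds the SQL text; nothing is executed against a database.

```java
// Hypothetical illustration: only the SQL text is built, nothing is executed.
final class ConcatenatedPredicate {

    // A careless "extra predicate": user input is concatenated into the SQL text.
    static String unsafeSql(String lastName) {
        return "SELECT first_name, last_name "
             + "FROM users "
             + "WHERE user_id = ? "
             + "AND last_name = '" + lastName + "'";
    }

    // The safe variant: the SQL text is constant, and the value would be
    // passed separately as a second bind variable.
    static String safeSql() {
        return "SELECT first_name, last_name "
             + "FROM users "
             + "WHERE user_id = ? "
             + "AND last_name = ?";
    }

    public static void main(String[] args) {
        // Attacker-controlled input changes the structure of the statement.
        System.out.println(unsafeSql("x' OR '1' = '1"));
        // The bind variable version is the same text for every input.
        System.out.println(safeSql());
    }
}
```

With the input above, the first statement ends in last_name = 'x' OR '1' = '1', so the attacker's OR has become part of the query itself, whereas the second statement never contains user input at all.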

Fine, so we’re not using string-based embedded SQL. But what are the alternatives?

Use a query DSL like jOOQ to build your dynamic SQL

APIs like jOOQ help you build SQL statements in a type safe, composable way as if the Java language actually understood SQL. The previous JDBC prepared statement now translates to the following:

Result<?> result =
ctx.select(USERS.FIRST_NAME, USERS.LAST_NAME)
   .from(USERS)
   // Bind variable embedded in statement
   .where(USERS.USER_ID.eq(userId))
   .fetch();

In jOOQ, the underlying JDBC PreparedStatement is created transparently and the userId bind variable is placed right in the middle of the statement so you don’t have to worry about the boring details. There’s no way you can have any accidental SQL injection vulnerability in such jOOQ API calls, because every SQL clause and expression is part of an expression tree that is managed by jOOQ.
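The idea of an expression tree that keeps values apart from SQL text can be illustrated with a toy model. This is a sketch under stated assumptions: TinyDsl, Eq and Rendered are hypothetical names invented for this example and have nothing to do with jOOQ's actual classes or rendering logic.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: these types are hypothetical and are NOT jOOQ's API.
final class TinyDsl {

    // A predicate node holds the column name and the value separately.
    record Eq(String column, Object value) {}

    // Rendering output: SQL text with placeholders, plus the bind values in order.
    record Rendered(String sql, List<Object> bindValues) {}

    // User-supplied values never reach the SQL text; each one becomes a "?".
    static Rendered render(String table, List<String> columns, List<Eq> predicates) {
        StringBuilder sql = new StringBuilder("SELECT ")
            .append(String.join(", ", columns))
            .append(" FROM ").append(table);
        List<Object> binds = new ArrayList<>();
        String separator = " WHERE ";
        for (Eq predicate : predicates) {
            sql.append(separator).append(predicate.column()).append(" = ?");
            binds.add(predicate.value());
            separator = " AND ";
        }
        return new Rendered(sql.toString(), binds);
    }
}
```

Even if someone passes "1 OR 1 = 1" as the value, it ends up in the list of bind values and not in the SQL string, which is the property that makes this style of API hard to misuse.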

And what’s best: The Java compiler can now type check your SQL statement to a certain degree. This is a huge benefit in terms of productivity and code quality.

Of course, SQL builders aren’t a perfect shield for SQL injection, as they usually expose some API to insert custom SQL strings. For instance, in jOOQ, you can write:

Result<?> result =
ctx.select(USERS.FIRST_NAME, USERS.LAST_NAME)
   .from(USERS)
   // Bind variable embedded in "plain SQL" string
   .where("user_id = ?", userId)
   .fetch();

This alternative API usage is just as convenient as the previous one. The bind variable is located right where it is needed (not applied later on a PreparedStatement), but of course this is again string-based embedded SQL, which is potentially vulnerable if you get it wrong.

The important thing here is that the string-based method is the exception, which is used only very rarely when you need a very advanced SQL feature that is not supported by jOOQ. By default, you’re using the “safe” approach without strings. And since jOOQ 3.9, you can also add a type checker that generates a compilation warning or error as soon as you’re using jOOQ’s “plain SQL” API. Read more about that here:
https://blog.jooq.org/2016/05/09/jsr-308-and-the-checker-framework-add-even-more-typesafety-to-jooq-3-9

Use views and stored procedures

The safest and least intrusive way to prevent such problems, of course, is not to use embedded SQL at all, but to use views and/or stored procedures instead. By moving and storing the SQL statements into the database, you get a few advantages in addition to the lack of SQLi vulnerability:

  • The database can type-check your statement in its entirety
  • You can easily reuse a statement in different applications, e.g. not all written in Java
  • You could even revoke grants to the underlying tables and grant access only to these views and stored procedures, which adds another layer of security to your system

If you’re writing PL/SQL in Oracle, the previous statement would be transformed to something like this:

FOR rec IN (
  SELECT first_name, last_name
  FROM users
  -- Bind variable again embedded in SQL statement
  WHERE user_id = l_user_id
) LOOP
  ...
END LOOP;

Just like with jOOQ, you can easily embed the SQL statement in the “host language”, except that this time, it isn’t Java but PL/SQL. There’s absolutely no way the above statement will ever be vulnerable to SQL injection, as the statement isn’t even a dynamic SQL statement anymore.

Again, you can also write dynamic SQL in PL/SQL (or in other databases’ stored procedure languages), e.g. by using EXECUTE IMMEDIATE, but the important thing is again that dynamic, string-based, embedded SQL is the exception, not the default. That’s the important thing here!

Now, if you do decide to use stored procedures, then jOOQ is again here to help you, as you can easily call that procedure from jOOQ, again in a type safe way. More about that in this blog post:
Painless Access from Java to PL/SQL Procedures with jOOQ

Beware! SQL isn’t the only language vulnerable to injection

Vlad Mihalcea displayed an equally important threat to JPA based applications. Scroll down in his blog post to find him mentioning JPQL injection:
A beginner’s guide to SQL injection and how you should prevent it

Yes, if you’re doing something as silly as:

public List<Post> getPostsByTitle(String title) {
    return doInJPA(entityManager -> {
        return entityManager.createQuery(
            "select p " +
            "from Post p " +
            "where" +
            "   p.title = '" + title + "'", Post.class)
        .getResultList();
    });
}

… then an attacker can inject any sort of JPQL code into the title variable. The possibilities are a bit more limited than with SQL injection, but it is still perfectly possible to read and dump the entire database (including credit card information, remember?) from such a vulnerability.

Again, if you’re doing this quite often, you should consider switching to a more type safe way to write JPQL (e.g. using the infamous Criteria API), or again switch to SQL and jOOQ.

Read Vlad’s post for more details:
A beginner’s guide to SQL injection and how you should prevent it

Conclusion

There are many external DSLs, like SQL, JPQL, XPath, regular expressions, and whatnot. Some of them are extremely powerful and are used to operate on sensitive data, which means that if you leak control of the language outside of your application, you’re very vulnerable.

Vulnerabilities mostly arise when you embed those external DSLs into Java in a string-based form. The best remedies are:

  • Use an internal DSL that models the external DSL instead
  • Keep the external DSL external, e.g. by using views and stored procedures

Both of these approaches work because they both avoid strings.

Use jOOQ to Read / Write Oracle PL/SQL RECORD Types

One of the biggest limitations when working with Oracle PL/SQL from Java is the lack of support for a variety of PL/SQL features through the JDBC interface. This lack of support is actually not limited to JDBC, but also extends to Oracle SQL. For instance, if you’re using the useful PL/SQL BOOLEAN type as such:

CREATE OR REPLACE FUNCTION yes RETURN boolean AS
BEGIN
  RETURN true;
END yes;
/

It would now be terrific if you could do this:

SELECT yes FROM dual;

But it’s not possible. You’ll be getting an error along the lines of the following:

ORA-06552: PL/SQL: Statement ignored
ORA-06553: PLS-382: expression is of wrong type
06552. 00000 -  "PL/SQL: %s"
*Cause:    
*Action:

It’s crazy to think that the Oracle SQL language still doesn’t support the SQL standard boolean type, which is so useful as I’ve shown in previous blog posts. Here’s where you can upvote the feature request:

https://community.oracle.com/ideas/2633

BOOLEAN isn’t the only “inaccessible” SQL feature

Now, there are a couple of other data types, which cannot be “bridged” to the SQL engine, and thus (for some reason only the OJDBC driver gods can fathom) cannot be “bridged” to a JDBC client. Among them: The very useful PL/SQL RECORD type.

Very often, you want to do this:

CREATE PACKAGE customers AS
  TYPE person IS RECORD (
    first_name VARCHAR2(50),
    last_name VARCHAR2(50)
  );
  
  FUNCTION get_customer(p_customer_id NUMBER) RETURN person;
END customers;
/

CREATE PACKAGE BODY customers AS
  FUNCTION get_customer(p_customer_id NUMBER) RETURN person IS
    v_person customers.person;
  BEGIN
    SELECT c.first_name, c.last_name
    INTO v_person
    FROM customer c
    WHERE c.customer_id = p_customer_id;
    
    RETURN v_person;
  END get_customer;
END customers;
/

(we’re running this on the SAKILA database).

As in any language with the least bit of sophistication, we can define “structs” or records in PL/SQL, which we can now frequently reuse. Everyone knows what a PERSON is and we can pass them around between procedures and functions.

For instance, in PL/SQL client code:

SET SERVEROUTPUT ON
DECLARE
  v_person customers.person;
BEGIN
  v_person := customers.get_customer(1);
  
  dbms_output.put_line(v_person.first_name);
  dbms_output.put_line(v_person.last_name);
END;
/

… which yields:

MARY
SMITH

What about JDBC client code?

In addition to PL/SQL BOOLEAN types, jOOQ 3.9 now finally supports PL/SQL RECORD types in stored procedures as well, at least in standalone calls that are not embedded in SQL statements. The jOOQ code generator will pick up all of these package-level PL/SQL record types and their structures and generate the boring boilerplate code for you. E.g. (simplified):

package org.jooq.demo.sakila.packages.customers.records;

public class PersonRecord extends UDTRecordImpl<PersonRecord> {
    public void   setFirstName(String value) { ... }
    public String getFirstName()             { ... }
    public void   setLastName(String value)  { ... }
    public String getLastName()              { ... }

    public PersonRecord() { ... }
    public PersonRecord(String firstName, String lastName) { ... }
}

Notice how jOOQ doesn’t really make any difference in its API between the generated code for an Oracle SQL OBJECT type or an Oracle PL/SQL RECORD type. They’re essentially the same thing (from a jOOQ API perspective).

More interestingly, what happened to the generated package and the function? This code is generated (simplified):

public class Customers extends PackageImpl {
    public static PersonRecord getCustomer(
        Configuration configuration, Long pCustomerId
    ) { ... }
}

That’s all! So all we now need to do is pass the ubiquitous jOOQ Configuration (which contains information such as SQLDialect or JDBC Connection) and the actual stored function parameter, the P_CUSTOMER_ID value, and we’re done!

This is how jOOQ client code might look:

PersonRecord person = Customers.getCustomer(configuration(), 1L);
System.out.println(person);

As you can see, this is just the same thing as the corresponding PL/SQL code. And the output of this println call is this:

SAKILA.CUSTOMERS.PERSON('MARY', 'SMITH')

A fully qualified RECORD declaration with schema, package, and type name.

How does it work?

Let’s turn on jOOQ’s built-in TRACE logging to see what jOOQ did behind the scenes:

Calling routine          : 
  DECLARE
    r1 "CUSTOMERS"."PERSON";
  BEGIN
    r1 := "CUSTOMERS"."GET_CUSTOMER"("P_CUSTOMER_ID" => ?);
    ? := r1."FIRST_NAME";
    ? := r1."LAST_NAME";
  END;
Binding variable 1       : 1 (class java.lang.Long)
Registering variable 2   : class java.lang.String
Registering variable 3   : class java.lang.String
Fetched OUT parameters   : +-----------------+
                         : |RETURN_VALUE     |
                         : +-----------------+
                         : |('MARY', 'SMITH')|
                         : +-----------------+

So, jOOQ usually doesn’t use JDBC’s very limited escape syntax to call stored procedures. Instead, it produces an anonymous PL/SQL block with a local variable of type CUSTOMERS.PERSON, i.e. of our RECORD type. The function’s return value is then assigned to this local variable, and the local variable is destructured into its individual parts.

In the TRACE log, you can see the individual bind variables, i.e. there’s an IN variable at index 1 of type Long, and two OUT variables of type String at indexes 2 and 3.
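To illustrate the destructuring idea, here is a minimal sketch of how such an anonymous block could be assembled from record metadata. RecordCallSketch and its methods are hypothetical names made up for this example; this is not jOOQ's actual implementation, just plain string building that reproduces the block shown in the TRACE log.

```java
import java.util.List;

// Hypothetical sketch, not jOOQ's actual implementation.
final class RecordCallSketch {

    // Builds an anonymous PL/SQL block that calls a function returning a RECORD
    // and exposes each record attribute as its own OUT bind variable.
    static String anonymousBlock(String pkg, String type, String function,
                                 String parameter, List<String> attributes) {
        StringBuilder sql = new StringBuilder();
        sql.append("DECLARE\n");
        sql.append("  r1 ").append(quote(pkg)).append('.').append(quote(type)).append(";\n");
        sql.append("BEGIN\n");
        sql.append("  r1 := ").append(quote(pkg)).append('.').append(quote(function))
           .append('(').append(quote(parameter)).append(" => ?);\n");
        // One "? := r1.ATTRIBUTE" assignment per record attribute
        for (String attribute : attributes)
            sql.append("  ? := r1.").append(quote(attribute)).append(";\n");
        sql.append("END;");
        return sql.toString();
    }

    // Wraps an identifier in double quotes, as in the logged statement.
    private static String quote(String identifier) {
        return '"' + identifier + '"';
    }
}
```

Calling anonymousBlock("CUSTOMERS", "PERSON", "GET_CUSTOMER", "P_CUSTOMER_ID", List.of("FIRST_NAME", "LAST_NAME")) yields the same text as in the log. With plain JDBC, such a block could then be run through a CallableStatement, binding index 1 as an IN value and registering indexes 2 and 3 as OUT parameters.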

How does jOOQ know the record types?

At runtime, all the information is hard-wired to the generated code. So, the magic is inside of the code generator. Warning: Some serious SQL ahead!

This beauty here queries the dictionary views for PL/SQL record types:

AllArguments a = ALL_ARGUMENTS.as("a");
AllArguments x = ALL_ARGUMENTS.as("x");
Field<BigDecimal> nextSibling = field(name("next_sibling"), x.SEQUENCE.getDataType());

DSL.using(configuration)
   .select(x.TYPE_OWNER, x.TYPE_NAME, x.TYPE_SUBNAME)
   .select(
       a.ARGUMENT_NAME                                               .as(ALL_TYPE_ATTRS.ATTR_NAME),
       a.SEQUENCE                                                    .as(ALL_TYPE_ATTRS.ATTR_NO),
       a.TYPE_OWNER                                                  .as(ALL_TYPE_ATTRS.ATTR_TYPE_OWNER),
       nvl2(a.TYPE_SUBNAME, a.TYPE_NAME, inline(null, a.TYPE_NAME))  .as("package_name"),
       coalesce(a.TYPE_SUBNAME, a.TYPE_NAME, a.DATA_TYPE)            .as(ALL_TYPE_ATTRS.ATTR_TYPE_NAME),
       a.DATA_LENGTH                                                 .as(ALL_TYPE_ATTRS.LENGTH),
       a.DATA_PRECISION                                              .as(ALL_TYPE_ATTRS.PRECISION),
       a.DATA_SCALE                                                  .as(ALL_TYPE_ATTRS.SCALE))
   .from(a)
   .join(table(
       select(
           a.TYPE_OWNER,
           a.TYPE_NAME,
           a.TYPE_SUBNAME,
           min(a.OWNER        ).keepDenseRankFirstOrderBy(a.OWNER, a.PACKAGE_NAME, a.SUBPROGRAM_ID, a.SEQUENCE).as(a.OWNER),
           min(a.PACKAGE_NAME ).keepDenseRankFirstOrderBy(a.OWNER, a.PACKAGE_NAME, a.SUBPROGRAM_ID, a.SEQUENCE).as(a.PACKAGE_NAME),
           min(a.SUBPROGRAM_ID).keepDenseRankFirstOrderBy(a.OWNER, a.PACKAGE_NAME, a.SUBPROGRAM_ID, a.SEQUENCE).as(a.SUBPROGRAM_ID),
           min(a.SEQUENCE     ).keepDenseRankFirstOrderBy(a.OWNER, a.PACKAGE_NAME, a.SUBPROGRAM_ID, a.SEQUENCE).as(a.SEQUENCE),
           min(nextSibling    ).keepDenseRankFirstOrderBy(a.OWNER, a.PACKAGE_NAME, a.SUBPROGRAM_ID, a.SEQUENCE).as(nextSibling),
           min(a.DATA_LEVEL   ).keepDenseRankFirstOrderBy(a.OWNER, a.PACKAGE_NAME, a.SUBPROGRAM_ID, a.SEQUENCE).as(a.DATA_LEVEL))
      .from(table(
          select(
              lead(a.SEQUENCE, 1, inline(new BigDecimal(99999999))).over(
                  partitionBy(a.OWNER, a.PACKAGE_NAME, a.SUBPROGRAM_ID, a.DATA_LEVEL)
                 .orderBy(a.SEQUENCE)
              ).as("next_sibling"),
              a.TYPE_OWNER,
              a.TYPE_NAME,
              a.TYPE_SUBNAME,
              a.OWNER,
              a.PACKAGE_NAME,
              a.SUBPROGRAM_ID,
              a.SEQUENCE,
              a.DATA_LEVEL,
              a.DATA_TYPE)
         .from(a)
         .where(a.OWNER.in(getInputSchemata()))
      ).as("a"))
      .where(a.TYPE_OWNER.in(getInputSchemata()))
      .and(a.OWNER.in(getInputSchemata()))
      .and(a.DATA_TYPE.eq("PL/SQL RECORD"))
      .groupBy(a.TYPE_OWNER, a.TYPE_NAME, a.TYPE_SUBNAME)
   ).as("x"))
   .on(row(a.OWNER, a.PACKAGE_NAME, a.SUBPROGRAM_ID).eq(x.OWNER, x.PACKAGE_NAME, x.SUBPROGRAM_ID))
   .and(a.SEQUENCE.between(x.SEQUENCE).and(nextSibling))
   .and(a.DATA_LEVEL.eq(x.DATA_LEVEL.plus(one())))
   .orderBy(x.TYPE_OWNER, x.TYPE_NAME, x.TYPE_SUBNAME, a.SEQUENCE)
   .fetch();

This is a nice little jOOQ query, which corresponds to the following equally impressive SQL query, which you can run directly in SQL Developer, or some other SQL client for Oracle:

SELECT 
  "x"."TYPE_OWNER",
  "x"."TYPE_NAME",
  "x"."TYPE_SUBNAME",
  "a"."ARGUMENT_NAME" "ATTR_NAME",
  "a"."SEQUENCE" "ATTR_NO",
  "a"."TYPE_OWNER" "ATTR_TYPE_OWNER",
  nvl2("a"."TYPE_SUBNAME", "a"."TYPE_NAME", NULL) "package_name",
  COALESCE("a"."TYPE_SUBNAME", "a"."TYPE_NAME", "a"."DATA_TYPE") "ATTR_TYPE_NAME",
  "a"."DATA_LENGTH" "LENGTH",
  "a"."DATA_PRECISION" "PRECISION",
  "a"."DATA_SCALE" "SCALE"
FROM "SYS"."ALL_ARGUMENTS" "a"
JOIN (
  SELECT 
    "a"."TYPE_OWNER",
    "a"."TYPE_NAME",
    "a"."TYPE_SUBNAME",
    MIN("a"."OWNER") KEEP (DENSE_RANK FIRST
      ORDER BY "a"."OWNER" ASC, "a"."PACKAGE_NAME" ASC, "a"."SUBPROGRAM_ID" ASC, "a"."SEQUENCE" ASC) "OWNER",
    MIN("a"."PACKAGE_NAME") KEEP (DENSE_RANK FIRST
      ORDER BY "a"."OWNER" ASC, "a"."PACKAGE_NAME" ASC, "a"."SUBPROGRAM_ID" ASC, "a"."SEQUENCE" ASC) "PACKAGE_NAME",
    MIN("a"."SUBPROGRAM_ID") KEEP (DENSE_RANK FIRST
      ORDER BY "a"."OWNER" ASC, "a"."PACKAGE_NAME" ASC, "a"."SUBPROGRAM_ID" ASC, "a"."SEQUENCE" ASC) "SUBPROGRAM_ID",
    MIN("a"."SEQUENCE") KEEP (DENSE_RANK FIRST
      ORDER BY "a"."OWNER" ASC, "a"."PACKAGE_NAME" ASC, "a"."SUBPROGRAM_ID" ASC, "a"."SEQUENCE" ASC) "SEQUENCE",
    MIN("next_sibling") KEEP (DENSE_RANK FIRST
      ORDER BY "a"."OWNER" ASC, "a"."PACKAGE_NAME" ASC, "a"."SUBPROGRAM_ID" ASC, "a"."SEQUENCE" ASC) "next_sibling",
    MIN("a"."DATA_LEVEL") KEEP (DENSE_RANK FIRST
      ORDER BY "a"."OWNER" ASC, "a"."PACKAGE_NAME" ASC, "a"."SUBPROGRAM_ID" ASC, "a"."SEQUENCE" ASC) "DATA_LEVEL"
  FROM (
    SELECT 
      lead("a"."SEQUENCE", 1, 99999999) OVER (
        PARTITION BY "a"."OWNER", "a"."PACKAGE_NAME", "a"."SUBPROGRAM_ID", "a"."DATA_LEVEL"
        ORDER BY "a"."SEQUENCE" ASC
      ) "next_sibling",
      "a"."TYPE_OWNER",
      "a"."TYPE_NAME",
      "a"."TYPE_SUBNAME",
      "a"."OWNER",
      "a"."PACKAGE_NAME",
      "a"."SUBPROGRAM_ID",
      "a"."SEQUENCE",
      "a"."DATA_LEVEL",
      "a"."DATA_TYPE"
    FROM "SYS"."ALL_ARGUMENTS" "a"
    WHERE "a"."OWNER" IN ('SAKILA', 'SYS')     -- Possibly replace schema here
    ) "a"
  WHERE ("a"."TYPE_OWNER" IN ('SAKILA', 'SYS') -- Possibly replace schema here
  AND "a"."OWNER"         IN ('SAKILA', 'SYS') -- Possibly replace schema here
  AND "a"."DATA_TYPE"      = 'PL/SQL RECORD')
  GROUP BY 
    "a"."TYPE_OWNER",
    "a"."TYPE_NAME",
    "a"."TYPE_SUBNAME"
  ) "x"
ON (("a"."OWNER", "a"."PACKAGE_NAME", "a"."SUBPROGRAM_ID")
 = (("x"."OWNER", "x"."PACKAGE_NAME", "x"."SUBPROGRAM_ID"))
AND "a"."SEQUENCE" BETWEEN "x"."SEQUENCE" AND "next_sibling"
AND "a"."DATA_LEVEL" = ("x"."DATA_LEVEL" + 1))
ORDER BY 
  "x"."TYPE_OWNER" ASC,
  "x"."TYPE_NAME" ASC,
  "x"."TYPE_SUBNAME" ASC,
  "a"."SEQUENCE" ASC

Whew.

The output shows that we got all the required information for our RECORD type:

[  TYPE INFO COLUMNS  ]   ATTR_NAME   ATTR_TYPE_NAME  LENGTH
SAKILA CUSTOMERS PERSON   FIRST_NAME  VARCHAR2        50
SAKILA CUSTOMERS PERSON   LAST_NAME   VARCHAR2        50

All of this also works for:

  • Nested types
  • Multiple IN and OUT parameters

I’ll blog about a more advanced use-case in the near future, so stay tuned.

A Hidden jOOQ Gem: Foreach Loop Over ResultQuery

A recent question on Stack Overflow about jOOQ caught my attention. The question essentially asked:

Why do both of these loops work?

// With fetch()
for (MyTableRecord rec : DSL
    .using(configuration)
    .selectFrom(MY_TABLE)
    .orderBy(MY_TABLE.COLUMN)
    .fetch()) { // fetch() here

    doThingsWithRecord(rec);
}

// Without fetch()
for (MyTableRecord rec : DSL
    .using(configuration)
    .selectFrom(MY_TABLE)
    .orderBy(MY_TABLE.COLUMN)) { // No fetch() here

    doThingsWithRecord(rec);
}

And indeed, just like in PL/SQL, you can use any jOOQ ResultQuery as a Java 5 Iterable, because that’s what it is. An Iterable<R> where R extends Record.

The semantics are simple. When Iterable.iterator() is invoked, the query is executed and the Result.iterator() is returned. So, the result is materialised in the client memory just as if I had called fetch(). Unsurprisingly, this is the implementation of AbstractResultQuery.iterator():

@Override
public final Iterator<R> iterator() {
    return fetch().iterator();
}

No magic. But it’s great that this works like PL/SQL:

FOR rec IN (SELECT * FROM my_table ORDER BY my_table.column)
LOOP
  doThingsWithRecord(rec);
END LOOP;
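The eager semantics described above can be mimicked with a toy Iterable. EagerQuery is a hypothetical stand-in invented for this sketch (it is not a jOOQ type), and the Supplier plays the role of the database round trip.

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

// Toy stand-in for a query object; the names are hypothetical, not jOOQ types.
final class EagerQuery<R> implements Iterable<R> {

    private final Supplier<List<R>> execution;
    int executions; // counts how often the "query" was run

    EagerQuery(Supplier<List<R>> execution) {
        this.execution = execution;
    }

    // "Executes" the query and materialises the whole result in memory.
    List<R> fetch() {
        executions++;
        return execution.get();
    }

    // Same shape as the implementation quoted above: iterate the fetched result.
    @Override
    public Iterator<R> iterator() {
        return fetch().iterator();
    }
}
```

In this sketch, every foreach loop over the same EagerQuery instance calls iterator() once and therefore runs the supplier once more, so looping twice means executing twice.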

Note, unfortunately, there’s no easy way to manage resources through Iterable, i.e. there’s no AutoCloseableIterable returning an AutoCloseableIterator, which could be used in an auto-closing try-with-resources style loop. This is why the entire result set needs to be fetched at the beginning of the loop. For lazy fetching, you can still use ResultQuery.fetchLazy():

try (Cursor<MyTableRecord> cursor = DSL
    .using(configuration)
    .selectFrom(MY_TABLE)
    .orderBy(MY_TABLE.COLUMN)
    .fetchLazy()) {

    for (MyTableRecord rec : cursor)
        doThingsWithRecord(rec);
}
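Why this works with try-with-resources can be shown with a small sketch of a type that is both Iterable and AutoCloseable. ToyCursor is a hypothetical class written for this example and is not jOOQ's Cursor; the wrapped Iterator stands in for an open JDBC ResultSet.

```java
import java.util.Iterator;

// Hypothetical sketch of a closeable, lazily consumed result; not jOOQ's Cursor.
final class ToyCursor<R> implements Iterable<R>, AutoCloseable {

    private final Iterator<R> source;
    boolean closed;

    ToyCursor(Iterator<R> source) {
        this.source = source;
    }

    // Rows are pulled from the source one at a time while the loop runs.
    // The cursor is single-pass: every call hands out the same iterator.
    @Override
    public Iterator<R> iterator() {
        return source;
    }

    // try-with-resources calls this when the block exits, even on exceptions.
    @Override
    public void close() {
        closed = true;
    }
}
```

The foreach loop only sees the Iterable side, while the surrounding try block takes care of the AutoCloseable side, which is the separation that a plain Iterable cannot offer.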

Happy coding!

Why Most Programmers Get Pagination Wrong

Pagination is one of those things that almost everyone gets wrong for two reasons:

  • User experience
  • Database performance

Here’s why.

What’s wrong with pagination?

Most applications blindly produce pagination like this:

[Screenshot: GMail pagination controls]

This is how GMail implements pagination. With my current settings, it displays 100 E-Mails at a time and also shows how many E-Mails there are in total, namely 1094. Those aren’t the total number of E-Mails I’ve ever had, they’re the total number of E-Mails in my “blog-idea” label (I’m using GMail as a TODO list, and yes, this blog won’t run out of articles any time soon).

What’s wrong with this practice?

Bad user experience

As a user, in most cases, I don’t really care about the total number of objects that are in my result set. At least, I don’t care about the exact number. Does it make any difference if I have 1094 or 1093 E-Mails? What about if I had 1067? Or 1000? 1000 would be precise enough as far as I’m concerned.

Also, as a user, in most cases, I don’t care that I’m on page 317 of a paginated screen that shows me rows 3170-3179 (assuming 10 rows per page). I really don’t. The page number is absolutely useless in terms of user experience.

Who got it right?

  • Facebook
  • Twitter
  • Reddit

And all the other websites that do timelines. Yes, I want to display only 10 rows at a time (or perhaps 100, but almost never all). So, pagination is important. But I don’t care about the fact that I’ve clicked 317 times on that “next page” button. If I ever browse that many pages (and I hardly ever do), then the only thing that matters is the next 10 rows. Just like when you play Civilization. You don’t care that you’re in turn 317. You just want to play one more turn.

Moreover, I never ever ever want to jump to page 317 right from the beginning. There’s absolutely no use case out there, where I search for something, and then I say, hey, I believe my search result will be item #3175 in the current sort order. Exactly. Instead, I will do any of these:

  • Refine the search
  • Sort the result

In both cases, I will get a result where the record that I’m looking for is much more likely to appear on page #1 or perhaps #2. Again, when was the last time you googled for SQL and then went to page #18375 to find that particular blog post that you were looking for? No. You searched for “Java 8 SQL” to find jOOQ, the best way to write SQL in Java 8. For instance.

How to implement a timeline with SQL

If your data source is a SQL database, you might have implemented pagination by using LIMIT .. OFFSET, or OFFSET .. FETCH or some ROWNUM / ROW_NUMBER() filtering (see the jOOQ manual for some syntax comparisons across RDBMS). OFFSET is the right tool to jump to page 317, but remember, no one really wants to jump to that page, and besides, OFFSET just skips a fixed number of rows. If there are new rows in the system between the time page number 316 is displayed to a user and when the user skips to page number 317, the rows will shift, because the offsets will shift. No one wants that either, when they click on “next”.
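The shifting effect is easy to reproduce in memory. In this minimal sketch, a list stands in for a table sorted newest-first, and OffsetShiftDemo.page is a hypothetical helper that behaves like OFFSET followed by LIMIT.

```java
import java.util.ArrayList;
import java.util.List;

// In-memory illustration; the list stands in for a table sorted newest-first.
final class OffsetShiftDemo {

    // Offset pagination: skip a fixed number of rows, then take one page.
    static <T> List<T> page(List<T> rows, int offset, int limit) {
        int from = Math.min(offset, rows.size());
        int to = Math.min(from + limit, rows.size());
        return new ArrayList<>(rows.subList(from, to));
    }

    public static void main(String[] args) {
        List<String> rows = new ArrayList<>(List.of("e", "d", "c", "b", "a"));
        System.out.println(page(rows, 0, 2)); // first page
        rows.add(0, "f");                     // a new row arrives at the top
        System.out.println(page(rows, 2, 2)); // "next" page repeats a row
    }
}
```

The first page contains e and d. After f is inserted, the page at offset 2 contains d and c, so the user sees d twice even though they only clicked “next”.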

Instead, you should be using what we refer to as “keyset pagination” (as opposed to “offset pagination”). We’ve described this in past articles here:

The SQL syntax is a bit cumbersome as the pagination criteria becomes an ordinary predicate, but if you’re using jOOQ, you can use the simple synthetic SQL SEEK clause as such:

DSL.using(configuration)
   .select(PLAYERS.PLAYER_ID,
           PLAYERS.FIRST_NAME,
           PLAYERS.LAST_NAME,
           PLAYERS.SCORE)
   .from(PLAYERS)
   .where(PLAYERS.GAME_ID.eq(42))
   .orderBy(PLAYERS.SCORE.desc(),
            PLAYERS.PLAYER_ID.asc())
   .seek(949, 15)
   .limit(10)
   .fetch();

The above will fetch the next 10 players after the player with SCORE 949 and ID 15. The pagination really depends on the ORDER BY clause, which is why you have to provide as many values in the pagination as you provided columns in the ORDER BY clause.
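What “after the player with SCORE 949 and ID 15” means for this ORDER BY can be sketched in memory. The code below is an illustration only: KeysetDemo and Player are hypothetical names, and the filter spells out a predicate of roughly the shape a keyset query needs for SCORE DESC, PLAYER_ID ASC. The exact SQL that jOOQ renders may differ.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// In-memory illustration of the keyset ("seek") predicate; names are hypothetical.
final class KeysetDemo {

    record Player(int playerId, int score) {}

    // ORDER BY score DESC, player_id ASC
    static final Comparator<Player> ORDER =
        Comparator.comparingInt(Player::score).reversed()
                  .thenComparingInt(Player::playerId);

    // Rows strictly after (lastScore, lastId) in that order:
    // score < lastScore OR (score = lastScore AND player_id > lastId)
    static List<Player> nextPage(List<Player> players, int lastScore, int lastId, int limit) {
        return players.stream()
            .filter(p -> p.score() < lastScore
                      || (p.score() == lastScore && p.playerId() > lastId))
            .sorted(ORDER)
            .limit(limit)
            .collect(Collectors.toList());
    }
}
```

Because the page boundary is a value rather than a row count, rows inserted above the boundary no longer shift the next page. In a database, the same predicate can use an index on the ORDER BY columns instead of skipping rows.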

Now that we’ve fixed the user experience, let’s also look at …

How OFFSET pagination is bad for performance

The previously linked articles about keyset pagination also mention the poor performance of OFFSET pagination. Which is kind of obvious as OFFSET has to skip a given number of rows after applying all predicates and sorting, etc. So the database has to do all the work and then throw away 3170 records (if you’re jumping to page 317 with 10 rows per page). What a waste.

The following diagram shows very nicely how OFFSET gets slower and slower for large offsets:

Reproduced from use-the-index-luke.com with permission by Markus Winand

That’s the obvious problem, but there’s another one. People always count the total number of rows to calculate the total number of possible pages. Why? To display nonsense like the following:

Page number:
1 2 3 ... 315 316 317 318 319 ... 50193 50194

Wow. OK so we’re on page number 317, which we don’t really care about in the first place, but we could just as well jump to page number 50194. This means that the database needed to run the query across all the rows just to be sure we get exactly 50194 pages in total.

Google something like page number pagination and observe the number of tutorials that show how you can implement the above nonsense. On Google Image search, you’ll find:

[Screenshot: Google Image search results for page number pagination]

At the same time, the Google search itself reveals:

[Screenshot: pagination links of the Google search result page]

As you can see, Google estimates that there are probably at least 10 pages for your search and you can go “next”. Yes, you can skip some pages, but you cannot skip to a page number 50194, because, again:

  • No one wants that
  • It’s costly to predict, even for Google

In fact, Google search implements keyset pagination as well, just like Twitter, Facebook, Reddit. And they don’t display the total number of pages because counting that total can be very costly, depending on your database.

In particular, databases that do not support window functions will require you to run two separate queries:

  1. The actual query with a LIMIT clause
  2. An additional query replacing the SELECT column list with a simple COUNT(*)

Needless to say that this is not the best approach. If your database supports window functions (read about that miraculous SQL feature here on the jOOQ blog), you could produce the total row count in one go as such:

SELECT 
  rental_date, 
  inventory_id,
  COUNT(*) OVER()
FROM rental
WHERE customer_id = 1
ORDER BY rental_date
LIMIT 10

That COUNT(*) OVER() window function is like an ordinary aggregate function, except that it doesn’t group your results. It just counts all the rows of your result and produces that count in each row, prior to limiting the result to 10.

When run against the Sakila database, the above produces:

rental_date          inventory_id  count
2005-05-25 11:30:37          3021     32
2005-05-28 10:35:23          4020     32
2005-06-15 00:54:12          2785     32
2005-06-15 18:02:53          1021     32
2005-06-15 21:08:46          1407     32
2005-06-16 15:18:57           726     32
2005-06-18 08:41:48           197     32
2005-06-18 13:33:59          3497     32
2005-06-21 06:24:45          4566     32
2005-07-08 03:17:05          1443     32

So, we’re displaying the first page with 10 rows and we need to provide navigational links for a total of 4 pages because we have a total of 32 rows.
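The page count follows from a ceiling division of the total row count by the page size. A minimal sketch, with PageMath being a hypothetical helper name:

```java
// Ceiling division: how many pages are needed for a given total row count.
final class PageMath {

    // Assumes totalRows >= 0 and pageSize > 0.
    static int totalPages(int totalRows, int pageSize) {
        return (totalRows + pageSize - 1) / pageSize;
    }
}
```

For the 32 rows above and a page size of 10, this gives (32 + 9) / 10 = 4 pages, the last of which holds only 2 rows.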

What happens when we benchmark this query on PostgreSQL? The first run doesn’t calculate this COUNT(*) OVER() value, whereas the second one does:

DO $$
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT INT := 10000;
  rec RECORD;
BEGIN
  v_ts := clock_timestamp();

  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT 
        rental_date, 
        inventory_id
      FROM rental
      WHERE customer_id = 1
      ORDER BY rental_date
      LIMIT 10
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;

  RAISE INFO 'Statement 1: %', (clock_timestamp() - v_ts); 
  v_ts := clock_timestamp();

  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT 
        rental_date, 
        inventory_id,
        COUNT(*) OVER()
      FROM rental
      WHERE customer_id = 1
      ORDER BY rental_date
      LIMIT 10
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;

  RAISE INFO 'Statement 2: %', (clock_timestamp() - v_ts); 
END$$;

The result clearly indicates that in PostgreSQL, there’s a significant overhead in calculating this value:

INFO:  Statement 1: 00:00:01.041823
INFO:  Statement 2: 00:00:03.57145

Oracle optimises things a bit better when you’re using ROWNUM to paginate:

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT NUMBER := 5000;
BEGIN
  v_ts := SYSTIMESTAMP;
     
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT 
        rental_date, 
        inventory_id
      FROM (
        SELECT 
          rental.*, 
          ROWNUM rn
        FROM rental
        WHERE customer_id = 1
        ORDER BY rental_date
      ) rental
      WHERE rn < 5
      ORDER BY rn
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
     
  dbms_output.put_line('Statement 1: ' || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;
     
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT 
        rental_date, 
        inventory_id,
        COUNT(*) OVER()
      FROM (
        SELECT 
          rental.*,  
          ROWNUM rn
        FROM rental
        WHERE customer_id = 1
        ORDER BY rental_date
      ) rental
      WHERE rn < 5
      ORDER BY rn
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
     
  dbms_output.put_line('Statement 2: ' || (SYSTIMESTAMP - v_ts));
END;
/

Result:

Statement 1: X
Statement 2: X +/- 1%

So, the COUNT(*) seems to be calculated “for free”. Bonus question: Why is that?

Due to Oracle license restrictions, we cannot publish benchmark results here, comparing Oracle with PostgreSQL, sorry, but you can run the above code yourself against the Sakila database:
https://www.jooq.org/sakila

Conclusion

TL;DR: OFFSET pagination bad. Keyset pagination good.

If you want to paginate in your application, please check whether you really, really, really need:

  • Exact page number
  • High page numbers
  • The last page number
  • The total number of rows

Because if you don’t (and in 98% of all UIs, you really don’t), then you can drastically speed up your queries while providing your users a much better experience. If that’s not a win-win situation worth thinking about…?

And don’t forget, jOOQ ships with native keyset pagination support!

“What Java ORM do You Prefer, and Why?” – SQL of Course!

Catchy headline, yes. But check out this Stack Overflow question by user Mike:

(I’m duplicating it here on the blog, as it might be deleted soon)

It’s a pretty open ended question. I’ll be starting out a new project and am looking at different ORMs to integrate with database access.

Do you have any favorites? Are there any you would advise staying clear of?

And the top voted answer (164 points by David Crawshaw) is: “Just use SQL”:

I have stopped using ORMs.

The reason is not any great flaw in the concept. Hibernate works well. Instead, I have found that queries have low overhead and I can fit lots of complex logic into large SQL queries, and shift a lot of my processing into the database.

So consider just using the JDBC package.

The second answer (66 points by user simon) is, again: “Just use SQL”:

None, because having an ORM takes too much control away with small benefits. The time savings gained are easily blown away when you have to debug abnormalities resulting from the use of the ORM. Furthermore, ORMs discourage developers from learning SQL and how relational databases work and using this for their benefit.

The third answer (51 points by myself) is saying, once more: “Use SQL” (and use it with jOOQ).

The best way to write SQL in Java

Only the fourth answer (46 points by Abdullah Jibaly) mentions Hibernate, the most popular ORM in the Java ecosystem.

The truth is, as we’ve shown numerous times on this blog: Hibernate/JPA/ORMs are good tools to get rid of boring (and complex) CRUD. But that’s just boilerplate logic with little value to your business logic. The interesting stuff – the queries, the batch and bulk processing, the analytics, the reporting, they’re all best done with SQL.

Stay tuned as we’re entering an era of programming where object orientation fades, and functional / declarative programming makes data processing extremely easy and lean again.

With Commercial Licensing, Invest in Innovation, not Protection

When people start creating commercially licensed software (like we did, in 2013 with jOOQ), there is always the big looming question:

What do I do about piracy?

I’ve had numerous discussions with fellow entrepreneurs about this topic, and this fear is omnipresent. There has also been a recent discussion on reddit, titled “prevent sharing of a Java library”. I felt reminded of the early commercial jOOQ days, when I discussed the various options / modalities of the new commercial jOOQ license with the Data Geekery legal counsel – which was clearly the biggest financial investment in early company days.

Build licensing around trust, not fear

One thing that bothered me most about jOOQ’s dual license in its early days was that our paying customers would have fewer rights than our continued Open Source users. Obviously. If you’re using jOOQ with an Open Source database, you can use the jOOQ Open Source Edition for free under the terms of the very permissive Apache License 2.0. You can do pretty much anything with jOOQ including forking it, rewriting it, creating and distributing a new jOOQ (just don’t name it jOOQ, we own the trademark). The only limitation is: it doesn’t work with commercial databases, but you don’t care about that if you’re using MySQL or PostgreSQL for the next 10 years.

If you’re using jOOQ with a commercial database, however, you need to purchase a jOOQ Professional Edition or jOOQ Enterprise Edition license. Of course, there are costs, but that’s not the problem, because jOOQ is awesome and delivers 50x the value it costs.

The problem is that:

  • Interested developers, architects, etc. now have to go through the hassle of convincing their employer’s legal / purchasing / compliance / … departments and do all sorts of paperwork.
  • Paying customers (at the beginning) could no longer patch jOOQ if they found a bug. They had to wait for us, the vendor, to deliver a fix.

These things were remedied rather quickly in the commercial license text. The commercial jOOQ license now grants access to the commercial source code and allows users to modify jOOQ themselves (of course warranty is then disclaimed), without needing to wait for the vendor to deliver the fix. It is still not allowed to distribute the modification’s source code, although we’re looking into possible legal solutions to allow that as well, such that commercial customers can share patches for commercial parts of jOOQ as well.

In other words: We want our commercial customers to feel as if jOOQ were Open Source in their everyday job.

Isn’t that crazy?

So, people get the entire source code and can build jOOQ. Aren’t we afraid that our commercial, “cracked” jOOQ distributions wind up on warez sites? Of course we are. And it happens. And we’re maintaining a list of companies that “obviously” don’t comply with our terms (e.g. they’ve been using the free trial in production for 2 years). But they’re only very few. And even if they weren’t few, should we introduce tracking logic in jOOQ to check if customers / trial users are compliant? Should we collect IP addresses? User E-Mails? Count workstations? Shut down jOOQ, if non compliant? Shut it down where, on production servers, perhaps?

The truth is, most companies are honest. We’ve had many customers frequently upgrade their contracts. E.g. every couple of months, they’ve purchased new licenses. We’ve had other customers reduce their contracts. The team started with 5 licenses and now consists only of 1 person doing maintenance work. We’ve had customers not touching their contracts, using jOOQ with e.g. 10 licenses for a long time. Some of these are overlicensed, yes. Some of these are underlicensed. It happens. It’s not good, but it’s also not the end of the world. We’re in constant touch with them to see if their license count is still up to date. In the end, we trust them. They trust us.

The worst thing we could do now is introduce some sort of license checker that might be buggy and accidentally shut down our honest customers’ production systems! All for the slight chance that we might catch someone who “cracked” our software (and who could probably “crack” our license checker as well).

Even if we’re not shutting down the software but logging messages somewhere, the end user (our customers’ customer) might get a very weird feeling when they see it. We would be indirectly damaging our customers’ reputation for what was probably just an oversight. Or, again, a bug in our license checker.

Is that really worth the trouble? No way!

(an exception is the free trial version, which must be deleted after one month)

Different types of software

The problem is: There are different types of software. Essentially, there are:

  • Tools
  • Libraries
  • Servers
  • SaaS

The difference is:

  • Tools run on premise but are independent of the end user application. The customer doesn’t depend on the tool.
    For instance: IntelliJ IDEA
  • Libraries run on premise and are embedded in the end user application. The customer completely depends on the library.
    For instance: jOOQ
  • Servers run on premise and are linked to the end user application but run independently. The customer can often replace the server, but cannot “open” it. It’s a black box.
    For instance: Oracle Database
  • SaaS runs in the cloud and is completely independent of the end user application. The customer cannot influence it in any way.
    For instance: Salesforce.com

As you can see, tools are “not critical” for the customer (being “just” UIs giving access to “the real system”), and while servers and SaaS are critical, all three of them run independently. It is easy for a vendor to implement license checkers inside of those because they’re in control of running the software. With libraries, it’s different. You cannot make any assumptions about how your library is “run”. Is it part of a single JVM? Or multiple JVMs? Run for a couple of seconds only, or for a long time? Used frequently or only very rarely? Is it allowed to spawn its own threads, collect its own data?

Too many open questions.

Long story short: Invest in innovation, not protection

Unless protection is really important in your area because you’re using extremely complex algorithms that no one should know about, you shouldn’t worry too much about piracy from a technical perspective. In our case, jOOQ isn’t super secret. Anyone can build their own jOOQ, and there are many (much simpler) competitor frameworks. Our business is maintaining by far the best Java SQL DSL and apparently, no one else wants to compete with us in this niche, so why be afraid of theft?

If protection is important, well then run a SaaS, because then you don’t need to ship any software. For instance: Google, one of the biggest SaaS vendors out there, doesn’t share its search engine algorithms with you.

Once you stop worrying about who is going to steal from you, you can start investing all that time in awesome new features and quality to make those loyal and honest customers happy who happily pay for your software. And who knows. Perhaps some of your “pirate customers” will eventually switch jobs and work for someone who takes compliance more seriously. They have been happy “customers” too, and will also recommend your software to their new peers.

Using Stored Procedures With JPA, JDBC… Meh, Just Use jOOQ

The current edition of the Java magazine has an article about Big Data Best Practices for JDBC and JPA by Josh Juneau:
http://www.javamagazine.mozaicreader.com/MayJune2016

The article shows how to use a stored procedure with JDBC (notice how resources aren’t closed, unfortunately; this is commonly forgotten, even in Java Magazine articles):

// Using JDBC to call upon a database stored
// procedure
CallableStatement cs = null;
try {
    cs = conn.prepareCall("{call DUMMY_PROC(?,?)}");
    cs.setString(1, "This is a test");
    cs.registerOutParameter(2, Types.VARCHAR);
    cs.executeQuery();

    // Do something with result
    String returnStr = cs.getString(2);
} catch (SQLException ex){
    ex.printStackTrace();
}
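
As a side note on the unclosed resources: one possible way to call the same procedure while making sure the statement is closed is try-with-resources, sketched below. Only `DUMMY_PROC` and its two parameters come from the article's example. The class `DummyProcCall` and the method `callDummyProc` are hypothetical names, and the sketch assumes the caller owns and closes the `Connection`.

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Types;

final class DummyProcCall {

    // Hypothetical helper: calls DUMMY_PROC(IN VARCHAR, OUT VARCHAR)
    // and returns the OUT parameter's value.
    static String callDummyProc(Connection conn, String input) throws SQLException {

        // try-with-resources closes the statement on every exit path,
        // including when an exception is thrown
        try (CallableStatement cs = conn.prepareCall("{call DUMMY_PROC(?,?)}")) {
            cs.setString(1, input);
            cs.registerOutParameter(2, Types.VARCHAR);
            cs.execute();

            // Read the OUT parameter before the statement is closed
            return cs.getString(2);
        }
    }
}
```

The `SQLException` is propagated to the caller here instead of being printed and swallowed, so a failed call cannot go unnoticed.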

And with JPA:

// Utilize JPA to call a database stored procedure
// Add @NamedStoredProcedureQuery to entity class
@NamedStoredProcedureQuery(
    name="createEmp", procedureName="CREATE_EMP",
    parameters = {
        @StoredProcedureParameter(
            mode= ParameterMode.IN,
            type=String.class,
            name="first"),
        @StoredProcedureParameter(
            mode = ParameterMode.IN,
            type=String.class,
            name="last")
    })

// Calling upon stored procedure
StoredProcedureQuery qry =
    em.createStoredProcedureQuery("createEmp");
qry.setParameter("first", "JOSH");
qry.setParameter("last","JUNEAU");
qry.execute();

Specifically the latter was also recently discussed in blog posts by Vlad Mihalcea and Thorben Janssen.

Do you like verbosity and complexity?

No? Neither do we. This is why we give you a third option instead: Just use jOOQ. Here’s the equivalent jOOQ code:

// JDBC example:
String returnStr = Routines.dummyProc(
    config, "This is a test");

// JPA example
Routines.createEmp(config, "JOSH", "JUNEAU");

Yes! That’s it. Don’t waste time manually configuring your bind variables with JDBC API calls, or JPA annotations. No one likes writing annotations for stored procedures. With jOOQ and jOOQ’s code generator, procedure calls are:

  • A one-liner
  • A no-brainer
  • A way to bring back the fun to stored procedures

Learn more about using Oracle stored procedures with nested collections and object types here:
https://blog.jooq.org/2014/11/04/painless-access-from-java-to-plsql-procedures-with-jooq