Stop Trying to Emulate SQL OFFSET Pagination with Your In-House DB Framework!

I’m pretty sure you’ve gotten it wrong in numerous ways, so far. And you probably won’t get it right any time soon. So why waste your precious time on SQL tweaking, when you could be implementing business logic?

Let me explain…

It hasn’t been until the recent SQL:2008 standard that what MySQL users know as LIMIT .. OFFSET was standardised into the following simple statement:

SELECT * 
FROM BOOK 
OFFSET 2 ROWS 
FETCH NEXT 1 ROWS ONLY

Yes. So many keywords.

SQL is indeed a very verbose language. Personally, we really like the conciseness of MySQL’s / PostgreSQL’s LIMIT .. OFFSET clause, which is why we chose that for the jOOQ DSL API

In SQL:

SELECT * FROM BOOK LIMIT 1 OFFSET 2

In jOOQ:

select().from(BOOK).limit(1).offset(2);

Now, when you’re a SQL framework vendor, or when you’re rolling your own, in-house SQL abstraction, you might think about standardising this neat little clause. Here’s a couple of flavours from databases that natively support offset pagination:

-- MySQL, H2, HSQLDB, Postgres, and SQLite
SELECT * FROM BOOK LIMIT 1 OFFSET 2

-- CUBRID supports a MySQL variant of the 
-- LIMIT .. OFFSET clause
SELECT * FROM BOOK LIMIT 2, 1

-- Derby, SQL Server 2012, Oracle 12, SQL:2008
SELECT * FROM BOOK 
OFFSET 2 ROWS FETCH NEXT 1 ROWS ONLY

-- Ingres. Eek, almost the standard. Almost!
SELECT * FROM BOOK 
OFFSET 2 FETCH FIRST 1 ROWS ONLY

-- Firebird
SELECT * FROM BOOK ROWS 2 TO 3

-- Sybase SQL Anywhere
SELECT TOP 1 ROWS START AT 3 * FROM BOOK

-- DB2 (without OFFSET)
SELECT * FROM BOOK FETCH FIRST 1 ROWS ONLY

-- Sybase ASE, SQL Server 2008 (without OFFSET)
SELECT TOP 1 * FROM BOOK

So far, so good. These can all be handled. Some databases put offsets before limits, others put limits before offsets, and the T-SQL family puts the whole TOP clause before the SELECT list. This is easy to emulate. Now what about:

  • Oracle 11g and less
  • SQL Server 2008 and less
  • DB2 with OFFSET

(note that you can enable various alternative syntaxes in DB2)

When you google for this, you will find millions of ways to emulate OFFSET .. FETCH in those older databases. The optimal solutions always involve:

  • Using doubly-nested derived tables with ROWNUM filtering in Oracle
  • Using single-nested derived tabels with ROW_NUMBER() filtering in SQL Server and DB2

So you’re emulating it.

Do you think you will get it right?

;-)

Let us go through a couple of issues that you may not have thought about.

First off, Oracle. Oracle obviously wanted to create a maximum vendor-lockin, which is only exceeded by Apple’s recent introduction of Swift. This is why ROWNUM solutions perform best, even better than SQL:2003 standard window function based solutions. Don’t believe it? Read this very interesting article on Oracle offset pagination performance.

So, the optimal solution in Oracle is:

-- PostgreSQL syntax:
SELECT ID, TITLE 
FROM BOOK 
LIMIT 1 OFFSET 2

-- Oracle equivalent:
SELECT *
FROM (
  SELECT b.*, ROWNUM rn
  FROM (
    SELECT ID, TITLE
    FROM BOOK
  ) b
  WHERE ROWNUM <= 3 -- (1 + 2)
)
WHERE rn > 2

So that’s really the equivalent?

Of course not. You’re selecting an additional column, the rn column. You might just not care in most cases, but what if you wanted to make a limited subquery to be used with an IN predicate?

-- PostgreSQL syntax:
SELECT *
FROM BOOK
WHERE AUTHOR_ID IN (
  SELECT ID
  FROM AUTHOR
  LIMIT 1 OFFSET 2
)

-- Oracle equivalent:
SELECT *
FROM BOOK
WHERE AUTHOR_ID IN (
  SELECT * -- Ouch. These are two columns!
  FROM (
    SELECT b.*, ROWNUM rn
    FROM (
      SELECT ID
      FROM AUTHOR
    ) b
    WHERE ROWNUM <= 3
  )
  WHERE rn > 2
)

So, as you can see, you’ll have to do some more sophisticated SQL transformation. If you’re manually emulating LIMIT .. OFFSET, then you might just patch the ID column into the subquery:

SELECT *
FROM BOOK
WHERE AUTHOR_ID IN (
  SELECT ID -- better
  FROM (
    SELECT b.ID, ROWNUM rn -- better
    FROM (
      SELECT ID
      FROM AUTHOR
    ) b
    WHERE ROWNUM <= 3
  )
  WHERE rn > 2
)

So, that’s more like it, right? But since you’re not writing this manually every time, you’re about to start creating your own nifty in-house SQL framework covering the 2-3 use cases that you’ve encountered so far, right?

You can do it. So you’ll regex-search-replace column names automagically to produce the above.

So now, it is correct?

Of course not! Because you can have ambiguous column names in top-level SELECTs, but not in nested selects. What if you want to do this:

-- PostgreSQL syntax:
-- Perfectly valid repetition of two ID columns
SELECT BOOK.ID, AUTHOR.ID
FROM BOOK
JOIN AUTHOR
ON BOOK.AUTHOR_ID = AUTHOR.ID
LIMIT 1 OFFSET 2

-- Oracle equivalent:
SELECT *
FROM (
  SELECT b.*, ROWNUM rn
  FROM (
    -- Ouch! ORA-00918: column ambiguously defined
    SELECT BOOK.ID, AUTHOR.ID
    FROM BOOK
    JOIN AUTHOR
    ON BOOK.AUTHOR_ID = AUTHOR.ID
  ) b
  WHERE ROWNUM <= 3
)
WHERE rn > 2

Nope. And the trick of manually patching ID columns from the previous example doesn’t work, because you have multiple ID instances. And renaming the columns to random values is nasty, because the user of your home-grown in-house database framework wants to receive well-defined column names. I.e. ID and… ID.

So, the solution is to rename the columns twice. Once in each derived table:

-- Oracle equivalent:
-- Rename synthetic column names back to original
SELECT c1 ID, c2 ID
FROM (
  SELECT b.c1, b.c2, ROWNUM rn
  FROM (
    -- synthetic column names here
    SELECT BOOK.ID c1, AUTHOR.ID c2
    FROM BOOK
    JOIN AUTHOR
    ON BOOK.AUTHOR_ID = AUTHOR.ID
  ) b
  WHERE ROWNUM <= 3
)
WHERE rn > 2

But now, we’re done?

Of course not! What if you doubly nest such a query? Will you think about doubly renaming ID columns to synthetic names, and back? … ;-) Let’s leave it here and talk about something entirely different:

Does the same thing work for SQL Server 2008?

Of course not! In SQL Server 2008, the most popular approach is to use window functions. Namely, ROW_NUMBER(). So, let’s consider:

-- PostgreSQL syntax:
SELECT ID, TITLE 
FROM BOOK 
LIMIT 1 OFFSET 2

-- SQL Server equivalent:
SELECT b.*
FROM (
  SELECT ID, TITLE, 
    ROW_NUMBER() OVER (ORDER BY ID) rn
  FROM BOOK
) b
WHERE rn > 2 AND rn <= 3

So that’s it, right?

Of course not! ;-)

OK, we’ve already had this issue. We should not select *, because that would generate too many columns in the case that we’re using this as a subquery for an IN predicate. So let’s consider the correct solution with synthetic column names:

-- SQL Server equivalent:
SELECT b.c1 ID, b.c2 TITLE
FROM (
  SELECT ID c1, TITLE c2,
    ROW_NUMBER() OVER (ORDER BY ID) rn
  FROM BOOK
) b
WHERE rn > 2 AND rn <= 3

But now we got it, right?

Make an educated guess: Nope!

What happens, if you add an ORDER BY clause to the original query?

-- PostgreSQL syntax:
SELECT ID, TITLE 
FROM BOOK 
ORDER BY SOME_COLUMN
LIMIT 1 OFFSET 2

-- Naive SQL Server equivalent:
SELECT b.c1 ID, b.c2 TITLE
FROM (
  SELECT ID c1, TITLE c2,
    ROW_NUMBER() OVER (ORDER BY ID) rn
  FROM BOOK
  ORDER BY SOME_COLUMN
) b
WHERE rn > 2 AND rn <= 3

Now, that doesn’t work in SQL Server. Subqueries are not allowed to have an ORDER BY clause, unless they also have a TOP clause (or an OFFSET .. FETCH clause in SQL Server 2012).

OK, we can probably tweak this using TOP 100 PERCENT to make SQL Server happy.

-- Better SQL Server equivalent:
SELECT b.c1 ID, b.c2 TITLE
FROM (
  SELECT TOP 100 PERCENT
    ID c1, TITLE c2,
    ROW_NUMBER() OVER (ORDER BY ID) rn
  FROM BOOK
  ORDER BY SOME_COLUMN
) b
WHERE rn > 2 AND rn <= 3

Now, that’s correct SQL according to SQL Server, although you do not have a guarantee that the ordering of the derived table will survive after query execution. It may well be that the ordering is changed again by some influence.

If you wanted to order by SOME_COLUMN in the outer query, you’d have to again transform the SQL statement to add another synthetic column:

-- Better SQL Server equivalent:
SELECT b.c1 ID, b.c2 TITLE
FROM (
  SELECT TOP 100 PERCENT
    ID c1, TITLE c2,
    SOME_COLUMN c99,
    ROW_NUMBER() OVER (ORDER BY ID) rn
  FROM BOOK
) b
WHERE rn > 2 AND rn <= 3
ORDER BY b.c99

That does start getting a bit nasty. And let’s guess whether:

This is the correct solution!

Of course not! What if the original query had DISTINCT in it?

-- PostgreSQL syntax:
SELECT DISTINCT AUTHOR_ID
FROM BOOK 
LIMIT 1 OFFSET 2

-- Naive SQL Server equivalent:
SELECT b.c1 AUTHOR_ID
FROM (
  SELECT DISTINCT AUTHOR_ID c1,
    ROW_NUMBER() OVER (ORDER BY AUTHOR_ID) rn
  FROM BOOK
) b
WHERE rn > 2 AND rn <= 3

Now, what happens if an author has written several books? Yes, the DISTINCT keyword should remove such duplicates, and effectively, the PostgreSQL query will correctly remove duplicates first, and then apply LIMIT and OFFSET.

However, the ROW_NUMBER() predicate always generates distinct row numbers before DISTINCT can remove them again. In other words, DISTINCT has no effect.

Luckily, we can tweak this SQL again, using this neat little trick:

-- Better SQL Server equivalent:
SELECT b.c1 AUTHOR_ID
FROM (
  SELECT DISTINCT AUTHOR_ID c1,
    DENSE_RANK() OVER (ORDER BY AUTHOR_ID) rn
  FROM BOOK
) b
WHERE rn > 2 AND rn <= 3

Read more about this trick here:

SQL Trick: row_number() is to SELECT what dense_rank() is to SELECT DISTINCT.

Watch out that the ORDER BY clause must contain all columns from the SELECT field list. Obviously, this will limit the acceptable columns in the SELECT DISTINCT field list to columns that are allowed in a window function’s ORDER BY clause (e.g. no other window functions).

We could of course try to fix that as well using common table expressions, or we consider

Yet another issue??

Yes, of course!

Do you even know what the column(s) in the window function’s ORDER BY clause should be? Have you just picked any column, at random? What if that column doesn’t have an index on it, will your window function still perform?

The answer is easy when your original SELECT statement also has an ORDER BY clause, then you should probably take that one (plus all the columns from the SELECT DISTINCT clause if applicable).

But what if you don’t have any ORDER BY clause?

Yet another trick! Use a “constant” variable:

-- Better SQL Server equivalent:
SELECT b.c1 AUTHOR_ID
FROM (
  SELECT AUTHOR_ID c1,
    ROW_NUMBER() OVER (ORDER BY @@version) rn
  FROM BOOK
) b
WHERE rn > 2 AND rn <= 3

Yes, you need to use a variable, because constants are not allowed in those ORDER BY clauses, in SQL Server. Painful, I know.

Read more about this @@version trick here.

Are we done yet!?!?

Probably not ;-) But we have probably covered around 99% of the common and edge cases. We can sleep nicely, now.

Note that all of these SQL transformations are implemented in jOOQ. jOOQ is the only SQL abstraction framework that takes SQL seriously (with all its warts and caveats), standardising over all of this madness.

As mentioned in the beginning, with jOOQ, you just write:

// Don't worry about general emulation
select().from(BOOK).limit(1).offset(2);

// Don't worry about duplicate column names
// in subselects
select(BOOK.ID, AUTHOR.ID)
.from(BOOK)
.join(AUTHOR)
.on(BOOK.AUTHOR_ID.eq(AUTHOR.ID))
.limit(1).offset(2);

// Don't worry about invalid IN predicates
select()
.from(BOOK)
.where(BOOK.AUTHOR_ID).in(
    select(AUTHOR.ID)
    .from(AUTHOR)
    .limit(1).offset(2)
);

// Don't worry about the ROW_NUMBER() vs.
// DENSE_RANK() distinction
selectDistinct(AUTHOR_ID)
    .from(BOOK).limit(1).offset(2);

With jOOQ, you can just write your Oracle SQL or Transact SQL as if it were as awesome as PostgreSQL! … without jumping the SQL ship entirely, and moving on to JPA.

jOOQ, the best way to write SQL in Java

Keyset paging

Now, of course, if you have been reading our blog, or our partner blog SQL Performance Explained, you should know by now that OFFSET pagination is often a bad choice in the first place. You should know that keyset pagination almost always outperforms OFFSET pagination.

Read about how jOOQ natively supports keyset pagination using the SEEK clause, here.

Popular ORMs Don’t do SQL

I’m contemplating about what has happened in the ISO / IEC SQL standard during the last 15 years. We’ve had quite a few new features added to our beloved SQL language. Check this out:

  • With the ISO/IEC SQL:1999 standard, we could take advantage of grouping sets and (recursive) common table expressions.
  • With the ISO/IEC SQL:2003 standard, we’ve had the very sophisticated window functions and the MERGE statement.
  • With the ISO/IEC SQL:2008 standard, we could perform partitioned JOINs.
  • With the ISO/IEC SQL:2011 standard, we can now interoperate with temporal databases (implemented so far in IBM DB2 and Oracle).

And obviously, there’s much more goodness hiding in the almost unreadable 1423-page-long documents.

But JPA…

Now, does any of these awesome features appear in JPA? Nope. Will the next SQL standard introduce new awesome features? I’m sure it will! I could imagine that the Oracle / CUBRID CONNECT BY clause, or the Oracle / SQL Server PIVOT / UNPIVOT clauses are good candiadates for standardisation. I’d go absolutely nuts if Oracle’s crazy MODEL clause would make it, too.

While exciting things happen at these ends, the ORM impedance mismatch will further deepen and confirm Charles Humble’s recent findings from the QCon, where he’s observed an increasing number of people who are unhappy with popular ORMs’ ever increasing complexity. A complexity example: NamedEntityGraph!

@NamedEntityGraph(
    name="ExecutiveProjects"
    attributeNodes={
        @NamedAttributeNode("address"),
        @NamedAttributeNode(
            value="projects",
            subgraph="projects"
        )
    },
    subgraphs={
        @NamedSubgraph(
            name="projects",
            attributeNodes={
                @NamedAttributeNode("properties")
            }
        ),
        @NamedSubgraph(
            name="projects",
            type=LargeProject.class,
            attributeNodes={
                @NamedAttributeNode("executive")
            }
        )
    }
)

Man, did this really have to be added to JPA? Stack Overflow cannot display that many annotations on a single screen! Well, if this is JEE’s answer to SQL’s recent evolutions, then I’m glad I don’t do too much JEE these days. I’m doing SQL, which is an awesome language if let running loose.

Binary data in SQL, more trivia

Yesterday, I blogged about inlining boolean literals in SQL. This goes on with other data types. Let’s look at BLOBs and BINARY data in general. This is also defined in SQL standards, not in SQL 1992, though:

<binary string literal> ::=
  X <quote> [ <space>... ] 
  [ { <hexit> [ <space>... ] <hexit> [ <space>... ] }... ] <quote>

 
But is that supported as such in all databases? Of course, as usual, the answer goes “NO”. Here’s a little summary to see how binary literals are handled in which database:

SQL Server, Sybase ASE, Sybase SQL Anywhere

-- T-SQL databases treat binary literals as other languages
-- treat hexadecimal numbers
INSERT INTO lob_table VALUES (0x01FF);

DB2

-- DB2 follows the standard, but you may need to use
-- a blob constructor. This is not needed for VARCHAR
-- FOR BIT DATA types
INSERT INTO lob_table VALUES (blob(X'01FF'));

Derby, H2, HSQLDB, Ingres, MySQL, SQLite

-- Most databases follow the standard, especially
-- the Java databases
INSERT INTO lob_table VALUES (X'01FF');

Oracle

-- Oracle Omit's the standard's "X"
INSERT INTO lob_table VALUES (hextoraw('01FF'));

Postgres

-- For some odd reason Postgres, is really off this time
-- using octal representation of bytes in literals.
-- There is also hex encoding but I didn't get that to work
-- The explicit cast is important, though
INSERT INTO lob_table VALUES (E'\\001\\377'::bytea);

SQL incompatibilities: NOT IN and NULL values

This is something where many hours of debugging have been spent in the lives of many SQL developers. The various situations where you can have NULL values in NOT IN predicates and anti-joins. Here’s a typical situation:

with data as (
  select 1 as id from dual union all
  select 2 as id from dual union all
  select 3 as id from dual
)
select * from data
where id not in (1, null)

What do you think this will return? Well, since “dual” indicates an Oracle database, you might say: “an empty result set”. And you would be right for Oracle. In fact, you would be right for any of these databases:

  • DB2
  • Derby
  • H2
  • Ingres
  • Oracle
  • Postgres
  • SQL Server
  • SQLite
  • Sybase

BUT! You would be wrong for any of these ones:

  • HSQLDB
  • MySQL
  • Sybase ASE

Why the discrepancy?

Intuitively, you’d say that all the big ones treat NULL specially in NOT IN predicates, and it is easy to understand, why:

-- This predicate here...
id not in (1, null)

-- Could be seen as equivalent to this one:
id != 1 and id != null

There’s no id that fulfills the above predicate id != null (not even null itself), hence an empty result set. MySQL is known for some strong abuse of SQL standards compliance, so it’s not surprising that they tweaked this syntax as well.

But wait!

HSQLDB 2.0 is one of the most standards-compliant databases out there, could they really have gotten it wrong? Let’s consider the standard: SQL 1992, chapter 8.4 <in predicate>:

<in predicate> ::=
   <row value constructor>
      [ NOT ] IN <in predicate value>

<in predicate value> ::=
   <table subquery>
      | <left paren> <in value list> <right paren>

<in value list> ::=
   <value expression> { <comma> <value expression> }...

 

And then, further down:

2) Let RVC be the <row value constructor> and 
   let IPV be the <in predicate value>.

3) The expression
     RVC NOT IN IPV

   is equivalent to
     NOT ( RVC IN IPV )

4) The expression
     RVC IN IPV

   is equivalent to
     RVC = ANY IPV

 

So in fact, this can be said:

ID NOT IN (1, NULL) is equivalent to
NOT (ID IN (1, NULL)), equivalent to
NOT (ID = ANY(1, NULL)), equivalent to
NOT (ID = 1 OR ID = NULL), equivalent to
NOT (ID = 1) AND NOT (ID = NULL), which is always UNKNOWN

Conclusion

It looks for once, that HSQLDB 2.0 is not standards-compliant in that evaluating the expression inside NOT() before applying NOT() has a different outcome from transforming NOT() into a normalised boolean expression, and then evaluating the expression. For SQL developers, all of this can just mean:

Keep NULL out of NOT IN predicates or be doomed!