DSL.using(configuration)
   .select(PLAYERS.PLAYER_ID,
           PLAYERS.FIRST_NAME,
           PLAYERS.LAST_NAME,
           PLAYERS.SCORE)
   .from(PLAYERS)
   .where(PLAYERS.GAME_ID.eq(42))
   .orderBy(PLAYERS.SCORE.desc(),
            PLAYERS.PLAYER_ID.asc())
   .seek(949, 15) // This jumps to the tuple (949, 15)
   .limit(10)
   .fetch();
Let me explain… It wasn’t until the recent SQL:2008 standard that what MySQL users know as LIMIT .. OFFSET was standardised into the following simple statement:
SELECT * FROM BOOK OFFSET 2 ROWS FETCH NEXT 1 ROWS ONLY
We prefer the more concise LIMIT .. OFFSET clause, which is why we chose that for the jOOQ DSL API. In SQL:
SELECT * FROM BOOK LIMIT 1 OFFSET 2
-- MySQL, H2, HSQLDB, Postgres, and SQLite
SELECT * FROM BOOK LIMIT 1 OFFSET 2

-- CUBRID supports a MySQL variant of the
-- LIMIT .. OFFSET clause
SELECT * FROM BOOK LIMIT 2, 1

-- Derby, SQL Server 2012, Oracle 12, SQL:2008
SELECT * FROM BOOK OFFSET 2 ROWS FETCH NEXT 1 ROWS ONLY

-- Ingres. Eek, almost the standard. Almost!
SELECT * FROM BOOK OFFSET 2 FETCH FIRST 1 ROWS ONLY

-- Firebird (ROWS m TO n is 1-based and inclusive)
SELECT * FROM BOOK ROWS 3 TO 3

-- Sybase SQL Anywhere
SELECT TOP 1 ROWS START AT 3 * FROM BOOK

-- DB2 (without OFFSET)
SELECT * FROM BOOK FETCH FIRST 1 ROWS ONLY

-- Sybase ASE, SQL Server 2008 (without OFFSET)
SELECT TOP 1 * FROM BOOK
Some of these dialects, such as Sybase and SQL Server, put the TOP clause before the SELECT list. This is easy to emulate. Now what about:
- Oracle 11g and less
- SQL Server 2008 and less
- DB2 with OFFSET
There are many ways to emulate OFFSET .. FETCH in those older databases. The optimal solutions always involve:
- Using doubly-nested derived tables with ROWNUM filtering in Oracle
- Using single-nested derived tables with ROW_NUMBER() filtering in SQL Server and DB2
Do you think you will get it right? ;-)

Let us go through a couple of issues that you may not have thought about. First off, Oracle. Oracle obviously wanted to create a maximum vendor lock-in, which is only exceeded by Apple’s recent introduction of Swift. This is why ROWNUM solutions perform best, even better than SQL:2003 standard window function based solutions. Don’t believe it? Read this very interesting article on Oracle offset pagination performance. So, the optimal solution in Oracle is:
-- PostgreSQL syntax:
SELECT ID, TITLE
FROM BOOK
LIMIT 1 OFFSET 2

-- Oracle equivalent:
SELECT *
FROM (
  SELECT b.*, ROWNUM rn
  FROM (
    SELECT ID, TITLE
    FROM BOOK
  ) b
  WHERE ROWNUM <= 3 -- (1 + 2)
)
WHERE rn > 2
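The arithmetic in the Oracle query generalises: the inner filter keeps ROWNUM <= offset + limit, and the outer filter keeps rn > offset. Here is a minimal sketch of such a naive wrapper, written in Python purely for illustration. The helper name oracle_limit_offset is hypothetical and not part of jOOQ or any other library, and the sketch assumes the wrapped query has unambiguous column names:

```python
def oracle_limit_offset(query: str, limit: int, offset: int) -> str:
    """Wrap `query` in doubly-nested derived tables with ROWNUM filtering.

    Hypothetical helper for illustration only: a naive string wrapper
    that adds an extra `rn` column to the result.
    """
    if limit < 0 or offset < 0:
        raise ValueError("limit and offset must be non-negative")
    return (
        "SELECT * FROM ("
        " SELECT b.*, ROWNUM rn FROM ("
        f" {query}"
        " ) b"
        f" WHERE ROWNUM <= {offset + limit}"  # upper bound: offset + limit
        " )"
        f" WHERE rn > {offset}"               # lower bound: offset
    )


sql = oracle_limit_offset("SELECT ID, TITLE FROM BOOK", limit=1, offset=2)
print(sql)
```

The extra rn column this wrapper leaks into the result is exactly the kind of detail a home-grown emulation tends to overlook.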
So that’s really the equivalent?

Of course not. You’re selecting an additional column, the rn column. You might just not care in most cases, but what if you wanted to make a limited subquery to be used with an IN predicate?
-- PostgreSQL syntax:
SELECT *
FROM BOOK
WHERE AUTHOR_ID IN (
  SELECT ID
  FROM AUTHOR
  LIMIT 1 OFFSET 2
)

-- Oracle equivalent:
SELECT *
FROM BOOK
WHERE AUTHOR_ID IN (
  SELECT * -- Ouch. These are two columns!
  FROM (
    SELECT b.*, ROWNUM rn
    FROM (
      SELECT ID
      FROM AUTHOR
    ) b
    WHERE ROWNUM <= 3
  )
  WHERE rn > 2
)
If you’re emulating LIMIT .. OFFSET manually, then you might just patch the ID column into the subquery:
SELECT *
FROM BOOK
WHERE AUTHOR_ID IN (
  SELECT ID -- better
  FROM (
    SELECT b.ID, ROWNUM rn -- better
    FROM (
      SELECT ID
      FROM AUTHOR
    ) b
    WHERE ROWNUM <= 3
  )
  WHERE rn > 2
)
So now, it is correct?

Of course not! Because you can have ambiguous column names in top-level SELECTs, but not in nested selects. What if you want to do this:
-- PostgreSQL syntax:
-- Perfectly valid repetition of two ID columns
SELECT BOOK.ID, AUTHOR.ID
FROM BOOK
JOIN AUTHOR
ON BOOK.AUTHOR_ID = AUTHOR.ID
LIMIT 1 OFFSET 2

-- Oracle equivalent:
SELECT *
FROM (
  SELECT b.*, ROWNUM rn
  FROM (
    -- Ouch! ORA-00918: column ambiguously defined
    SELECT BOOK.ID, AUTHOR.ID
    FROM BOOK
    JOIN AUTHOR
    ON BOOK.AUTHOR_ID = AUTHOR.ID
  ) b
  WHERE ROWNUM <= 3
)
WHERE rn > 2
This fails because the derived table contains two ID instances. And renaming the columns to random values is nasty, because the user of your home-grown in-house database framework wants to receive well-defined column names, i.e. ID. So, the solution is to rename the columns twice. Once in each derived table:
-- Oracle equivalent:
-- Rename synthetic column names back to original
SELECT c1 ID, c2 ID
FROM (
  SELECT b.c1, b.c2, ROWNUM rn
  FROM (
    -- synthetic column names here
    SELECT BOOK.ID c1, AUTHOR.ID c2
    FROM BOOK
    JOIN AUTHOR
    ON BOOK.AUTHOR_ID = AUTHOR.ID
  ) b
  WHERE ROWNUM <= 3
)
WHERE rn > 2
But now, we’re done?

Of course not! What if you doubly nest such a query? Will you think about doubly renaming ID columns to synthetic names, and back? … ;-) Let’s leave it here and talk about something entirely different:

Does the same thing work for SQL Server 2008?

Of course not! In SQL Server 2008, the most popular approach is to use window functions. Namely, ROW_NUMBER(). So, let’s consider:
-- PostgreSQL syntax:
SELECT ID, TITLE
FROM BOOK
LIMIT 1 OFFSET 2

-- SQL Server equivalent:
SELECT b.*
FROM (
  SELECT ID, TITLE,
    ROW_NUMBER() OVER (ORDER BY ID) rn
  FROM BOOK
) b
WHERE rn > 2 AND rn <= 3
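For the simple case, the ROW_NUMBER() emulation can be checked against native LIMIT .. OFFSET. The following is a small sketch using Python’s built-in sqlite3 module as a stand-in for SQL Server. It assumes a SQLite version with window function support (3.25 or later), and the sample titles are made up for illustration:

```python
import sqlite3

# Hypothetical BOOK table with sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO book (id, title) VALUES (?, ?)",
    [(1, "1984"), (2, "Animal Farm"), (3, "O Alquimista"), (4, "Brida")],
)

# Native syntax: skip 2 rows, fetch 1.
native = conn.execute(
    "SELECT id, title FROM book ORDER BY id LIMIT 1 OFFSET 2"
).fetchall()

# Emulation: number the rows, then keep those in (offset, offset + limit].
emulated = conn.execute(
    """
    SELECT b.id, b.title
    FROM (
      SELECT id, title,
             ROW_NUMBER() OVER (ORDER BY id) AS rn
      FROM book
    ) b
    WHERE rn > 2 AND rn <= 3
    """
).fetchall()

print(native, emulated)
```

Both queries return the third book here, but only because the query is as simple as it gets: one table, unique column names, an explicit ordering column.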
So that’s it, right?

Of course not! ;-) OK, we’ve already had this issue. We should not select *, because that would generate too many columns in the case that we’re using this as a subquery for an IN predicate. So let’s consider the correct solution with synthetic column names:
-- SQL Server equivalent:
SELECT b.c1 ID, b.c2 TITLE
FROM (
  SELECT ID c1, TITLE c2,
    ROW_NUMBER() OVER (ORDER BY ID) rn
  FROM BOOK
) b
WHERE rn > 2 AND rn <= 3
But now we got it, right?

Make an educated guess: Nope! What happens if you add an ORDER BY clause to the original query?
-- PostgreSQL syntax:
SELECT ID, TITLE
FROM BOOK
ORDER BY SOME_COLUMN
LIMIT 1 OFFSET 2

-- Naive SQL Server equivalent:
SELECT b.c1 ID, b.c2 TITLE
FROM (
  SELECT ID c1, TITLE c2,
    ROW_NUMBER() OVER (ORDER BY ID) rn
  FROM BOOK
  ORDER BY SOME_COLUMN
) b
WHERE rn > 2 AND rn <= 3
Now, that doesn’t work in SQL Server. Subqueries are not allowed to have an ORDER BY clause, unless they also have a TOP clause (or an OFFSET .. FETCH clause in SQL Server 2012). OK, we can probably tweak this using TOP 100 PERCENT to make SQL Server happy.
-- Better SQL Server equivalent:
SELECT b.c1 ID, b.c2 TITLE
FROM (
  SELECT TOP 100 PERCENT
    ID c1, TITLE c2,
    ROW_NUMBER() OVER (ORDER BY ID) rn
  FROM BOOK
  ORDER BY SOME_COLUMN
) b
WHERE rn > 2 AND rn <= 3
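That makes SQL Server happy, but the ordering of a derived table is not guaranteed to carry over to the outer query. If you wanted to order by SOME_COLUMN in the outer query, you’d have to again transform the SQL statement to add another synthetic column: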
-- Better SQL Server equivalent:
SELECT b.c1 ID, b.c2 TITLE
FROM (
  SELECT TOP 100 PERCENT
    ID c1, TITLE c2,
    SOME_COLUMN c99,
    ROW_NUMBER() OVER (ORDER BY ID) rn
  FROM BOOK
) b
WHERE rn > 2 AND rn <= 3
ORDER BY b.c99
This is the correct solution!

Of course not! What if the original query had DISTINCT in it?
-- PostgreSQL syntax:
SELECT DISTINCT AUTHOR_ID
FROM BOOK
LIMIT 1 OFFSET 2

-- Naive SQL Server equivalent:
SELECT b.c1 AUTHOR_ID
FROM (
  SELECT DISTINCT AUTHOR_ID c1,
    ROW_NUMBER() OVER (ORDER BY AUTHOR_ID) rn
  FROM BOOK
) b
WHERE rn > 2 AND rn <= 3
If an author has written several books, their AUTHOR_ID appears several times in BOOK. The DISTINCT keyword should remove such duplicates, and indeed, the PostgreSQL query will correctly remove duplicates first, and then apply OFFSET. However, the ROW_NUMBER() function always generates distinct row numbers before DISTINCT can remove them again. In other words, DISTINCT has no effect. Luckily, we can tweak this SQL again, using this neat little trick:
-- Better SQL Server equivalent:
SELECT b.c1 AUTHOR_ID
FROM (
  SELECT DISTINCT AUTHOR_ID c1,
    DENSE_RANK() OVER (ORDER BY AUTHOR_ID) rn
  FROM BOOK
) b
WHERE rn > 2 AND rn <= 3
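The difference between the two window functions is easy to observe on a tiny data set. The sketch below again uses Python’s sqlite3 module as a stand-in for SQL Server (assuming SQLite 3.25 or later for window functions). The table contents are hypothetical: authors 1 and 3 have two books each.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE book (id INTEGER PRIMARY KEY, author_id INTEGER)")
# Hypothetical data: authors 1 and 3 have written two books each.
conn.executemany(
    "INSERT INTO book (author_id) VALUES (?)",
    [(1,), (1,), (2,), (3,), (3,), (4,)],
)


def run(sql):
    """Return the first column of every row produced by `sql`."""
    return [row[0] for row in conn.execute(sql)]


# Reference result: distinct author IDs are 1, 2, 3, 4, so skipping 2 yields 3.
native = run(
    "SELECT DISTINCT author_id FROM book ORDER BY author_id LIMIT 1 OFFSET 2"
)

# ROW_NUMBER() numbers all six rows before DISTINCT runs, so DISTINCT is a no-op.
naive = run("""
    SELECT b.c1
    FROM (
      SELECT DISTINCT author_id AS c1,
             ROW_NUMBER() OVER (ORDER BY author_id) AS rn
      FROM book
    ) b
    WHERE rn > 2 AND rn <= 3
""")

# DENSE_RANK() gives equal values the same rank, so DISTINCT can collapse them.
better = run("""
    SELECT b.c1
    FROM (
      SELECT DISTINCT author_id AS c1,
             DENSE_RANK() OVER (ORDER BY author_id) AS rn
      FROM book
    ) b
    WHERE rn > 2 AND rn <= 3
""")

print(native, naive, better)
```

With this data, the ROW_NUMBER() variant returns author 2 (the third row), while the DENSE_RANK() variant agrees with the native query and returns author 3 (the third distinct value).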
Note that with DENSE_RANK(), the window function’s ORDER BY clause must contain all columns from the SELECT field list. Obviously, this will limit the acceptable columns in the SELECT DISTINCT field list to columns that are allowed in a window function’s ORDER BY clause (e.g. no other window functions). We could of course try to fix that as well using common table expressions, or we consider…
Yet another issue??

Yes, of course! Do you even know what the column(s) in the window function’s ORDER BY clause should be? Have you just picked any column, at random? What if that column doesn’t have an index on it, will your window function still perform? The answer is easy when your original SELECT statement also has an ORDER BY clause: then you should probably take that one (plus all the columns from the SELECT DISTINCT clause if applicable). But what if you don’t have any ORDER BY clause? Yet another trick! Use a “constant” variable:
-- Better SQL Server equivalent:
SELECT b.c1 AUTHOR_ID
FROM (
  SELECT AUTHOR_ID c1,
    ROW_NUMBER() OVER (ORDER BY @@version) rn
  FROM BOOK
) b
WHERE rn > 2 AND rn <= 3
Why @@version? Because constants are not allowed in window functions’ ORDER BY clauses in SQL Server. Painful, I know. Read more about this @@version trick here.
Are we done yet!?!?

Probably not ;-) But we have probably covered around 99% of the common and edge cases. We can sleep nicely, now. Note that all of these SQL transformations are implemented in jOOQ. jOOQ is the only SQL abstraction framework that takes SQL seriously (with all its warts and caveats), standardising over all of this madness. As mentioned in the beginning, with jOOQ, you just write:
// Don't worry about general emulation
select().from(BOOK).limit(1).offset(2);

// Don't worry about duplicate column names
// in subselects
select(BOOK.ID, AUTHOR.ID)
.from(BOOK)
.join(AUTHOR)
.on(BOOK.AUTHOR_ID.eq(AUTHOR.ID))
.limit(1).offset(2);

// Don't worry about invalid IN predicates
select()
.from(BOOK)
.where(BOOK.AUTHOR_ID.in(
    select(AUTHOR.ID)
    .from(AUTHOR)
    .limit(1).offset(2)
));

// Don't worry about the ROW_NUMBER() vs.
// DENSE_RANK() distinction
selectDistinct(BOOK.AUTHOR_ID)
.from(BOOK).limit(1).offset(2);
Keyset paging

Now, of course, if you have been reading our blog, or our partner blog SQL Performance Explained, you should know by now that OFFSET pagination is often a bad choice in the first place. You should know that keyset pagination almost always outperforms OFFSET pagination. Read about how jOOQ natively supports keyset pagination using the SEEK clause, here.
-- Get arbitrarily numbered row_numbers
SELECT ROW_NUMBER() OVER ()

-- Skip arbitrary rows
SELECT a
FROM (VALUES (1), (2), (3), (4)) t(a)
OFFSET 3 ROWS
SQL Server rejects both statements, the argument being that such ROW_NUMBER() and OFFSET expressions are non-deterministic. Two subsequent executions of the same query might produce different results. But then again, any ORDER BY clause is non-deterministic, if you do not order by a strictly UNIQUE expression, such as a primary key. So, that’s a bit of a pain, because other databases aren’t that strict and after all, you might just not care about explicit ordering for a quick, ad-hoc query, so a “reasonable”, lenient default would be useful.
Constant ORDER BY clauses don’t work

You cannot add a constant ORDER BY clause to window functions either. I.e.:
-- This doesn't work:
SELECT ROW_NUMBER() OVER (ORDER BY 'a')

-- But this does!
SELECT a
FROM (VALUES (1), (2), (3), (4)) t(a)
ORDER BY 'a'
OFFSET 3 ROWS
Note that ORDER BY 'a' uses a constant VARCHAR expression, not a numeric one, as a numeric constant would be interpreted as a column reference by index, which would be non-constant in the second example.
Random column references don’t work

So you’re thinking that you can just add a random column reference? Sometimes you can, but often you cannot:
-- This doesn't work:
SELECT ROW_NUMBER() OVER (
  ORDER BY [no-column-available-here]
)

-- But this does!
SELECT a
FROM (VALUES (1), (2), (3), (4)) t(a)
ORDER BY a
OFFSET 3 ROWS
The first example doesn’t work because there is no column available to the ROW_NUMBER() function. At the same time, you can write ORDER BY a in the second example, but only if a is a “comparable” value, i.e. not a LOB, such as image. Besides, as we don’t really care about the actual ordering, is it worth ordering the result set by anything at all? Do you happen to have an index on a?
Quasi-constant ORDER BY expressions do work

So, to stay on the safe side, if ever you need a dummy ORDER BY expression in SQL Server, use a quasi-constant expression, like @@version (or @@language, or any of these). The following will always work:
-- This always works:
SELECT ROW_NUMBER() OVER (ORDER BY @@version)

-- So does this:
SELECT a
FROM (VALUES (1), (2), (3), (4)) t(a)
ORDER BY @@version
OFFSET 3 ROWS
jOOQ also generates such synthetic ORDER BY clauses, which will help you simplify writing vendor-agnostic SQL in these edge cases, as we believe that you simply shouldn’t have to think of these things all the time.
Slow OFFSET

In order to understand the Seek Method, let’s first understand what problem it solves: SQL OFFSET clauses are slow. They’re slow for a simple reason. In order to reach a high offset from a result set, all previous records have to be skipped and counted. While a query with no OFFSET can be very fast (using MySQL syntax):
SELECT first_name, last_name, score
FROM players
WHERE game_id = 42
ORDER BY score DESC
LIMIT 10;

Skipping 100’000 records first will be much slower:

SELECT first_name, last_name, score
FROM players
WHERE game_id = 42
ORDER BY score DESC
LIMIT 10
OFFSET 100000;
Even if the tuple (game_id, score) is indexed, we’ll have to actually traverse the whole index up to the offset in order to count how many records we’ve already skipped. While this problem can be somewhat lessened by a trick, joining players to a derived table, there is an alternative, much faster approach to tackling paging: the Seek Method.
The Seek Method

While it is not quite clear who originally invented the Seek Method (some also call it “keyset paging”), a very prominent advocate for it is Markus Winand. He describes the Seek Method on his blog (and in his book): http://use-the-index-luke.com/sql/partial-results/fetch-next-page

Essentially, the Seek Method does not skip records before an OFFSET, but it skips records until the last record previously fetched. Think about paging on Google. From a usability point of view, you hardly ever skip exactly 100’000 records. You mostly want to skip to the next page and then again, to the next page, i.e. just past the last record / search result previously fetched. Take the following top 10 players (fake names generated with name generator):
first_name | last_name | score
------------------------------
Mary       | Paige     |  1098
Tracey     | Howard    |  1087
Jasmine    | Butler    |  1053
Zoe        | Piper     |  1002
Leonard    | Peters    |   983
Jonathan   | Hart      |   978
Adam       | Morrison  |   976
Amanda     | Gibson    |   967
Alison     | Wright    |   958
Jack       | Harris    |   949

The above are the first 10 players ordered by score. This can be achieved quite quickly using LIMIT 10 only. Now, when skipping to the next page, you can either just use an OFFSET 10 clause, or you skip all users with a score of 949 or higher:
SELECT first_name, last_name, score
FROM players
WHERE game_id = 42
-- Let's call this the "seek predicate"
AND score < 949
ORDER BY score DESC
LIMIT 10;
This yields the players on the next page:

first_name | last_name | score
------------------------------
William    | Fraser    |   947
Claire     | King      |   945
Jessica    | McDonald  |   932
...        | ...       |   ...

Note that the previous query assumes that the score is unique within the players table, which is unlikely, of course. If William Fraser also had 949 points, just as Jack Harris, the last player on the first page, he would be “lost between pages”. It is thus important to create a non-ambiguous ORDER BY clause and “seek predicate”, by adding an additional unique column:
SELECT player_id, first_name, last_name, score
FROM players
WHERE game_id = 42
-- assuming 15 is Jack Harris's player_id
AND (score, player_id) < (949, 15)
ORDER BY score DESC, player_id DESC
LIMIT 10;
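The “lost between pages” effect can be reproduced with a handful of rows. The following sketch uses Python’s sqlite3 module and assumes a SQLite version that supports row value comparisons (3.15 or later). The page size is reduced to 3, and the player IDs and the tie between Harris and Fraser are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE players ("
    " player_id INTEGER PRIMARY KEY,"
    " game_id INTEGER, last_name TEXT, score INTEGER)"
)
conn.executemany(
    "INSERT INTO players VALUES (?, ?, ?, ?)",
    [
        (11, 42, "Gibson", 967),
        (12, 42, "Wright", 958),
        (15, 42, "Harris", 949),
        (14, 42, "Fraser", 949),  # hypothetical tie with Harris
        (13, 42, "King", 945),
    ],
)


def names(sql, params=()):
    return [row[0] for row in conn.execute(sql, params)]


# Page 1: the last row is Harris, i.e. the tuple (949, 15).
page1 = names(
    "SELECT last_name FROM players WHERE game_id = 42"
    " ORDER BY score DESC, player_id DESC LIMIT 3"
)

# Seeking on score alone skips Fraser, who shares Harris's score.
naive_page2 = names(
    "SELECT last_name FROM players WHERE game_id = 42 AND score < ?"
    " ORDER BY score DESC, player_id DESC LIMIT 3",
    (949,),
)

# Seeking on the unique (score, player_id) tuple keeps him.
seek_page2 = names(
    "SELECT last_name FROM players WHERE game_id = 42"
    " AND (score, player_id) < (?, ?)"
    " ORDER BY score DESC, player_id DESC LIMIT 3",
    (949, 15),
)

print(page1, naive_page2, seek_page2)
```

With the score-only predicate, Fraser appears on neither page. With the row value predicate, page 2 starts with Fraser, directly after Harris.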
Note that the “seek predicate” depends on the ORDER BY clause. Here are a couple of possible, alternative configurations:
-- "consistent" ASC and DESC correspond to > and <
AND (score, player_id) > (949, 15)
ORDER BY score ASC, player_id ASC

-- "mixed" ASC and DESC complicate things a bit
AND ((score < 949)
  OR (score = 949 AND player_id > 15))
ORDER BY score DESC, player_id ASC

-- The above might be further performance-tweaked
AND (score <= 949)
AND ((score < 949)
  OR (score = 949 AND player_id > 15))
ORDER BY score DESC, player_id ASC
If columns in the ORDER BY clause are nullable, NULLS LAST might apply and further complicate the “seek predicate”.
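The expanded predicate for “mixed” sort directions follows a mechanical pattern: for each sort column in turn, require equality on all preceding columns and a strict comparison on the current one, then combine the alternatives with OR. Here is a minimal sketch of that pattern in Python. The function seek_predicate is hypothetical (this is not how jOOQ is implemented), it ignores nullable columns entirely, and it expects column names to be trusted identifiers rather than user input:

```python
def seek_predicate(columns, descending, values):
    """Return (sql, params) matching the rows strictly after `values`.

    columns    -- sort columns, in ORDER BY order (trusted identifiers)
    descending -- one boolean per column: True for DESC, False for ASC
    values     -- the sort column values of the last row already fetched
    """
    if not (len(columns) == len(descending) == len(values)):
        raise ValueError("columns, descending and values must have equal length")

    alternatives, params = [], []
    for i, column in enumerate(columns):
        # Equal on all preceding sort columns ...
        terms = [f"{previous} = ?" for previous in columns[:i]]
        params.extend(values[:i])
        # ... and strictly "after" on the current one.
        operator = "<" if descending[i] else ">"
        terms.append(f"{column} {operator} ?")
        params.append(values[i])
        alternatives.append("(" + " AND ".join(terms) + ")")

    return "(" + " OR ".join(alternatives) + ")", params


# ORDER BY score DESC, player_id ASC, last row fetched: (949, 15)
sql, params = seek_predicate(["score", "player_id"], [True, False], [949, 15])
print(sql, params)
```

For the example call, this produces the same “mixed” predicate as written out by hand above, with bind variables in place of the literals.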
How is this better than OFFSET?

The Seek Method allows for avoiding expensive “skip-and-count” operations, replacing them with a simple range scan on an index that might cover the “seek predicate”. Since you’re applying ORDER BY on the columns of the “seek predicate” anyway, you might have already chosen to index them appropriately. While the Seek Method doesn’t improve queries for low page numbers, fetching higher page numbers is significantly faster, as shown in this nice benchmark.

More interesting feedback on the subject can be found in this reddit.com thread, in which even Tom Kyte himself added a couple of remarks.
A side effect of the Seek Method

A side effect of the Seek Method is the fact that the paging is more “stable”. When you’re about to display page 2 and a new player has reached page 1 in the meantime, or if any player is removed entirely, you will still display the same players on page 2. In other words, when using the Seek Method, there is no guarantee that the first player on page 2 has rank 11. This may or may not be desired. It might be irrelevant on page 10’000, though.
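This stability is easy to observe. In the following sketch (Python’s sqlite3 module, assuming SQLite 3.15 or later for row value comparisons), a hypothetical new player enters page 1 after the first page has already been fetched. The table contents and the page size of 3 are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (player_id INTEGER PRIMARY KEY, score INTEGER)")
conn.executemany(
    "INSERT INTO players VALUES (?, ?)",
    [(1, 100), (2, 90), (3, 80), (4, 70), (5, 60), (6, 50)],
)

ORDER = " ORDER BY score DESC, player_id DESC"

# Page 1 is the same for both approaches; remember its last (score, player_id).
page1 = conn.execute(
    "SELECT player_id, score FROM players" + ORDER + " LIMIT 3"
).fetchall()
last_id, last_score = page1[-1]

# In the meantime, a new player makes it onto page 1.
conn.execute("INSERT INTO players VALUES (7, 95)")

# OFFSET counts rows again, so the last row of page 1 shows up a second time.
offset_page2 = [row[0] for row in conn.execute(
    "SELECT player_id FROM players" + ORDER + " LIMIT 3 OFFSET 3"
)]

# The Seek Method continues just past the last row previously fetched.
seek_page2 = [row[0] for row in conn.execute(
    "SELECT player_id FROM players WHERE (score, player_id) < (?, ?)"
    + ORDER + " LIMIT 3",
    (last_score, last_id),
)]

print(offset_page2, seek_page2)
```

The OFFSET query repeats player 3 on page 2, whereas the seek query continues with players 4, 5 and 6, even though player 4 no longer has rank 4.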
jOOQ 3.3 support for the Seek Method

The upcoming jOOQ 3.3 (due for late 2013) will include support for the Seek Method on a SQL DSL API level. In addition to jOOQ’s existing LIMIT .. OFFSET support, a “seek predicate” can then be specified through the synthetic SEEK clause (similar to jOOQ’s other synthetic clauses):
DSL.using(configuration)
   .select(PLAYERS.PLAYER_ID,
           PLAYERS.FIRST_NAME,
           PLAYERS.LAST_NAME,
           PLAYERS.SCORE)
   .from(PLAYERS)
   .where(PLAYERS.GAME_ID.eq(42))
   .orderBy(PLAYERS.SCORE.desc(),
            PLAYERS.PLAYER_ID.asc())
   .seek(949, 15) // (!)
   .limit(10)
   .fetch();
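Instead of explicitly phrasing the “seek predicate”, you pass the values of the last record previously fetched, and jOOQ skips all records up to and including that record, given the ORDER BY clause. This appears much more readable than the actual SQL rendered because the “seek predicate” is closer to the ORDER BY clause where it belongs. Also, jOOQ’s usual row value typesafety is applied here, helping you find the right degree / arity and data types for your SEEK clause. In the above example, the following method calls would not compile in Java: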
// Not enough arguments in seek()
.orderBy(PLAYERS.SCORE.desc(),
         PLAYERS.PLAYER_ID.asc())
.seek(949)

// Wrong argument types in seek()
.orderBy(PLAYERS.SCORE.desc(),
         PLAYERS.PLAYER_ID.asc())
.seek(949, "abc")
Get to work with the Seek Method

With native API support for a SEEK clause, you can get in control of your SQL again and implement high-performing SQL quite easily. Early adopters can already play around with the current state of jOOQ’s 3.3.0 Open Source Edition, which is available on GitHub. And even if you don’t use jOOQ, give the Seek Method a try. You may just have a much faster application afterwards!
While looking for some authoritative information about Sybase SQL Anywhere 12’s
TOP .. START AT clause, I stumbled upon this hilarious white paper here, which I do not want to keep from you:
I will take advantage of “fair use policy” and cite parts from section 7:
Feature number 7: improved support for DaffySQL syntax
If I told you that RowGenerator.row_num contains the values 1 through 255, what would you say this query returned?
Give up? OK, how about this one?
Still stumped? If I told you they both returned exactly the same result set as the following query, what would you say?
Yes, the LIMIT clause is new to SQL Anywhere 12, exactly the same as TOP START AT except it uses zero as the starting point for numbering rows instead of 1.
An “offset”, get it?
As in “Here’s ten dollars, let me count it for you: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9.”
Why implement LIMIT? And why include it in a list of cool features?
Because there are a lot of MySQL users out there who don’t have TOP START AT, and they’ve written zillions of queries using LIMIT, and they’d like to migrate their apps to SQL Anywhere without rewriting everything. And PostgreSQL users too… welcome aboard!
Migrating to SQL Anywhere is definitely cool.
So be cool and migrate to SQL Anywhere already! :-) I’m now going through the rest of this fun document.