Faster SQL Pagination with jOOQ Using the Seek Method

Last week, I blogged about why it is important to stay in control of your SQL, as writing good SQL helps keep your operational costs down. This is true in many ways, and today we’re going to look into another way to write good, high-performing SQL: using the “Seek Method”.

Slow OFFSET

In order to understand the Seek Method, let’s first understand what problem it solves: SQL OFFSET clauses are slow. They’re slow for a simple reason. In order to reach a high offset from a result set, all previous records have to be skipped and counted. While a query with no OFFSET can be very fast (using MySQL syntax):

SELECT first_name, last_name, score
FROM players
WHERE game_id = 42
ORDER BY score DESC
LIMIT 10;

Skipping to page number 10’000 will be much slower:

SELECT first_name, last_name, score
FROM players
WHERE game_id = 42
ORDER BY score DESC
LIMIT 10
OFFSET 100000;

Even if the tuple (game_id, score) is indexed, the database still has to traverse the index to count the records being skipped. While this problem can be somewhat lessened by a trick (joining players to a derived table that does the skipping on a narrow index only), there is an alternative, much faster approach to paging: the Seek Method.
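The derived-table trick mentioned above (sometimes called a “deferred join”) can be sketched as follows. This is an illustration only, using Python’s sqlite3 module and made-up data shaped like the players table from this article; the actual performance benefit depends on your database and indexes:

```python
import sqlite3

# Illustration only: in-memory data shaped like the article's players table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE players (player_id INTEGER PRIMARY KEY, game_id INT, score INT)")
conn.executemany(
    "INSERT INTO players (game_id, score) VALUES (?, ?)",
    [(42, s) for s in range(1000)])

# Naive OFFSET: the engine produces and discards 500 rows first
naive = conn.execute("""
    SELECT player_id, score FROM players
    WHERE game_id = 42
    ORDER BY score DESC
    LIMIT 10 OFFSET 500""").fetchall()

# Deferred join: skip rows in a derived table that only touches the
# (narrow, ideally indexed) key columns, then fetch the full rows
deferred = conn.execute("""
    SELECT p.player_id, p.score
    FROM players p
    JOIN (SELECT player_id FROM players
          WHERE game_id = 42
          ORDER BY score DESC
          LIMIT 10 OFFSET 500) t
    ON p.player_id = t.player_id
    ORDER BY p.score DESC""").fetchall()

assert naive == deferred
```

The inner query still skips 500 rows, but only over the key columns; the outer join then fetches the full rows for just the ten surviving keys.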

The Seek Method

While it is not quite clear who originally invented the Seek Method (some also call it “keyset paging”), a very prominent advocate of it is Markus Winand. He describes the Seek Method on his blog (and in his book): http://use-the-index-luke.com/sql/partial-results/fetch-next-page

Essentially, the Seek Method does not skip records before an OFFSET; it skips all records up to and including the last record previously fetched. Think about paging on Google: from a usability point of view, you hardly ever skip exactly 100’000 records. You mostly want to skip to the next page, and then again to the next page, i.e. just past the last record / search result previously fetched. Take the following top 10 players (fake names generated with a name generator):
first_name | last_name | score
------------------------------
Mary       | Paige     |  1098
Tracey     | Howard    |  1087
Jasmine    | Butler    |  1053
Zoe        | Piper     |  1002
Leonard    | Peters    |   983
Jonathan   | Hart      |   978
Adam       | Morrison  |   976
Amanda     | Gibson    |   967
Alison     | Wright    |   958
Jack       | Harris    |   949
The above are the first 10 players, ordered by score. This page can be fetched quite quickly using LIMIT 10 only. Now, when skipping to the next page, you can either use an OFFSET 10 clause, or you can fetch only those players with a score lower than 949:

SELECT first_name, last_name, score
FROM players
WHERE game_id = 42
-- Let's call this the "seek predicate"
AND score < 949
ORDER BY score DESC
LIMIT 10;

This will then give you the players on the next page:
first_name | last_name | score
------------------------------
William    | Fraser    |   947
Claire     | King      |   945
Jessica    | McDonald  |   932
...        | ...       |   ...
Note that the previous query assumes that the score is unique within the players table, which is unlikely, of course. If William Fraser also had 949 points, just like Jack Harris, the last player on the first page, he would be “lost between pages”. It is thus important to create an unambiguous ORDER BY clause and “seek predicate” by adding an additional, unique column:

SELECT player_id, first_name, last_name, score
FROM players
WHERE game_id = 42
-- assuming 15 is Jack Harris's player_id
AND (score, player_id) < (949, 15)
ORDER BY score DESC, player_id DESC
LIMIT 10;
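To see that this seek predicate really selects the same second page as OFFSET would, here is a small sketch using Python’s sqlite3 module (row value comparisons require SQLite 3.15 or later). The data is made up, with deliberately duplicated scores so that the player_id tie-breaker matters:

```python
import sqlite3

# Illustration only: made-up data with deliberate score ties.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE players (player_id INTEGER PRIMARY KEY, game_id INT, score INT)")
conn.executemany(
    "INSERT INTO players (game_id, score) VALUES (42, ?)",
    [(s // 3,) for s in range(60)])  # every score occurs three times

page1 = conn.execute("""
    SELECT player_id, score FROM players
    WHERE game_id = 42
    ORDER BY score DESC, player_id DESC
    LIMIT 10""").fetchall()
last_id, last_score = page1[-1]

# Page 2 via OFFSET ...
offset_page2 = conn.execute("""
    SELECT player_id, score FROM players
    WHERE game_id = 42
    ORDER BY score DESC, player_id DESC
    LIMIT 10 OFFSET 10""").fetchall()

# ... and page 2 via the "seek predicate" (SQLite 3.15+ row values)
seek_page2 = conn.execute("""
    SELECT player_id, score FROM players
    WHERE game_id = 42
    AND (score, player_id) < (?, ?)
    ORDER BY score DESC, player_id DESC
    LIMIT 10""", (last_score, last_id)).fetchall()

assert offset_page2 == seek_page2
```

Both queries return the same rows here; the difference is that the seek variant can start its index range scan directly at (last_score, last_id) instead of counting off ten rows first.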

Now, the “seek predicate” depends on the ORDER BY clause. Here are a couple of possible, alternative configurations:

-- "consistent" ASC and DESC correspond to > and <
AND (score, player_id) > (949, 15)
ORDER BY score ASC, player_id ASC

-- "mixed" ASC and DESC complicate things a bit
AND ((score < 949)
  OR (score = 949 AND player_id > 15))
ORDER BY score DESC, player_id ASC

-- The above might be further performance-tweaked
AND (score <= 949)
AND ((score < 949)
  OR (score = 949 AND player_id > 15))
ORDER BY score DESC, player_id ASC
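The “mixed” variant can be checked the same way. The following sketch, again using Python’s sqlite3 module with made-up data, verifies that the expanded OR predicate yields the same page as OFFSET under ORDER BY score DESC, player_id ASC:

```python
import sqlite3

# Illustration only: made-up data, duplicate scores on purpose.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE players (player_id INTEGER PRIMARY KEY, game_id INT, score INT)")
conn.executemany(
    "INSERT INTO players (game_id, score) VALUES (42, ?)",
    [(s // 4,) for s in range(40)])  # every score occurs four times

order_by = "ORDER BY score DESC, player_id ASC"

page1 = conn.execute(f"""
    SELECT player_id, score FROM players
    WHERE game_id = 42
    {order_by} LIMIT 5""").fetchall()
last_id, last_score = page1[-1]

# Mixed directions cannot use a single row value comparison,
# so the "seek predicate" is spelled out as an OR:
seek_page2 = conn.execute(f"""
    SELECT player_id, score FROM players
    WHERE game_id = 42
    AND (score < ? OR (score = ? AND player_id > ?))
    {order_by} LIMIT 5""", (last_score, last_score, last_id)).fetchall()

offset_page2 = conn.execute(f"""
    SELECT player_id, score FROM players
    WHERE game_id = 42
    {order_by} LIMIT 5 OFFSET 5""").fetchall()

assert seek_page2 == offset_page2
```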

If columns in the ORDER BY clause are nullable, NULLS FIRST and NULLS LAST might apply and further complicate the “seek predicate”.

How is this better than OFFSET?

The Seek Method avoids the expensive “skip-and-count” operation, replacing it with a simple range scan on an index that covers the “seek predicate”. Since you’re applying ORDER BY to the columns of the “seek predicate” anyway, you may well have indexed them appropriately already. While the Seek Method doesn’t improve queries for low page numbers, fetching higher page numbers is significantly faster, as shown in this nice benchmark:
Reproduced from use-the-index-luke.com with permission by Markus Winand
More interesting feedback on the subject can be found in this reddit.com thread, in which even Tom Kyte himself added a couple of remarks.

A side effect of the Seek Method

A side effect of the Seek Method is the fact that paging is more “stable”. When you’re about to display page 2 and a new player has reached page 1 in the meantime, or if any player is removed entirely, you will still display the same players on page 2. In other words, when using the Seek Method, there is no guarantee that the first player on page 2 has rank 11. This may or may not be desired. It is probably irrelevant on page 10’000, though.

jOOQ 3.3 support for the Seek Method

The upcoming jOOQ 3.3 (due for late 2013) will include support for the Seek Method on a SQL DSL API level. In addition to jOOQ’s existing LIMIT .. OFFSET support, a “seek predicate” can then be specified through the synthetic SEEK clause (similar to jOOQ’s synthetic DIVIDE BY clause):

DSL.using(configuration)
   .select(PLAYERS.PLAYER_ID,
           PLAYERS.FIRST_NAME,
           PLAYERS.LAST_NAME,
           PLAYERS.SCORE)
   .from(PLAYERS)
   .where(PLAYERS.GAME_ID.eq(42))
   .orderBy(PLAYERS.SCORE.desc(),
            PLAYERS.PLAYER_ID.asc())
   .seek(949, 15) // (!)
   .limit(10)
   .fetch();

Instead of explicitly phrasing the “seek predicate”, just pass the last record from the previous query, and jOOQ will ensure that all records before and including this record are skipped, given the ORDER BY clause. This reads much better than the actual SQL rendered, because the “seek predicate” sits closer to the ORDER BY clause, where it belongs. Also, jOOQ’s usual row value type safety is applied here, helping you find the right degree / arity and data types for your SEEK clause. In the above example, the following method calls would not compile in Java:

// Not enough arguments in seek()
   .orderBy(PLAYERS.SCORE.desc(),
            PLAYERS.PLAYER_ID.asc())
   .seek(949)

// Wrong argument types in seek()
   .orderBy(PLAYERS.SCORE.desc(),
            PLAYERS.PLAYER_ID.asc())
   .seek(949, "abc")

Get to work with the Seek Method

With native API support for a SEEK clause, you can take control of your SQL again and implement high-performing SQL quite easily. Early adopters can already play around with the current state of jOOQ’s 3.3.0 Open Source Edition, which is available on GitHub. And even if you don’t use jOOQ, give the Seek Method a try. You may just have a much faster application afterwards!

20 thoughts on “Faster SQL Pagination with jOOQ Using the Seek Method”

  1. Thanks for the explanation, it’s really useful!

    If you are coding an “infinite scroll” (like twitter), this is the best approach because you don’t get duplicate records when someone inserts a new record before you navigate to the next page.

  2. Lucas,

    Using this method to create a UI such as an infinite list, or as a “current page display with next/prev buttons”, is very straightforward and elegant.

    However, let us imagine a “true” pagination UI that has a list of page number links below/above the current page content.

    Somehow we need to get data for these links. Could you please explain what query is necessary to get the list of all “boundary” keys (i.e. the keys for the last row on the first page, the last row on the second page, …, the last row on the last page) for a given page size N?

    1. Keyset paging / the seek method is inadequate for “true” pagination where page numbers have any meaning. In fact, with keyset paging, it is perfectly possible that page number 1 is not the first page (any longer) if new records are added.

      If you think your underlying data set won’t change while you’re paginating, then you could use ROW_NUMBER() to calculate “boundary” keys (I guess that’s worth a follow-up blog post):

      -- Using Oracle syntax
      SELECT 
        t.*, 
        DECODE(
          MOD(ROW_NUMBER() OVER(ORDER BY X), N), 0, 1, 0) IS_BOUNDARY
      FROM t
      ORDER BY X
      

      I’m not sure, though, whether you’re really that much faster than with “ordinary” pagination in the end.

      1. Thank you for the prompt reply!

        Yes, the idea was to check the applicability of this technique to a read-mostly data set.

        Regarding performance.

        The query for “boundary” keys may return only the rows containing the boundary keys themselves, rather than all rows with a “boundary” flag. So I don’t share your concerns here: navigating from the current page to any arbitrary page via SEEK should have the same performance characteristics as navigating to the adjacent next/prev page, and should be faster than the OFFSET-based approach.

        It would be really great to see a follow-up blog post on this topic.

        1. I see. Well, for a read-mostly data set, this technique might well be applicable. My concern was that the whole table would need to be loaded once to calculate all the page boundaries, which is as expensive as jumping to the last page with an OFFSET. You’re right, though, that you don’t have to transfer all that data over the wire. All you need are the records that form the page boundaries, and of those records, you only need the columns that are part of the SEEK / ORDER BY clauses.

          If you perform “heavy” paging thereafter (e.g. dozens of requests), you’ll indeed save quite some time using SEEK rather than OFFSET in subsequent queries.

  3. Who invented the Seek Method? I’m unsure; we’ve been doing it for decades. Keyset cursors were one of the three types defined in the ANSI SQL-89 spec. They may have been in SQL-86, too.

    The Seek Method is by far the best approach in the niche where you are querying one table AND you have an appropriate index on the column(s) you are sorting & paging on.

    Unfortunately, it becomes less effective as you introduce complexity, e.g. a multi-table join that filters on columns from different tables and then sorts the results. Such problems required the introduction of surrogate row IDs and techniques like ROW_NUMBER and/or FETCH NEXT.

    1. Very interesting, I wasn’t aware of such cursor types in ANSI SQL-89. Do you have a link to (free versions of) authoritative documentation? So far, I’ve only found this page on the PostgreSQL wiki.

      What did that cursor syntax look like?

      Why do joins negatively affect keyset pagination? I’d imagine it works just the same. Keyset pagination is just a regular predicate on a driving index, regardless of the amount of additional joins…

  4. Would the following query be considered as using the seek method, and would it be efficient?

    select post.*  
    from post 
    where created_by = ? 
    and id > ? 
    order by post.id asc 
    limit ?
    
  5. Is there any real use for the seek method? I ask this because with the seek method you don’t jump to a random page; instead, you navigate through the pages one by one. I hardly see a scenario where you go farther than the fifth page. People don’t get to the 100th page one by one. So what is the real benefit of the seek method?

    1. The “seek” method (a.k.a. keyset pagination) is what the twitter timeline does, for instance. What you refer to is commonly called offset pagination, e.g. going to the 100th page (“100th page offset”).

      People just happen to use offset pagination, when keyset pagination would work much better.

  6. One thing I’m having trouble wrapping my head around: Keeping to the “Jack Harris example”; if you’re sorting by the score field, isn’t it possible that players with scores lower than 949 will have higher player_ids than 15 and thus get omitted from later “pages”? (in (score, player_id) < (949, 15))

    Obviously I'm missing something, I'd be grateful for having it pointed out to me :).

    Thanks for the article!

    1. The important thing here is to understand how the row comparison predicate works. The following two predicates are equivalent:

      // Row predicate
      (A, B) < (X, Y)
      
      // Equivalent "expanded" predicate
      A < X OR (A = X AND B < Y)
      

      As you can see, B < Y is only taken into consideration when A = X (let’s ignore nulls). So, if a player has a score lower than 949, we won’t compare player_ids at all. Only if two players both have a score of 949 will we compare their player_ids.

      It makes total sense if you think of the < operator in terms of SQL’s ORDER BY clause. When you write:

      ORDER BY A, B

      then you’re doing exactly the same thing: you’re ordering by A first, ignoring B. Only if you have two identical values for A do you order those rows by B.

      Hope this makes more sense now?
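      As an aside, this expansion is easy to verify mechanically. Python tuples, for instance, happen to compare lexicographically in exactly the same way as SQL row values (a tiny sanity check, not part of the original SQL):

      ```python
      # The row value predicate and its expansion, compared exhaustively
      # over a small domain:
      def expanded(a, b, x, y):
          return a < x or (a == x and b < y)

      for a in range(3):
          for b in range(3):
              # Python tuples also compare lexicographically
              assert ((a, b) < (1, 1)) == expanded(a, b, 1, 1)

      # A player with score 948 and a high player_id still makes the next page:
      assert (948, 999) < (949, 15)
      # A tied player with a higher player_id (already shown earlier) does not:
      assert not ((949, 20) < (949, 15))
      ```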
      1. Ah, I misunderstood the predicate, thought it expanded to an AND condition. Thanks for the explanation!

  7. The SQL-standard way to do this is via row-wise comparison.

    select * from foo
    where (a,b,c) > (1,2,3)
    order by a,b,c limit n

    Many databases unfortunately do not support or properly optimize this syntax, though. Postgres does :D

    1. Yeah, PostgreSQL is optimal here. Anyway, the suggestion was to introduce a syntax that goes *after* ORDER BY, because such a syntax will also work with alternating ASC / DESC ordering, unlike the row comparison syntax.

  8. Yeah, I used the seek method many moons ago (1999) to do paging. First used it on an API that we wrote for the Exel Computer System using java libraries. I tend to think everything through using set theory and then apply that using SQL…that way I don’t think about vendor specific methods and it makes it easier to port at a later point.

    1. That’s a very good approach (set theory first, then SQL), not seen very often these days. What’s your secret? Did you ever blog about your methodology?
