Faster SQL Pagination with Keysets, Continued

A while ago, I have blogged about how to perform keyset pagination (some also call this the “seek method”). Keyset pagination is a very powerful technique to perform constant-time pagination also on very large result sets, where “classic” OFFSET pagination will inevitably get slow on large page numbers.

Keyset pagination is most useful for lazy loading the “next page”, much more than for jumping to page 235 directly, for instance. A good example where this is useful are social media, such as Twitter or Facebook with their streams of content. When you reach the bottom of a page, you just want the “next couple of tweets”. Not the tweets at offset 3564.

If your data being paginated on stays “reasonably unchanged” while paginating, you can even perform classical OFFSET pagination by pre-calculating (and caching) all page boundaries in one go. Imagine, you have this data (I’m using PostgreSQL for the example):

create table t(
  id int,
  value int
);

insert into t (id, value)
select id, random() * 100
from generate_series(1, 1000) x(id);

Our data has 1000 records with random values between 0 and 99. Now, let’s browse this data in ascending order, ordered by VALUE and then by ID. If we wanted to get to page 6 with page sizes of 5 with OFFSET pagination, we’d write:

select id, value
from t
order by value, id
limit 5
offset 25

This would then yield something like

|  ID | VALUE |
|-----|-------|
| 640 |     2 |
| 776 |     2 |
| 815 |     2 |
| 947 |     2 |
|  37 |     3 |

See also this SQLFiddle

OFFSET pagination emulation with cached page boundaries

The above can be emulated using keyset pagination if we know the page boundaries of every page. In other words, in order to jump to page 6 without an actual OFFSET, we’d have to know the value of the record immediately preceding {"ID":640,"VALUE":2}. Or better, let’s just figure out all the page boundaries with the following query:

select 
  t.id, 
  t.value,
  case row_number() 
       over(order by t.value, t.id) % 5 
    when 0 then 1 
    else 0 
  end page_boundary
from t
order by t.value, t.id

The above query yields

|   ID | VALUE | PAGE_BOUNDARY |
|------|-------|---------------|
|  ... |   ... |           ... |
|  474 |     2 |             0 |
|  533 |     2 |             1 | <-- Before page 6
|  640 |     2 |             0 |
|  776 |     2 |             0 |
|  815 |     2 |             0 |
|  947 |     2 |             0 |
|   37 |     3 |             1 | <-- Last on page 6
|  287 |     3 |             0 |
|  450 |     3 |             0 |
|  ... |   ... |           ... |

See also this SQL Fiddle

As you can see, the record just before {"ID":640,"VALUE":2} is {"ID":533,"VALUE":2}, which is the page boundary that we need to jump to if we want to go to page 6. Page 7 then starts with the record just after {"ID":37,"VALUE":3}.

The above query selects too much data, of course. We’re only interested in those records where PAGE_BOUNDARY = 1. Besides, why not calculate the page numbers already in the database with yet another call to ROW_NUMBER(). Let’s write:

select 
  x.value, 
  x.id,
  row_number() 
    over(order by x.value, x.id) + 1 page_number
from (
  select 
    t.id, 
    t.value,
    case row_number() 
         over(order by t.value, t.id) % 5 
      when 0 then 1 
      else 0 
    end page_boundary
  from t
  order by t.value, t.id
) x
where x.page_boundary = 1

This will then yield:

| VALUE |  ID | PAGE_NUMBER |
|-------|-----|-------------|
|     0 | 786 |           2 |
|     1 | 250 |           3 |
|     1 | 959 |           4 |
|     2 | 229 |           5 |
|     2 | 533 |           6 | <-- Before page 6
|     3 |  37 |           7 |
|     3 | 768 |           8 |

See also this SQLFiddle.

We can now jump to page 6 with this simple query:

select id, value
from t
where (value, id) > (2, 533)
order by value, id
limit 5

… which will yield the same as the previous OFFSET query:

|  ID | VALUE |
|-----|-------|
| 640 |     2 |
| 776 |     2 |
| 815 |     2 |
| 947 |     2 |
|  37 |     3 |

See also this SQLFiddle

If you’re planning on using the upcoming jOOQ 3.3, the same query can be achieved with the following SEEK syntax:

DSL.using(configuration)
   .select(T.ID, T.VALUE)
   .from(T)
   .orderBy(T.VALUE, T.ID)
   .seek(2, 533)
   .limit(5);

The advantage of this is that you don’t have to write out the SEEK predicate explicitly, which adds readability and typesafety, specifically if your ORDER BY clause is a little more complex

If window functions aren’t available

The above queries make use of window functions, which aren’t available in all databases, unfortunately. If you’re using MySQL, for instance, you will have to use a different mechanism to find the PAGE_BOUNDARY. One such example us using a scalar subquery:

select 
  t.id, 
  t.value,
  case (
      select count(*)
      from t t2
      where (t2.value, t2.id) <= (t.value, t.id)
    ) % 5 
    when 0 then 1 
    else 0 
  end page_boundary
from t
order by t.value, t.id

See also this SQLFiddle

Such a scalar subquery might be quite costly if your database performs poor query optimisation. Your best bet would be to measure things and decide whether caching page boundaries to be able to apply keyset pagination is really faster than classic OFFSET paging.

Conclusion

As explained in our previous blog post about keyset pagination, this technique can bring great performance improvements as pagination can be achieved in constant time, leveraging existing indexes. It is most useful when the underlying data is very stable (no records added / removed while paginating), or when pagination “stability” is desired even if records are added / removed.

Keyset pagination, or the “seek method”, should be part of every SQL developer’s tool set.

Faster SQL Pagination with jOOQ Using the Seek Method

Last week, I have blogged about why it is important to stay in control of your SQL, as writing good SQL helps keeping your operations costs down. This is true in many ways and today, we’re going to look into another way to write good, high-performing SQL: Using the “Seek Method”.

Slow OFFSET

In order to understand the Seek Method, let’s first understand what problem it solves: SQL OFFSET clauses are slow. They’re slow for a simple reason. In order to reach a high offset from a result set, all previous records have to be skipped and counted. While a query with no OFFSET can be very fast (using MySQL syntax):

SELECT first_name, last_name, score
FROM players
WHERE game_id = 42
ORDER BY score DESC
LIMIT 10;

Skipping to page number 10’000 will be much slower:

SELECT first_name, last_name, score
FROM players
WHERE game_id = 42
ORDER BY score DESC
LIMIT 10
OFFSET 100000;

Even if the tuple (game_id, score) is indexed, we’ll have to actually traverse the whole index in order to count how many records we’ve already skipped. While this problem can be somewhat lessened by a trick, joining players to a derived table, there is an alternative, much faster approach to tackling paging: the Seek Method.

The Seek Method

While it is not quite clear who originally invented the Seek Method (some also call it “keyset paging”), a very prominent advocate for it is Markus Winand. He describes the Seek Method on his blog (and in his book):

http://use-the-index-luke.com/sql/partial-results/fetch-next-page

Essentially, the Seek Method does not skip records before an OFFSET, but it skips records until the last record previously fetched. Think about paging on Google. From a usability point of view, you hardly ever skip exactly 100’000 records. You mostly want to skip to the next page and then again, to the next page, i.e. just past the last record / search result previously fetched. Take the following top 10 players (fake names generated with name generator):

first_name | last_name | score
------------------------------
Mary       | Paige     |  1098
Tracey     | Howard    |  1087
Jasmine    | Butler    |  1053
Zoe        | Piper     |  1002
Leonard    | Peters    |   983
Jonathan   | Hart      |   978
Adam       | Morrison  |   976
Amanda     | Gibson    |   967
Alison     | Wright    |   958
Jack       | Harris    |   949

The above are the first 10 players ordered by score. This can be achieved quite quickly using LIMIT 10 only. Now, when skipping to the next page, you can either just use an OFFSET 10 clause, or you skip all users with a score higher than 949:

SELECT first_name, last_name, score
FROM players
WHERE game_id = 42
-- Let's call this the "seek predicate"
AND score < 949
ORDER BY score DESC
LIMIT 10;

This will then give you the players on the next page:

first_name | last_name | score
------------------------------
William    | Fraser    |   947
Claire     | King      |   945
Jessica    | McDonald  |   932
...        | ...       |   ...

Note that the previous query assumes that the score is unique within the players table, which is unlikely, of course. If William Fraser also had 949 points, just as Jack Harris, the last player on the first page, he would be “lost between pages”. It is thus important to create a non-ambiguous ORDER BY clause and “seek predicate”, by adding an additional unique column:

SELECT player_id, first_name, last_name, score
FROM players
WHERE game_id = 42
-- assuming 15 is Jack Harris's player_id
AND (score, player_id) < (949, 15)
ORDER BY score DESC, player_id DESC
LIMIT 10;

Now, the “seek predicate” depends on the ORDER BY clause. Here are a couple of possible, alternative configurations:

-- "consistent" ASC and DESC correspond to > and <
AND (score, player_id) > (949, 15)
ORDER BY score ASC, player_id ASC

-- "mixed" ASC and DESC complicate things a bit
AND ((score < 949)
  OR (score = 949 AND player_id > 15))
ORDER BY score DESC, player_id ASC

-- The above might be further performance-tweaked
AND (score <= 949)
AND ((score < 949)
  OR (score = 949 AND player_id > 15))
ORDER BY score DESC, player_id ASC

If columns in the ORDER BY clause are nullable, NULLS FIRST and NULLS LAST might apply and further complicate the “seek predicate”.

How is this better than OFFSET?

The Seek Method allows for avoiding expensive “skip-and-count” operations, replacing them with a simple range scan on an index that might cover the “seek predicate”. Since you’re applying ORDER BY on the columns of the “seek predicate” anyway, you might have already chosen to index them appropriately.

While the Seek Method doesn’t improve queries for low page numbers, fetching higher page numbers is significantly faster as proven in this nice benchmark:

Reproduced from use-the-index-luke.com with permission by Markus Winand

More interesting feedback on the subject can be found in this reddit.com thread, in which even Tom Kyte himself added a couple of remarks.

A side effect of the Seek Method

A side effect of the Seek Method is the fact that the paging is more “stable”. When you’re about to display page 2 and a new player has reached page 1 in the mean time, or if any player is removed entirely, you will still display the same players on page 2. In other words, when using the Seek Method, there is no guarantee that the first player on page 2 has rank 11.

This may or may not be desired. It might be irrelevant on page 10’000, though.

jOOQ 3.3 support for the Seek Method

The upcoming jOOQ 3.3 (due for late 2013) will include support for the Seek Method on a SQL DSL API level. In addition to jOOQ’s existing LIMIT .. OFFSET support, a “seek predicate” can then be specified through the synthetic SEEK clause (similar to jOOQ’s synthetic DIVIDE BY clause):

DSL.using(configuration)
   .select(PLAYERS.PLAYER_ID,
           PLAYERS.FIRST_NAME,
           PLAYERS.LAST_NAME,
           PLAYERS.SCORE)
   .from(PLAYERS)
   .where(PLAYERS.GAME_ID.eq(42))
   .orderBy(PLAYERS.SCORE.desc(),
            PLAYERS.PLAYER_ID.asc())
   .seek(949, 15) // (!)
   .limit(10)
   .fetch();

Instead of explictly phrasing the “seek predicate”, just pass the last record from the previous query, and jOOQ will see that all records before and including this record are skipped, given the ORDER BY clause.

This appears much more readable than the actual SQL rendered because the “seek predicate” is closer to the ORDER BY clause where it belongs. Also, jOOQ’s usual row value typesafety is applied here helping you find the right degree / arity and data types for your SEEK clause. In the above example, the following method calls would not compile in Java:

// Not enough arguments in seek()
   .orderBy(PLAYERS.SCORE.desc(),
            PLAYERS.PLAYER_ID.asc())
   .seek(949)

// Wrong argument types in seek()
   .orderBy(PLAYERS.SCORE.desc(),
            PLAYERS.PLAYER_ID.asc())
   .seek(949, "abc")

Get to work with the Seek Method

With native API support for a SEEK clause, you can get in control of your SQL again and implement high-performing SQL quite easily. Early adopters can already play around with the current state of jOOQ’s 3.3.0 Open Source Edition, which is available on GitHub.

And even if you don’t use jOOQ, give the Seek Method a try. You may just have a much faster application afterwards!