## How to Find the Closest Subset Sum with SQL

I recently stumbled upon a very interesting question on Stack Overflow. Its title is:

[How to] compare a number with sum of subset of numbers

In this article, we’ll compare the user’s imperative approach with an extremely elegant (Oracle) SQL approach, making use of a combination of awesome SQL features: window functions, Oracle’s `KEEP FIRST`, lateral joins, and recursive SQL.

### The problem

The user alhashmiya, who asked this question, was looking for a solution to the problem of finding the sum of a subset of a set of numbers A that is “closest” to any of a set of “expected” sums B. More concretely, alhashmiya had the following two tables:

```
ID  ASSIGN_AMT
--------------
1        25150
2        19800
3        27511
```

And…

```
ID  WORK_AMT
------------
1       7120
2       8150
3       8255
4       9051
5       1220
6      12515
7      13555
8       5221
9        812
10      6562
```

The `ASSIGN_AMT` value is the “expected” sum. What alhashmiya was looking for is the sum of a subset of `WORK_AMT` values A, such that this sum is as close as possible to any of the “expected” sums. There are two ways to understand this problem:

1. The possible “closest” sums are restricted to be the sums obtained in a strictly defined order, e.g. ordered by `ID`. An application of this understanding is to find out the exact moment when a well-defined, ordered running total (e.g. a bank account balance) has exceeded a certain threshold.
2. The possible “closest” sums are unrestricted. Any unordered subset qualifies to calculate such a sum. An application of this understanding is to find a combination of discrete values to reach a target value as closely as possible, e.g. to optimise a trade.

The second understanding is the famous “subset sum problem”, for which only exponential-time exact algorithms are known in the general case. It is important to notice that an exact solution will NOT scale well at all, regardless of the solution technique!
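To get a feeling for why, consider the number of candidate sums: every non-empty subset of n values produces one. Here’s a quick back-of-the-envelope check, sketched in Python purely for illustration (not part of the original question):

```python
# Each of the n values is either in or out of a subset:
# 2**n combinations, minus the empty set
for n in (10, 19, 30):
    print(n, "values ->", 2**n - 1, "subset sums")
```

With 10 values we get 1023 candidate sums; with 19 values (as in a later example), already 524287.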

But let’s look at the simpler understanding first:

## 1. Calculating a sum of an ordered subset of values

By strictly ordered we mean (in the sense of the question) that we want to order all `WORK_AMT` values, e.g. by `ID`, and allow only for sums that can appear in this particular order, i.e.:

```
ID  WORK_AMT  SUBSET_SUM
------------------------
1       7120        7120 (= WORK_AMT)
2       8150       15270 (= previous SUBSET_SUM + WORK_AMT)
3       8255       23525 (= previous SUBSET_SUM + WORK_AMT)
4       9051       32576 (= ...)
5       1220       33796
6      12515       46311
7      13555       59866
8       5221       65087
9        812       65899
10      6562       72461
```

The above `SUBSET_SUM` value is the sum of all `WORK_AMT` values with `ID <= "current" ID`. We’ve seen this before on this blog, this is called a running total, and it is best calculated using window functions as follows:

```sql
WITH
VALS (ID, WORK_AMT) AS (
SELECT 1 , 7120  FROM DUAL
UNION ALL SELECT 2 , 8150  FROM DUAL
UNION ALL SELECT 3 , 8255  FROM DUAL
UNION ALL SELECT 4 , 9051  FROM DUAL
UNION ALL SELECT 5 , 1220  FROM DUAL
UNION ALL SELECT 6 , 12515 FROM DUAL
UNION ALL SELECT 7 , 13555 FROM DUAL
UNION ALL SELECT 8 , 5221  FROM DUAL
UNION ALL SELECT 9 , 812   FROM DUAL
UNION ALL SELECT 10, 6562  FROM DUAL
)
SELECT
ID,
WORK_AMT,
SUM (WORK_AMT) OVER (ORDER BY ID) AS SUBSET_SUM
FROM
VALS
ORDER BY
ID
```

The above window function calculates the sum of all `WORK_AMT` values that are in the “window” of values where the `ID` is smaller or equal to the current `ID`.

### Finding the “closest” of these sums with quantified comparison predicates

Now, the task at hand is to find, for each `ASSIGN_AMT` value (`25150`, `19800`, and `27511`), the closest value of `SUBSET_SUM`. In a way, we want to minimise the expression `ABS(SUBSET_SUM - ASSIGN_AMT)`.

However, the `MIN()` aggregate function won’t help us here, because that will simply return the minimum value of this difference. We want the value of `SUBSET_SUM` that produces this difference in the first place.
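In other words, we need the argmin rather than the min. Outside of SQL this is a one-liner; here’s a Python sketch (for illustration only) over the running totals computed above:

```python
# Running totals (SUBSET_SUM) from the table above
subset_sums = [7120, 15270, 23525, 32576, 33796,
               46311, 59866, 65087, 65899, 72461]

# MIN() would give us the smallest difference...
print(min(abs(s - 25150) for s in subset_sums))        # 1625

# ...but we want the SUBSET_SUM that *produces* that difference
print(min(subset_sums, key=lambda s: abs(s - 25150)))  # 23525
```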

One solution would be to use a quantified comparison predicate, a rarely used and not well-known comparison operator that works in all SQL databases:

```sql
-- The common table expressions are the same as
-- in the previous examples
WITH
ASSIGN(ID, ASSIGN_AMT) AS (
SELECT 1, 25150 FROM DUAL
UNION ALL SELECT 2, 19800 FROM DUAL
UNION ALL SELECT 3, 27511 FROM DUAL
),
VALS (ID, WORK_AMT) AS (
SELECT 1 , 7120  FROM DUAL
UNION ALL SELECT 2 , 8150  FROM DUAL
UNION ALL SELECT 3 , 8255  FROM DUAL
UNION ALL SELECT 4 , 9051  FROM DUAL
UNION ALL SELECT 5 , 1220  FROM DUAL
UNION ALL SELECT 6 , 12515 FROM DUAL
UNION ALL SELECT 7 , 13555 FROM DUAL
UNION ALL SELECT 8 , 5221  FROM DUAL
UNION ALL SELECT 9 , 812   FROM DUAL
UNION ALL SELECT 10, 6562  FROM DUAL
),
SUMS (ID, WORK_AMT, SUBSET_SUM) AS (
SELECT
VALS.*,
SUM (WORK_AMT) OVER (ORDER BY ID)
FROM
VALS
)

-- This is now the interesting part, where we
-- calculate the closest sum
SELECT
ASSIGN.ID,
ASSIGN.ASSIGN_AMT,
SUBSET_SUM
FROM
ASSIGN
JOIN
SUMS
ON
ABS (ASSIGN_AMT - SUBSET_SUM) <= ALL (
SELECT
ABS (ASSIGN_AMT - SUBSET_SUM)
FROM
SUMS
)
```

The quantified comparison predicate reads intuitively. We’re looking for that specific `SUBSET_SUM` whose difference to the “expected” `ASSIGN_AMT` is smaller or equal to all the other possible differences.

The above query yields:

```
ID  ASSIGN_AMT  SUBSET_SUM
--------------------------
1        25150       23525
2        19800       23525
3        27511       23525
```

In this case, it happens to be the same `SUBSET_SUM` for each `ASSIGN_AMT`.

You may have noticed that the solution is not entirely correct in the event where `WORK_AMT` is allowed to be zero (let’s ignore the possibility of negative values), in which case we’ll produce duplicate rows in the `JOIN`. This can be seen when replacing, e.g., the row with `ID = 4` by:

```sql
        UNION ALL SELECT 4 , 0     FROM DUAL
```

Now, the result is:

```
ID  ASSIGN_AMT  SUBSET_SUM
--------------------------
1        25150       23525
2        19800       23525
2        19800       23525
3        27511       23525
```

One solution is to remove those duplicates using `DISTINCT` (an anti-pattern; see #6 in this list). A better solution is to make the `JOIN` predicate unambiguous by also comparing `ID` values, i.e.:

```sql
SELECT
ASSIGN.ID,
ASSIGN.ASSIGN_AMT,
SUBSET_SUM
FROM
ASSIGN
JOIN
SUMS
ON
(ABS (ASSIGN_AMT - SUBSET_SUM), SUMS.ID) <= ALL (
SELECT
ABS (ASSIGN_AMT - SUBSET_SUM),
ID
FROM
SUMS
)
```

The above unfortunately doesn’t work in Oracle (yet), which will report an error:

```
ORA-01796: this operator cannot be used with lists
```

Oracle supports comparing tuples / row value expressions only with equality operators, not with less than / greater than operators – which is a shame. The same query runs smoothly in PostgreSQL.

### Finding the “closest” of these sums with Oracle’s FIRST function

Oracle has a very interesting function to keep the first (or last) values in a set of aggregate values of a group given any particular ordering, and calculating an aggregate function only on those values within the group. The following SQL statement will illustrate this:

```sql
SELECT
ASSIGN.ID,
ASSIGN.ASSIGN_AMT,
MIN (SUBSET_SUM) KEEP (
DENSE_RANK FIRST
ORDER BY ABS (ASSIGN_AMT - SUBSET_SUM)
) AS CLOSEST_SUM
FROM
ASSIGN
CROSS JOIN
SUMS
GROUP BY
ASSIGN.ID, ASSIGN.ASSIGN_AMT
```

Essentially, we’re grouping all values from the `SUMS` table for each `ASSIGN_AMT`. For each of those groups, we’ll look for the `"FIRST"` row that appears when ordering rows within the group by our criteria `ABS(ASSIGN_AMT - SUBSET_SUM)`. We `"KEEP"` only those rows in the group, and retain from those rows the minimum `SUBSET_SUM`.

This query will again yield:

```
ID  ASSIGN_AMT  SUBSET_SUM
--------------------------
1        25150       23525
2        19800       23525
3        27511       23525
```

This is an extremely nice functionality that can come in handy every once in a while.

Remember that we’ve seen a similar feature recently on this blog, when we were looking for `FIRST_VALUE` (or `LAST_VALUE`) within the `PARTITION` of a window. In standard SQL, a similar thing can be achieved using window functions as follows:

```sql
SELECT DISTINCT
ASSIGN.ID,
ASSIGN.ASSIGN_AMT,
FIRST_VALUE (SUBSET_SUM) OVER (
PARTITION BY ASSIGN.ID
ORDER BY ABS (ASSIGN_AMT - SUBSET_SUM)
) AS CLOSEST_SUM
FROM
ASSIGN
CROSS JOIN
SUMS
```

Unfortunately, these solutions all produce duplicates, which we have to remove either via `GROUP BY` (`KEEP` solution), or via `DISTINCT` (`FIRST_VALUE` solution).

### Finding the “closest” of these sums with LATERAL JOIN

A cleaner solution that doesn’t rely on the removal of duplicates is to use Oracle 12c’s new `FETCH FIRST` clause along with `CROSS JOIN LATERAL` (or `CROSS APPLY`, which is the same):

```sql
SELECT
ASSIGN.ID,
ASSIGN.ASSIGN_AMT,
CLOSEST_SUM
FROM
ASSIGN
CROSS JOIN LATERAL (
SELECT
SUBSET_SUM AS CLOSEST_SUM
FROM
SUMS
ORDER BY
ABS (ASSIGN.ASSIGN_AMT - SUBSET_SUM)
FETCH FIRST 1 ROW ONLY
) SUMS
```

What does this mean? We’re essentially joining for each value in `ASSIGN` only the `FIRST 1 ROW` in `SUMS`, ordered by the usual criteria. We need `LATERAL` (or `APPLY`), because this allows us to reference columns from the left side of the `JOIN` expression also in the right side, which wouldn’t be possible otherwise.

The same functionality is supported in SQL Server (only using `CROSS APPLY`), or in PostgreSQL (only using `CROSS JOIN LATERAL`).

Lateral can be very useful whenever the right hand side of a `JOIN` depends on the left hand side. Unlike ordinary joins, this means that the `JOIN` order will be set in stone from left to right, and the optimiser has a reduced set of join algorithm options. This is useful in examples like these (with `ORDER BY` and `FETCH FIRST`), or when joining unnested table-valued functions, which we’ll cover in a future blog post.

## 2. Calculating a sum of any subset of values

So far, we’ve worked on a simplified version of the problem. This is probably not what alhashmiya meant in their Stack Overflow question. They were probably looking to solve the subset sum problem, finding the “closest” sum of any subset of `WORK_AMT` values.

We’ll use recursive SQL to calculate all the possible sums. First off, let’s remember how recursive SQL works:

In recursive SQL, we need a `UNION ALL` query in a common table expression (`WITH` clause in Oracle, or `WITH RECURSIVE` clause in PostgreSQL). The first subquery of the `UNION ALL` generates the “seed row(s)” of the recursion, and the second subquery of the `UNION ALL` generates the recursion based on a `SELECT` that selects from the table being declared, recursively.
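The mechanics can be sketched outside of SQL as well. This hypothetical Python helper (an illustration, not from the original article) mimics the `UNION ALL` recursion: seed rows first, then the recursive step is applied to the previous iteration’s rows until it produces nothing:

```python
def recursive_cte(seed, step):
    # UNION ALL recursion: start with the seed rows, then repeatedly
    # apply the recursive subquery to the previous step's rows
    # until it produces no more rows
    result, current = list(seed), list(seed)
    while current:
        current = step(current)
        result.extend(current)
    return result

# Equivalent of:
# WITH t(n) AS (SELECT 1 UNION ALL SELECT n + 1 FROM t WHERE n < 5)
print(recursive_cte([1], lambda rows: [n + 1 for n in rows if n < 5]))
# [1, 2, 3, 4, 5]
```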

So, a naive solution to this subset sum problem can be seen here:

```sql
-- Repetition of the previous data
WITH
ASSIGN (ID, ASSIGN_AMT) AS (
SELECT 1, 25150 FROM DUAL
UNION ALL SELECT 2, 19800 FROM DUAL
UNION ALL SELECT 3, 27511 FROM DUAL
),
WORK (ID, WORK_AMT) AS (
SELECT 1 , 7120  FROM DUAL
UNION ALL SELECT 2 , 8150  FROM DUAL
UNION ALL SELECT 3 , 8255  FROM DUAL
UNION ALL SELECT 4 , 9051  FROM DUAL
UNION ALL SELECT 5 , 1220  FROM DUAL
UNION ALL SELECT 6 , 12515 FROM DUAL
UNION ALL SELECT 7 , 13555 FROM DUAL
UNION ALL SELECT 8 , 5221  FROM DUAL
UNION ALL SELECT 9 , 812   FROM DUAL
UNION ALL SELECT 10, 6562  FROM DUAL
),

-- A new way of calculating all possible sums by
-- Recursively adding up all the sums
SUMS (SUBSET_SUM, MAX_ID) AS (
SELECT
WORK_AMT,
ID
FROM
WORK

UNION ALL

SELECT
WORK_AMT + SUBSET_SUM,
WORK.ID
FROM
SUMS
JOIN
WORK
ON
SUMS.MAX_ID < WORK.ID
)

-- The same technique to match the "closest" sum
-- As before
SELECT
ASSIGN.ID,
ASSIGN.ASSIGN_AMT,
MIN (SUBSET_SUM) KEEP (
DENSE_RANK FIRST
ORDER BY ABS (ASSIGN_AMT - SUBSET_SUM)
) AS CLOSEST_SUM
FROM
SUMS
CROSS JOIN
ASSIGN
GROUP BY
ASSIGN.ID, ASSIGN.ASSIGN_AMT
```

The recursion is simple. In the first subquery of the recursion (the “seed row”), we select each individual row in `WORK`:

```sql
        SELECT
WORK_AMT,
ID
FROM
WORK
```

In the second subquery of the recursion (the “recursion rows”), we join the value of the previous recursion step (`SUMS`) with all the remaining values (`WORK`), i.e. all the values that have a higher `ID`:

```sql
        SELECT
WORK_AMT + SUBSET_SUM,
WORK.ID
FROM
SUMS
JOIN
WORK
ON
SUMS.MAX_ID < WORK.ID
```

In this combination, we calculate the intermediate sum (which is also a running total, by the way), and we carry along the highest summed-up `ID` so far, to reduce the number of combinations. We can do the latter because addition is commutative.
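The effect of carrying `MAX_ID` along can be sketched in Python (again, purely an illustration): each row is only ever extended by values with a higher `ID`, so every subset is generated exactly once rather than once per permutation:

```python
def all_subset_sums(work):
    # work: list of (id, amt) pairs. Mirrors the recursive CTE:
    # each (subset_sum, max_id) row is only extended by values with
    # a higher ID, so each subset appears exactly once
    sums = [(amt, i) for i, amt in work]
    frontier = sums[:]
    while frontier:
        frontier = [(s + amt, i)
                    for s, m in frontier
                    for i, amt in work if i > m]
        sums.extend(frontier)
    return [s for s, _ in sums]

work = list(enumerate([7120, 8150, 8255, 9051, 1220,
                       12515, 13555, 5221, 812, 6562], start=1))
sums = all_subset_sums(work)
print(len(sums))                          # 1023 = 2**10 - 1 non-empty subsets
print(min(abs(s - 25150) for s in sums))  # 17 (the sum 25133 from below)
```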

The main difference in this solution compared to the previous approaches is the fact that we’re now generating a great many more values in the `SUMS` table.

After a still acceptable 0.112s for 10 different `WORK_AMT` values, the database calculated:

```
ID  ASSIGN_AMT  CLOSEST_SUM
---------------------------
1        25150        25133
2        19800        19768
3        27511        27488
```

But beware: as soon as you start adding values to the `WORK` table, this algorithm starts exploding in time and space complexity. Running the same query with the following, only slightly bigger `WORK` table already requires 16.3 seconds to yield a result:

```sql
    WORK(ID, WORK_AMT) AS (
SELECT 1 , 7120  FROM DUAL
UNION ALL SELECT 2 , 8150  FROM DUAL
UNION ALL SELECT 3 , 8255  FROM DUAL
UNION ALL SELECT 4 , 9051  FROM DUAL
UNION ALL SELECT 5 , 1220  FROM DUAL
UNION ALL SELECT 6 , 12515 FROM DUAL
UNION ALL SELECT 7 , 13555 FROM DUAL
UNION ALL SELECT 8 , 5221  FROM DUAL
UNION ALL SELECT 9 , 812   FROM DUAL
UNION ALL SELECT 10, 6562  FROM DUAL
UNION ALL SELECT 11, 1234  FROM DUAL
UNION ALL SELECT 12, 61    FROM DUAL
UNION ALL SELECT 13, 616   FROM DUAL
UNION ALL SELECT 14, 2456  FROM DUAL
UNION ALL SELECT 15, 5161  FROM DUAL
UNION ALL SELECT 16, 414   FROM DUAL
UNION ALL SELECT 17, 516   FROM DUAL
UNION ALL SELECT 18, 617   FROM DUAL
UNION ALL SELECT 19, 146   FROM DUAL
),
```

And the result would be:

```
ID  ASSIGN_AMT  CLOSEST_SUM
---------------------------
1        25150        25150
2        19800        19800
3        27511        27511
```

Want proof of the actual sum? That’s easy as well with recursive SQL:

```sql
-- Repetition of the previous data
WITH
ASSIGN (ID, ASSIGN_AMT) AS (
SELECT 1, 25150 FROM DUAL
UNION ALL SELECT 2, 19800 FROM DUAL
UNION ALL SELECT 3, 27511 FROM DUAL
),
WORK (ID, WORK_AMT) AS (
SELECT 1 , 7120  FROM DUAL
UNION ALL SELECT 2 , 8150  FROM DUAL
UNION ALL SELECT 3 , 8255  FROM DUAL
UNION ALL SELECT 4 , 9051  FROM DUAL
UNION ALL SELECT 5 , 1220  FROM DUAL
UNION ALL SELECT 6 , 12515 FROM DUAL
UNION ALL SELECT 7 , 13555 FROM DUAL
UNION ALL SELECT 8 , 5221  FROM DUAL
UNION ALL SELECT 9 , 812   FROM DUAL
UNION ALL SELECT 10, 6562  FROM DUAL
),

-- A new way of calculating all possible sums by
-- Recursively adding up all the sums
SUMS (SUBSET_SUM, MAX_ID, CALC) AS (
SELECT
WORK_AMT,
ID,
TO_CHAR(WORK_AMT)
FROM WORK

UNION ALL

SELECT
WORK_AMT + SUBSET_SUM,
WORK.ID,
CALC || ' + ' || WORK_AMT
FROM
SUMS
JOIN
WORK
ON
SUMS.MAX_ID < WORK.ID
)

-- The same technique to match the "closest" sum
-- As before
SELECT
ASSIGN.ID,
ASSIGN.ASSIGN_AMT,
MIN (SUBSET_SUM) KEEP (
DENSE_RANK FIRST
ORDER BY ABS (ASSIGN_AMT - SUBSET_SUM)
) AS CLOSEST_SUM,
MIN (CALC) KEEP (
DENSE_RANK FIRST
ORDER BY ABS (ASSIGN_AMT - SUBSET_SUM)
) AS CALCULATION
FROM
SUMS
CROSS JOIN
ASSIGN
GROUP BY
ASSIGN.ID, ASSIGN.ASSIGN_AMT
```

The above now yields:

```
ID  ASSIGN_AMT  CLOSEST_SUM  CALCULATION
------------------------------------------------------------
1        25150        25133  7120 + 8150 + 9051 + 812
2        19800        19768  1220 + 12515 + 5221 + 812
3        27511        27488  8150 + 8255 + 9051 + 1220 + 812
```

## Conclusion

SQL is powerful. Extremely powerful. In this article, we’ve used various techniques to solve the subset sum problem, or a simplification thereof. We’ve shown how to solve this problem in Oracle or PostgreSQL using a combination of these awesome SQL features:

• Window functions
• `KEEP FIRST` (in Oracle only)
• `LATERAL JOIN` (or `APPLY`)
• Recursive SQL

We’d like to point out that all of these solutions can be written in Java using jOOQ in a type safe manner as well. Finally, a lot of this ground is covered in more detail in the following articles. Enjoy learning, and enjoy SQL!

## The Difference Between ROW_NUMBER(), RANK(), and DENSE_RANK()

There was SQL before window functions and SQL after window functions

If you’re lucky enough to be using any of these databases, then you can use window functions yourself:

• CUBRID
• DB2
• Firebird
• H2
• Informix
• MySQL
• Oracle
• PostgreSQL
• SQLite
• SQL Server
• Sybase SQL Anywhere

One of the most obvious and useful sets of window functions is the set of ranking functions, where rows from your result set are ranked according to a certain scheme. There are three ranking functions:

• `ROW_NUMBER()`
• `RANK()`
• `DENSE_RANK()`

The difference is easy to remember. For the examples, let’s assume we have this table (using PostgreSQL syntax):

```sql
CREATE TABLE t(v) AS
SELECT * FROM (
VALUES('a'),('a'),('a'),('b'),
('c'),('c'),('d'),('e')
) t(v)
```

### ROW_NUMBER()

… assigns unique numbers to each row within the `PARTITION` given the `ORDER BY` clause. So you’d write:

```sql
SELECT v, ROW_NUMBER() OVER()
FROM t
```

Note that some SQL dialects (e.g. SQL Server) require an explicit `ORDER BY` clause in the `OVER()` clause:

```sql
SELECT v, ROW_NUMBER() OVER(ORDER BY v)
FROM t
```

The above query returns:

```
| V | ROW_NUMBER |
|---|------------|
| a |          1 |
| a |          2 |
| a |          3 |
| b |          4 |
| c |          5 |
| c |          6 |
| d |          7 |
| e |          8 |
```

### RANK()

… behaves like `ROW_NUMBER()`, except that “equal” rows are ranked the same. If we substitute `RANK()` into our previous query:

```sql
SELECT v, RANK() OVER(ORDER BY v)
FROM t
```

… then the result we’re getting is this:

```
| V | RANK |
|---|------|
| a |    1 |
| a |    1 |
| a |    1 |
| b |    4 |
| c |    5 |
| c |    5 |
| d |    7 |
| e |    8 |
```

As you can see, much like in a sports ranking, we have gaps between the different ranks. We can avoid those gaps by using

### DENSE_RANK()

Trivially, `DENSE_RANK()` is a rank with no gaps, i.e. it is “dense”. We can write:

```sql
SELECT v, DENSE_RANK() OVER(ORDER BY v)
FROM t
```

… to obtain

```
| V | DENSE_RANK |
|---|------------|
| a |          1 |
| a |          1 |
| a |          1 |
| b |          2 |
| c |          3 |
| c |          3 |
| d |          4 |
| e |          5 |
```

One interesting aspect of `DENSE_RANK()` is the fact that it “behaves like” `ROW_NUMBER()` when we add the `DISTINCT` keyword.

```sql
SELECT DISTINCT v, DENSE_RANK() OVER(ORDER BY v)
FROM t
```

… to obtain

```
| V | DENSE_RANK |
|---|------------|
| a |          1 |
| b |          2 |
| e |          5 |
| d |          4 |
| c |          3 |
```

In fact, `ROW_NUMBER()` defeats `DISTINCT`, because `ROW_NUMBER()` generates unique values across the partition before `DISTINCT` is applied:

```sql
SELECT DISTINCT v, ROW_NUMBER() OVER(ORDER BY v)
FROM t
ORDER BY 1, 2
```

`DISTINCT` has no effect:

```
| V | ROW_NUMBER |
|---|------------|
| a |          1 |
| a |          2 |
| a |          3 |
| b |          4 |
| c |          5 |
| c |          6 |
| d |          7 |
| e |          8 |
```

## Putting it all together

A good way to understand the three ranking functions is to see them all in action side-by-side. Run this query

```sql
SELECT
v,
ROW_NUMBER() OVER(ORDER BY v),
RANK()       OVER(ORDER BY v),
DENSE_RANK() OVER(ORDER BY v)
FROM t
ORDER BY 1, 2
```

… or this one (using the SQL standard `WINDOW` clause, to reuse window specifications):

```sql
SELECT
v,
ROW_NUMBER() OVER(w),
RANK()       OVER(w),
DENSE_RANK() OVER(w)
FROM t
WINDOW w AS (ORDER BY v)
```

… to obtain:

```
| V | ROW_NUMBER | RANK | DENSE_RANK |
|---|------------|------|------------|
| a |          1 |    1 |          1 |
| a |          2 |    1 |          1 |
| a |          3 |    1 |          1 |
| b |          4 |    4 |          2 |
| c |          5 |    5 |          3 |
| c |          6 |    5 |          3 |
| d |          7 |    7 |          4 |
| e |          8 |    8 |          5 |
```

Note that unfortunately, the `WINDOW` clause is not supported in all databases.
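The same three numbering schemes are easy to state outside of SQL. Here’s a hypothetical Python sketch (not from the article) over the sample data, assuming the values are already ordered by the `ORDER BY` criteria:

```python
def row_number(values):
    # Every row gets a unique, consecutive number
    return list(range(1, len(values) + 1))

def rank(values):
    # Equal values share the rank of their first occurrence: gaps appear
    return [values.index(v) + 1 for v in values]

def dense_rank(values):
    # Rank within the list of distinct values: no gaps
    distinct = sorted(set(values))
    return [distinct.index(v) + 1 for v in values]

v = ['a', 'a', 'a', 'b', 'c', 'c', 'd', 'e']
print(row_number(v))  # [1, 2, 3, 4, 5, 6, 7, 8]
print(rank(v))        # [1, 1, 1, 4, 5, 5, 7, 8]
print(dense_rank(v))  # [1, 1, 1, 2, 3, 3, 4, 5]
```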

## SQL is awesome

These things can be written very easily using SQL window functions. Once you get the hang of the syntax, you won’t want to miss this killer feature in your everyday SQL statements any more. Excited?

## Stop Trying to Emulate SQL OFFSET Pagination with Your In-House DB Framework!

I’m pretty sure you’ve gotten it wrong in numerous ways, so far. And you probably won’t get it right any time soon. So why waste your precious time on SQL tweaking, when you could be implementing business logic?

## Let me explain…

It wasn’t until the recent SQL:2008 standard that what MySQL users know as `LIMIT .. OFFSET` was standardised into the following simple statement:

```sql
SELECT *
FROM BOOK
OFFSET 2 ROWS
FETCH NEXT 1 ROWS ONLY
```

Yes. So many keywords. SQL is indeed a very verbose language. Personally, we really like the conciseness of MySQL’s / PostgreSQL’s `LIMIT .. OFFSET` clause, which is why we chose that for the jOOQ DSL API.

In SQL:

```sql
SELECT * FROM BOOK LIMIT 1 OFFSET 2
```

In jOOQ:

```java
select().from(BOOK).limit(1).offset(2);
```

Now, when you’re a SQL framework vendor, or when you’re rolling your own, in-house SQL abstraction, you might think about standardising this neat little clause. Here’s a couple of flavours from databases that natively support offset pagination:

```sql
-- MySQL, H2, HSQLDB, Postgres, and SQLite
SELECT * FROM BOOK LIMIT 1 OFFSET 2

-- CUBRID supports a MySQL variant of the
-- LIMIT .. OFFSET clause
SELECT * FROM BOOK LIMIT 2, 1

-- Derby, SQL Server 2012, Oracle 12, SQL:2008
SELECT * FROM BOOK
OFFSET 2 ROWS FETCH NEXT 1 ROWS ONLY

-- Ingres. Eek, almost the standard. Almost!
SELECT * FROM BOOK
OFFSET 2 FETCH FIRST 1 ROWS ONLY

-- Firebird
SELECT * FROM BOOK ROWS 2 TO 3

-- Sybase SQL Anywhere
SELECT TOP 1 ROWS START AT 3 * FROM BOOK

-- DB2 (without OFFSET)
SELECT * FROM BOOK FETCH FIRST 1 ROWS ONLY

-- Sybase ASE, SQL Server 2008 (without OFFSET)
SELECT TOP 1 * FROM BOOK
```

So far, so good. These can all be handled. Some databases put offsets before limits, others put limits before offsets, and the T-SQL family puts the whole `TOP` clause before the `SELECT` list. This is easy to emulate. Now what about:

• Oracle 11g and less
• SQL Server 2008 and less
• DB2 with OFFSET

When you google for this, you will find millions of ways to emulate `OFFSET .. FETCH` in those older databases. The optimal solutions always involve:

• Using doubly-nested derived tables with `ROWNUM` filtering in Oracle
• Using singly-nested derived tables with `ROW_NUMBER()` filtering in SQL Server and DB2

So you’re emulating it.

## Do you think you will get it right?

;-)

Let us go through a couple of issues that you may not have thought about.

First off, Oracle. Oracle obviously wanted to create maximum vendor lock-in, which is only exceeded by Apple’s recent introduction of Swift. This is why `ROWNUM` solutions perform best, even better than SQL:2003 standard window function based solutions. Don’t believe it? Read this very interesting article on Oracle offset pagination performance.

So, the optimal solution in Oracle is:

```sql
-- PostgreSQL syntax:
SELECT ID, TITLE
FROM BOOK
LIMIT 1 OFFSET 2

-- Oracle equivalent:
SELECT *
FROM (
SELECT b.*, ROWNUM rn
FROM (
SELECT ID, TITLE
FROM BOOK
) b
WHERE ROWNUM <= 3 -- (1 + 2)
)
WHERE rn > 2
```

### So that’s really the equivalent?

Of course not. You’re selecting an additional column, the `rn` column. You might just not care in most cases, but what if you wanted to make a limited subquery to be used with an `IN` predicate?

```sql
-- PostgreSQL syntax:
SELECT *
FROM BOOK
WHERE AUTHOR_ID IN (
SELECT ID
FROM AUTHOR
LIMIT 1 OFFSET 2
)

-- Oracle equivalent:
SELECT *
FROM BOOK
WHERE AUTHOR_ID IN (
SELECT * -- Ouch. These are two columns!
FROM (
SELECT b.*, ROWNUM rn
FROM (
SELECT ID
FROM AUTHOR
) b
WHERE ROWNUM <= 3
)
WHERE rn > 2
)
```

So, as you can see, you’ll have to do some more sophisticated SQL transformation. If you’re manually emulating `LIMIT .. OFFSET`, then you might just patch the `ID` column into the subquery:

```sql
SELECT *
FROM BOOK
WHERE AUTHOR_ID IN (
SELECT ID -- better
FROM (
SELECT b.ID, ROWNUM rn -- better
FROM (
SELECT ID
FROM AUTHOR
) b
WHERE ROWNUM <= 3
)
WHERE rn > 2
)
```

So, that’s more like it, right? But since you’re not writing this manually every time, you’re about to start creating your own nifty in-house SQL framework covering the 2-3 use cases that you’ve encountered so far, right?

You can do it. So you’ll regex-search-replace column names automagically to produce the above.

## So now, it is correct?

Of course not! Because you can have ambiguous column names in top-level `SELECT`s, but not in nested selects. What if you want to do this:

```sql
-- PostgreSQL syntax:
-- Perfectly valid repetition of two ID columns
SELECT BOOK.ID, AUTHOR.ID
FROM BOOK
JOIN AUTHOR
ON BOOK.AUTHOR_ID = AUTHOR.ID
LIMIT 1 OFFSET 2

-- Oracle equivalent:
SELECT *
FROM (
SELECT b.*, ROWNUM rn
FROM (
-- Ouch! ORA-00918: column ambiguously defined
SELECT BOOK.ID, AUTHOR.ID
FROM BOOK
JOIN AUTHOR
ON BOOK.AUTHOR_ID = AUTHOR.ID
) b
WHERE ROWNUM <= 3
)
WHERE rn > 2
```

Nope. And the trick of manually patching ID columns from the previous example doesn’t work, because you have multiple `ID` instances. And renaming the columns to random values is nasty, because the user of your home-grown in-house database framework wants to receive well-defined column names. I.e. `ID` and… `ID`.

So, the solution is to rename the columns twice. Once in each derived table:

```sql
-- Oracle equivalent:
-- Rename synthetic column names back to original
SELECT c1 ID, c2 ID
FROM (
SELECT b.c1, b.c2, ROWNUM rn
FROM (
-- synthetic column names here
SELECT BOOK.ID c1, AUTHOR.ID c2
FROM BOOK
JOIN AUTHOR
ON BOOK.AUTHOR_ID = AUTHOR.ID
) b
WHERE ROWNUM <= 3
)
WHERE rn > 2
```

## But now, we’re done?

Of course not! What if you doubly nest such a query? Will you think about doubly renaming `ID` columns to synthetic names, and back? … ;-) Let’s leave it here and talk about something entirely different:

## Does the same thing work for SQL Server 2008?

Of course not! In SQL Server 2008, the most popular approach is to use window functions. Namely, `ROW_NUMBER()`. So, let’s consider:

```sql
-- PostgreSQL syntax:
SELECT ID, TITLE
FROM BOOK
LIMIT 1 OFFSET 2

-- SQL Server equivalent:
SELECT b.*
FROM (
SELECT ID, TITLE,
ROW_NUMBER() OVER (ORDER BY ID) rn
FROM BOOK
) b
WHERE rn > 2 AND rn <= 3
```

## So that’s it, right?

Of course not! ;-)

OK, we’ve already had this issue. We should not select `*`, because that would generate too many columns in the case that we’re using this as a subquery for an `IN` predicate. So let’s consider the correct solution with synthetic column names:

```sql
-- SQL Server equivalent:
SELECT b.c1 ID, b.c2 TITLE
FROM (
SELECT ID c1, TITLE c2,
ROW_NUMBER() OVER (ORDER BY ID) rn
FROM BOOK
) b
WHERE rn > 2 AND rn <= 3
```

## But now we got it, right?

Make an educated guess: Nope!

What happens, if you add an `ORDER BY` clause to the original query?

```sql
-- PostgreSQL syntax:
SELECT ID, TITLE
FROM BOOK
ORDER BY SOME_COLUMN
LIMIT 1 OFFSET 2

-- Naive SQL Server equivalent:
SELECT b.c1 ID, b.c2 TITLE
FROM (
SELECT ID c1, TITLE c2,
ROW_NUMBER() OVER (ORDER BY ID) rn
FROM BOOK
ORDER BY SOME_COLUMN
) b
WHERE rn > 2 AND rn <= 3
```

Now, that doesn’t work in SQL Server. Subqueries are not allowed to have an `ORDER BY` clause, unless they also have a `TOP` clause (or an `OFFSET .. FETCH` clause in SQL Server 2012).

OK, we can probably tweak this using `TOP 100 PERCENT` to make SQL Server happy.

```sql
-- Better SQL Server equivalent:
SELECT b.c1 ID, b.c2 TITLE
FROM (
SELECT TOP 100 PERCENT
ID c1, TITLE c2,
ROW_NUMBER() OVER (ORDER BY ID) rn
FROM BOOK
ORDER BY SOME_COLUMN
) b
WHERE rn > 2 AND rn <= 3
```

Now, that’s correct SQL according to SQL Server, although you do not have a guarantee that the ordering of the derived table will survive after query execution. It may well be that the ordering is changed again by some influence.

If you wanted to order by `SOME_COLUMN` in the outer query, you’d have to again transform the SQL statement to add another synthetic column:

```sql
-- Better SQL Server equivalent:
SELECT b.c1 ID, b.c2 TITLE
FROM (
SELECT TOP 100 PERCENT
ID c1, TITLE c2,
SOME_COLUMN c99,
ROW_NUMBER() OVER (ORDER BY ID) rn
FROM BOOK
) b
WHERE rn > 2 AND rn <= 3
ORDER BY b.c99
```

That does start getting a bit nasty. And let’s guess whether:

## This is the correct solution!

Of course not! What if the original query had `DISTINCT` in it?

```sql
-- PostgreSQL syntax:
SELECT DISTINCT AUTHOR_ID
FROM BOOK
LIMIT 1 OFFSET 2

-- Naive SQL Server equivalent:
SELECT b.c1 AUTHOR_ID
FROM (
SELECT DISTINCT AUTHOR_ID c1,
ROW_NUMBER() OVER (ORDER BY AUTHOR_ID) rn
FROM BOOK
) b
WHERE rn > 2 AND rn <= 3
```

Now, what happens if an author has written several books? Yes, the `DISTINCT` keyword should remove such duplicates, and effectively, the PostgreSQL query will correctly remove duplicates first, and then apply `LIMIT` and `OFFSET`.

However, `ROW_NUMBER()` always generates distinct row numbers before `DISTINCT` can remove duplicates again. In other words, `DISTINCT` has no effect. The fix is to use `DENSE_RANK()` instead:

```sql
-- Better SQL Server equivalent:
SELECT b.c1 AUTHOR_ID
FROM (
SELECT DISTINCT AUTHOR_ID c1,
DENSE_RANK() OVER (ORDER BY AUTHOR_ID) rn
FROM BOOK
) b
WHERE rn > 2 AND rn <= 3
```

Watch out that the `ORDER BY` clause must contain all columns from the `SELECT` field list. Obviously, this will limit the acceptable columns in the `SELECT DISTINCT` field list to columns that are allowed in a window function’s `ORDER BY` clause (e.g. no other window functions).

We could of course try to fix that as well using common table expressions, or we consider

## Yet another issue??

Yes, of course!

Do you even know what the column(s) in the window function’s `ORDER BY` clause should be? Have you just picked any column, at random? What if that column doesn’t have an index on it, will your window function still perform?

The answer is easy when your original `SELECT` statement also has an `ORDER BY` clause, then you should probably take that one (plus all the columns from the `SELECT DISTINCT` clause if applicable).

But what if you don’t have any `ORDER BY` clause?

Yet another trick! Use a “constant” variable:

```sql
-- Better SQL Server equivalent:
SELECT b.c1 AUTHOR_ID
FROM (
SELECT AUTHOR_ID c1,
ROW_NUMBER() OVER (ORDER BY @@version) rn
FROM BOOK
) b
WHERE rn > 2 AND rn <= 3
```

Yes, you need to use a variable, because constants are not allowed in those `ORDER BY` clauses, in SQL Server. Painful, I know.

## Are we done yet!?!?

Probably not ;-) But we have probably covered around 99% of the common and edge cases. We can sleep nicely, now.

Note that all of these SQL transformations are implemented in jOOQ. jOOQ is the only SQL abstraction framework that takes SQL seriously (with all its warts and caveats), standardising over all of this madness.

As mentioned in the beginning, with jOOQ, you just write:

```java
// Don't worry about general emulation
select().from(BOOK).limit(1).offset(2);

// Don't worry about duplicate column names
// in subselects
select(BOOK.ID, AUTHOR.ID)
.from(BOOK)
.join(AUTHOR)
.on(BOOK.AUTHOR_ID.eq(AUTHOR.ID))
.limit(1).offset(2);

// Don't worry about invalid IN predicates
select()
.from(BOOK)
.where(BOOK.AUTHOR_ID).in(
select(AUTHOR.ID)
.from(AUTHOR)
.limit(1).offset(2)
);

// Don't worry about the ROW_NUMBER() vs.
// DENSE_RANK() distinction
selectDistinct(AUTHOR_ID)
.from(BOOK).limit(1).offset(2);
```

With jOOQ, you can just write your Oracle SQL or Transact-SQL as if it were as awesome as PostgreSQL! … without jumping the SQL ship entirely, and moving on to JPA.

## Keyset paging

Now, of course, if you have been reading our blog, or our partner blog SQL Performance Explained, you should know by now that `OFFSET` pagination is often a bad choice in the first place. You should know that keyset pagination almost always outperforms `OFFSET` pagination.