## How to Find the Closest Subset Sum with SQL

I’ve stumbled upon this very interesting question on Stack Overflow, recently. Its title is:

[How to] compare a number with sum of subset of numbers

In this article, we’ll compare the user’s imperative approach to the extremely elegant (Oracle) SQL approach. We’ll be making use of any combination of these awesome SQL features:

### The problem

The user alhashmiya who had asked this question, was looking for a solution to the problem of finding the “closest” sum of elements in a subset of numbers A to a set of “expected” sums B. More concretely, alhasmiya had the following two tables:

```ID  ASSIGN_AMT
--------------
1        25150
2        19800
3        27511
```

And…

```ID  WORK_AMT
------------
1       7120
2       8150
3       8255
4       9051
5       1220
6      12515
7      13555
8       5221
9        812
10      6562
```

The `ASSIGN_AMT` value is the “expected” sum. What alhashmiya was looking for is the sum of a subset of `WORK_AMT` values A, such that this sum is as close as possible to any of the “expected” sums. There are two ways to understand this problem:

1. The possible “closest” sums are restricted to be the sums obtained in a strictly defined order, e.g. ordered by `ID`. An application of this understanding is to find out the exact moment when a well-defined, ordered running total (e.g. bank account balance) has exceeded a certain threshold
2. The possible “closest” sums are unrestricted. Any unordered subset qualifies to calculate such a sum. An application of this understanding is to find a combination of discrete values to reach a target value as closely as possible, e.g. to optimise a trade.

The second understanding is called the “subset sum problem”, for which there are only exponential algorithms if you’re looking for an exact solution. It is important to notice that this algorithm will NOT scale well at all, regardless of the solution technique!

But let’s look at the simpler understanding first:

## 1. Calculating a sum of an ordered subset of values

By strictly ordered we mean (in the sense of the question) that we want to order all `WORK_AMT` values, e.g. by `ID`, and allow only for sums that can appear in this particular order. I.e.

```ID  WORK_AMT  SUBSET_SUM
------------------------
1       7120        7120 (= WORK_AMT)
2       8150       15270 (= previous SUBSET_SUM + WORK_AMT)
3       8255       23525 (= previous SUBSET_SUM + WORK_AMT)
4       9051       32576 (= ...)
5       1220       33796
6      12515       46311
7      13555       59866
8       5221       65087
9        812       65899
10      6562       72461
```

The above `SUBSET_SUM` value is the sum of all `WORK_AMT` values with `ID <= "current" ID`. We’ve seen this before on this blog, this is called a running total, and it is best calculated using window functions as follows:

```WITH
VALS (ID, WORK_AMT) AS (
SELECT 1 , 7120  FROM DUAL
UNION ALL SELECT 2 , 8150  FROM DUAL
UNION ALL SELECT 3 , 8255  FROM DUAL
UNION ALL SELECT 4 , 9051  FROM DUAL
UNION ALL SELECT 5 , 1220  FROM DUAL
UNION ALL SELECT 6 , 12515 FROM DUAL
UNION ALL SELECT 7 , 13555 FROM DUAL
UNION ALL SELECT 8 , 5221  FROM DUAL
UNION ALL SELECT 9 , 812   FROM DUAL
UNION ALL SELECT 10, 6562  FROM DUAL
)
SELECT
ID,
WORK_AMT,
SUM (WORK_AMT) OVER (ORDER BY ID) AS SUBSET_SUM
FROM
VALS
ORDER BY
ID
```

The above window function calculates the sum of all `WORK_AMT` values that are in the “window” of values where the `ID` is smaller or equal to the current `ID`.

### Finding the “closest” of these sums with quantified comparison predicates

Now, the task at hand is to find for each value `ASSIGN_AMT` in `25150`, `19800`, and `27511` the closest value of `SUBSET_SUM`. In a way, what we are trying to do is we want to minimise the expression `ABS(SUBSET_SUM - ASSIGN_AMT)`.

However, the `MIN()` aggregate function won’t help us here, because that will simply return the minimum value of this difference. We want the value of `SUBSET_SUM` that produces this difference in the first place.

One solution would be to use a quantified comparison predicate, a rarely used and not well-known comparison operator that works in all SQL databases:

```-- The common table expressions are the same as
-- in the previous examples
WITH
ASSIGN(ID, ASSIGN_AMT) AS (
SELECT 1, 25150 FROM DUAL
UNION ALL SELECT 2, 19800 FROM DUAL
UNION ALL SELECT 3, 27511 FROM DUAL
),
VALS (ID, WORK_AMT) AS (
SELECT 1 , 7120  FROM DUAL
UNION ALL SELECT 2 , 8150  FROM DUAL
UNION ALL SELECT 3 , 8255  FROM DUAL
UNION ALL SELECT 4 , 9051  FROM DUAL
UNION ALL SELECT 5 , 1220  FROM DUAL
UNION ALL SELECT 6 , 12515 FROM DUAL
UNION ALL SELECT 7 , 13555 FROM DUAL
UNION ALL SELECT 8 , 5221  FROM DUAL
UNION ALL SELECT 9 , 812   FROM DUAL
UNION ALL SELECT 10, 6562  FROM DUAL
),
SUMS (ID, WORK_AMT, SUBSET_SUM) AS (
SELECT
VALS.*,
SUM (WORK_AMT) OVER (ORDER BY ID)
FROM
VALS
)

-- This is now the interesting part, where we
-- calculate the closest sum
SELECT
ASSIGN.ID,
ASSIGN.ASSIGN_AMT,
SUBSET_SUM
FROM
ASSIGN
JOIN
SUMS
ON
ABS (ASSIGN_AMT - SUBSET_SUM) <= ALL (
SELECT
ABS (ASSIGN_AMT - SUBSET_SUM)
FROM
SUMS
)
```

The quantified comparison predicate reads intuitively. We’re looking for that specific `SUBSET_SUM` whose difference to the “expected” `ASSIGN_AMT` is smaller or equal to all the other possible differences.

The above query yields:

```ID  ASSIGN_AMT  SUBSET_SUM
--------------------------
1        25150       23525
2        19800       23525
3        27511       23525
```

In this case, it’s always the same.

You may have noticed that the solution is not entirely correct in the event where `WORK_AMT` is allowed to be zero (let’s ignore the possibility of negative values) – in case of which we’ll produce duplicate values in the `JOIN`. This can be achieved when replacing:

```        UNION ALL SELECT 4 , 0     FROM DUAL
```

Now, the result is:

```ID  ASSIGN_AMT  SUBSET_SUM
--------------------------
1        25150       23525
2        19800       23525
2        19800       23525
3        27511       23525
```

One solution is to remove those duplicates using `DISTINCT` (which is an anti-pattern. See #6 in this list). A better solution is to make the `JOIN` predicate unambiguous by comparing `ID` values as well, i.e.:

```SELECT
ASSIGN.ID,
ASSIGN.ASSIGN_AMT,
SUBSET_SUM
FROM
ASSIGN
JOIN
SUMS
ON
(ABS (ASSIGN_AMT - SUBSET_SUM), SUMS.ID) <= ALL (
SELECT
ABS (ASSIGN_AMT - SUBSET_SUM),
ID
FROM
SUMS
)
```

The above unfortunately doesn’t work in Oracle (yet), which will report an error:

```ORA-01796: this operator cannot be used with lists
```

Oracle supports comparing tuples / row value expressions only with equal comparators, not with less than / greater than comparators – which is a shame. The same query runs smoothlessly in PostgreSQL.

### Finding the “closest” of these sums with Oracle’s FIRST function

Oracle has a very interesting function to keep the first (or last) values in a set of aggregate values of a group given any particular ordering, and calculating an aggregate function only on those values within the group. The following SQL statement will illustrate this:

```SELECT
ASSIGN.ID,
ASSIGN.ASSIGN_AMT,
MIN (SUBSET_SUM) KEEP (
DENSE_RANK FIRST
ORDER BY ABS (ASSIGN_AMT - SUBSET_SUM)
) AS CLOSEST_SUM
FROM
ASSIGN
CROSS JOIN
SUMS
GROUP BY
ASSIGN.ID, ASSIGN.ASSIGN_AMT
```

Essentially, we’re grouping all values from the `SUMS` table for each `ASSIGN_AMT`. For each of those groups, we’ll look for the `"FIRST"` row that appears when ordering rows within the group by our criteria `ABS(ASSIGN_AMT - SUBSET_SUM)`. We `"KEEP"` only those rows in the group, and retain from those rows the minimum `SUBSET_SUM`.

This query will again yield:

```ID  ASSIGN_AMT  SUBSET_SUM
--------------------------
1        25150       23525
2        19800       23525
3        27511       23525
```

This is an extremely nice functionality that can come in handy every once in a while.

Remember that we’ve seen a similar feature recently on this blog, when we were looking for `FIRST_VALUE` (or `LAST_VALUE`) within the `PARTITION` of a window. In standard SQL, a similar thing can be achieved using window functions as such:

```SELECT DISTINCT
ASSIGN.ID,
ASSIGN.ASSIGN_AMT,
FIRST_VALUE (SUBSET_SUM) OVER (
PARTITION BY ASSIGN.ID
ORDER BY ABS (ASSIGN_AMT - SUBSET_SUM)
) AS CLOSEST_SUM
FROM
ASSIGN
CROSS JOIN
SUMS
```

Unfortunately, these solutions all produce duplicates, which we have to remove either via `GROUP BY` (`KEEP` solution), or via `DISTINCT` (`FIRST_VALUE` solution).

### Finding the “closest” of these sums with LATERAL JOIN

A cleaner solution that doesn’t rely on the removal of duplicates is using Oracle 12c’s new `FETCH FIRST` clause along with `CROSS JOIN LATERAL` (or `CROSS APPLY`, which is the same)

```SELECT
ASSIGN.ID,
ASSIGN.ASSIGN_AMT,
CLOSEST_SUM
FROM
ASSIGN
CROSS JOIN LATERAL (
SELECT
SUBSET_SUM AS CLOSEST_SUM
FROM
SUMS
ORDER BY
ABS (ASSIGN.ASSIGN_AMT - SUBSET_SUM)
FETCH FIRST 1 ROW ONLY
) SUMS
```

What does this mean? We’re essentially joining for each value in `ASSIGN` only the `FIRST 1 ROW` in `SUMS`, ordered by the usual criteria. We need `LATERAL` (or `APPLY`), because this allows us to reference columns from the left side of the `JOIN` expression also in the right side, which wouldn’t be possible otherwise.

The same functionality is supported in SQL Server (only using `CROSS APPLY`), or in PostgreSQL (only using `CROSS JOIN LATERAL`).

Lateral can be very useful whenever the right hand side of a `JOIN` depends on the left hand side. Unlike ordinary joins, this means that the `JOIN` order will be set in stone from left to right, and the optimiser has a reduced set of join algorithm options. This is useful in examples like these (with `ORDER BY` and `FETCH FIRST`), or when joining unnested table-valued functions, which we’ll cover in a future blog post.

## 2. Calculating a sum of any subset of values

So far, we’ve worked on a simplified version of the problem. This is probably not what alhashmiya meant on their Stack Overflow question. They probably wanted to solve the Subset sum problem, finding the “closest” sum of any subset of `WORK_AMT` values.

We’ll use recursive SQL to calculate all the possible sums. First off, let’s remember how recursive SQL works:

In recursive SQL, we need a `UNION ALL` query in a common table expression (`WITH` clause in Oracle, or `WITH RECURSIVE` clause in PostgreSQL). The first subquery of `UNION ALL` generates the “seed row(s)” of the recursion, and the second subqeury of `UNION ALL` generates the recursion based on a `SELECT` that selects from the table being declared, recursively.

So, a naive solution to this subset sum problem can be seen here:

```-- Repetition of the previous data
WITH
ASSIGN (ID, ASSIGN_AMT) AS (
SELECT 1, 25150 FROM DUAL
UNION ALL SELECT 2, 19800 FROM DUAL
UNION ALL SELECT 3, 27511 FROM DUAL
),
WORK (ID, WORK_AMT) AS (
SELECT 1 , 7120  FROM DUAL
UNION ALL SELECT 2 , 8150  FROM DUAL
UNION ALL SELECT 3 , 8255  FROM DUAL
UNION ALL SELECT 4 , 9051  FROM DUAL
UNION ALL SELECT 5 , 1220  FROM DUAL
UNION ALL SELECT 6 , 12515 FROM DUAL
UNION ALL SELECT 7 , 13555 FROM DUAL
UNION ALL SELECT 8 , 5221  FROM DUAL
UNION ALL SELECT 9 , 812   FROM DUAL
UNION ALL SELECT 10, 6562  FROM DUAL
),

-- A new way of calculating all possible sums by
-- Recursively adding up all the sums
SUMS (SUBSET_SUM, MAX_ID) AS (
SELECT
WORK_AMT,
ID
FROM
WORK

UNION ALL

SELECT
WORK_AMT + SUBSET_SUM,
WORK.ID
FROM
SUMS
JOIN
WORK
ON
SUMS.MAX_ID < WORK.ID
)

-- The same technique to match the "closest" sum
-- As before
SELECT
ASSIGN.ID,
ASSIGN.ASSIGN_AMT,
MIN (SUBSET_SUM) KEEP (
DENSE_RANK FIRST
ORDER BY ABS (ASSIGN_AMT - SUBSET_SUM)
) AS CLOSEST_SUM
FROM
SUMS
CROSS JOIN
ASSIGN
GROUP BY
ASSIGN.ID, ASSIGN.ASSIGN_AMT
```

The recursion is simple. In the first subquery of the recursion (the “seed row”), we select each individual row in `WORK`:

```        SELECT
WORK_AMT,
ID
FROM
WORK
```

In the second subquery of the recursion (the “recusion rows”), we join the value of the previous recursion step (`SUMS`) with all the remaining values (`WORK`), i.e. all the values that have a higher `ID`:

```        SELECT
WORK_AMT + SUBSET_SUM,
WORK.ID
FROM
SUMS
JOIN
WORK
ON
SUMS.MAX_ID < WORK.ID
```

In this combination, we calculate the intermediate sum (which is also a running total, by the way) and we calculate the highest summed-up ID thus far, to reduce the number of combinations. The latter, we can do because summing is commutative.

The main difference in this solution compared to previous approaches is the fact that we’re now generating a lot (a huge lot) more different values in the `SUMS` table.

After a still acceptable 0.112s for 10 different `WORK_AMT` values, the database calculated:

```ID  ASSIGN_AMT  CLOSEST_SUM
---------------------------
1        25150        25133
2        19800        19768
3        27511        27488
```

But beware, as soon as you start adding values to the `VALS` table, this algorithm starts exploding in time and space complexity. Running the same query with the following, only slightly bigger `WORK` table already requires 16.3 seconds to yield a result:

```    WORK(ID, WORK_AMT) AS (
SELECT 1 , 7120  FROM DUAL
UNION ALL SELECT 2 , 8150  FROM DUAL
UNION ALL SELECT 3 , 8255  FROM DUAL
UNION ALL SELECT 4 , 9051  FROM DUAL
UNION ALL SELECT 5 , 1220  FROM DUAL
UNION ALL SELECT 6 , 12515 FROM DUAL
UNION ALL SELECT 7 , 13555 FROM DUAL
UNION ALL SELECT 8 , 5221  FROM DUAL
UNION ALL SELECT 9 , 812   FROM DUAL
UNION ALL SELECT 10, 6562  FROM DUAL
UNION ALL SELECT 11, 1234  FROM DUAL
UNION ALL SELECT 12, 61    FROM DUAL
UNION ALL SELECT 13, 616   FROM DUAL
UNION ALL SELECT 14, 2456  FROM DUAL
UNION ALL SELECT 15, 5161  FROM DUAL
UNION ALL SELECT 16, 414   FROM DUAL
UNION ALL SELECT 17, 516   FROM DUAL
UNION ALL SELECT 18, 617   FROM DUAL
UNION ALL SELECT 19, 146   FROM DUAL
),
```

And the result would be:

```ID  ASSIGN_AMT  CLOSEST_SUM
---------------------------
1        25150        25150
2        19800        19800
3        27511        27511
```

Want proof about the actual sum? That’s easy as well with recursive SQL:

```-- Repetition of the previous data
WITH
ASSIGN (ID, ASSIGN_AMT) AS (
SELECT 1, 25150 FROM DUAL
UNION ALL SELECT 2, 19800 FROM DUAL
UNION ALL SELECT 3, 27511 FROM DUAL
),
WORK (ID, WORK_AMT) AS (
SELECT 1 , 7120  FROM DUAL
UNION ALL SELECT 2 , 8150  FROM DUAL
UNION ALL SELECT 3 , 8255  FROM DUAL
UNION ALL SELECT 4 , 9051  FROM DUAL
UNION ALL SELECT 5 , 1220  FROM DUAL
UNION ALL SELECT 6 , 12515 FROM DUAL
UNION ALL SELECT 7 , 13555 FROM DUAL
UNION ALL SELECT 8 , 5221  FROM DUAL
UNION ALL SELECT 9 , 812   FROM DUAL
UNION ALL SELECT 10, 6562  FROM DUAL
),

-- A new way of calculating all possible sums by
-- Recursively adding up all the sums
SUMS (SUBSET_SUM, MAX_ID, CALC) AS (
SELECT
WORK_AMT,
ID,
TO_CHAR(WORK_AMT)
FROM WORK

UNION ALL

SELECT
WORK_AMT + SUBSET_SUM,
WORK.ID,
CALC || '+' || WORK_AMT
FROM
SUMS
JOIN
WORK
ON
SUMS.MAX_ID < WORK.ID
)

-- The same technique to match the "closest" sum
-- As before
SELECT
ASSIGN.ID,
ASSIGN.ASSIGN_AMT,
MIN (SUBSET_SUM) KEEP (
DENSE_RANK FIRST
ORDER BY ABS (ASSIGN_AMT - SUBSET_SUM)
) AS CLOSEST_SUM,
MIN (CALC) KEEP (
DENSE_RANK FIRST
ORDER BY ABS (ASSIGN_AMT - SUBSET_SUM)
) AS CALCULATION
FROM
SUMS
CROSS JOIN
ASSIGN
GROUP BY
ASSIGN.ID, ASSIGN.ASSIGN_AMT
```

The above now yields:

```ID  ASSIGN_AMT  CLOSEST_SUM  CALCULATION
------------------------------------------------------------
1        25150        25133  7120 + 8150 + 9051 + 812
2        19800        19768  1220 + 12515 + 5221 + 812
3        27511        27488  8150 + 8255 + 9051 + 1220 + 812
```

## Conclusion

SQL is powerful. Extremely powerful. In this article, we’ve used various techniques to calculate the subset sum problem, or a simplification thereof. We’ve shown how to solve this problem in Oracle or PostgreSQL using a combination of these awesome SQL features:

• Window functions
• `KEEP FIRST` (in Oracle only)
• `LATERAL JOIN` (or `APPLY`
• Recursive SQL

We’d like to point out that all of these solutions can be written in Java using jOOQ in a type safe manner as well. Finally, a lot of this grounds is covered in more detail in any of the following articles. Enjoy learning, and enjoy SQL!

## Recursive SQL for Data Normalisation

Recursive SQL can be awesome, although a bit hard to read in its SQL standard beauty. Let’s assume you have some aggregated data with dates and a number of events per date:

```|                           DATE | COUNT |
|--------------------------------|-------|
| October, 01 2013 00:00:00+0000 |     2 |
| October, 02 2013 00:00:00+0000 |     1 |
| October, 03 2013 00:00:00+0000 |     3 |
| October, 04 2013 00:00:00+0000 |     4 |
| October, 05 2013 00:00:00+0000 |     2 |
| October, 06 2013 00:00:00+0000 |     0 |
| October, 07 2013 00:00:00+0000 |     2 |```

Now let’s assume you want to normalise or “unaggregate” this data, generating “COUNT” records per date. The desired output is this:

```|                           DATE | EVENT_NUMBER |
|--------------------------------|--------------|
| October, 01 2013 00:00:00+0000 |            1 |
| October, 01 2013 00:00:00+0000 |            2 |
| October, 02 2013 00:00:00+0000 |            1 |
| October, 03 2013 00:00:00+0000 |            1 |
| October, 03 2013 00:00:00+0000 |            2 |
| October, 03 2013 00:00:00+0000 |            3 |
| October, 04 2013 00:00:00+0000 |            1 |
| October, 04 2013 00:00:00+0000 |            2 |
| October, 04 2013 00:00:00+0000 |            3 |
| October, 04 2013 00:00:00+0000 |            4 |
| October, 05 2013 00:00:00+0000 |            1 |
| October, 05 2013 00:00:00+0000 |            2 |
| October, 07 2013 00:00:00+0000 |            1 |
| October, 07 2013 00:00:00+0000 |            2 |```

As you may have noticed, there are no records for those dates with zero events (October 06). With recursive SQL, this is rather simple to achieve.

```with recursive

-- Data could also be a regular table containing
-- the actual data
data(date, count) as (
select date '2013-10-01', 2 union all
select date '2013-10-02', 1 union all
select date '2013-10-03', 3 union all
select date '2013-10-04', 4 union all
select date '2013-10-05', 2 union all
select date '2013-10-06', 0 union all
select date '2013-10-07', 2
),

-- This is the recursive common table expression
-- It starts with all data where count > 0
-- ... and then recurses by subtracting one
recurse(date, count) as (
select date, count
from data
where count > 0
union all
select date, count - 1
from recurse
where count > 1
)
select date, count event_number from recurse
order by date asc, event_number asc;
```

Incredibly, Oracle’s CONNECT BY clause doesn’t seem to be an option here. I challenge you to find a better solution, though! For instance, this beautiful solution that works with PostgreSQL:

```with recursive
data(date, count) as (
select date '2013-10-01', 2 union all
select date '2013-10-02', 1 union all
select date '2013-10-03', 3 union all
select date '2013-10-04', 4 union all
select date '2013-10-05', 2 union all
select date '2013-10-06', 0 union all
select date '2013-10-07', 2
)
select date, generate_series(1, count) event_number
from data
where count > 0
order by date asc, event_number asc;
```

## Recursive queries with Oracle’s CONNECT BY clause

Recursive or hierarchical queries are an awkward thing in SQL. Some RDBMS allow for recursiveness in Common Table Expressions (CTE’s), but those queries tend to be quite unreadable. That’s not the case for Oracle, which, in addition to recursive CTE’s also supports a dedicated CONNECT BY clause. The general syntax for this clause looks something like this:

```--   SELECT ..
--     FROM ..
--    WHERE ..
CONNECT BY [NOCYCLE] condition [AND condition, ...]
-- GROUP BY ..
```

Iterative queries are very simple to express with jOOQ as well. Just use the connectBy() method as such:

```// Some Oracle-specific features are only available
// from the OracleFactory
OracleFactory create = new OracleFactory(connection);

// Get a table with elements 1, 2, 3, 4, 5
create.select(create.rownum())
.connectBy(create.level().lessOrEqual(5))
.fetch();
```

Recursive queries are simple to express as well. Let’s say, you have a DIRECTORY table with ID, PARENT_ID, NAME columns. In order to recursively fetch all directories and calculate their absolute paths, you could issue a query like this in jOOQ. Note the usage of Oracle’s CONNECT BY, START WITH, and PRIOR keywords, as well as the SYS_CONNECT_BY_PATH function:

``` OracleFactory ora = new OracleFactory(connection);

List<?> paths =
ora.select(ora.sysConnectByPath(Directory.NAME, "/").substring(2))
.from(Directory)
.connectBy(ora.prior(Directory.ID).equal(Directory.PARENT_ID))
.startWith(Directory.PARENT_ID.isNull())
.orderBy(ora.literal(1))
.fetch(0);
```

jOOQ’s output could then look like this:

```+------------------------------------------------+
|substring                                       |
+------------------------------------------------+
|C:                                              |
|C:/eclipse                                      |
|C:/eclipse/configuration                        |
|C:/eclipse/dropins                              |
|C:/eclipse/eclipse.exe                          |
+------------------------------------------------+
|...21 record(s) truncated...```

In the near future, jOOQ is going to project the CONNECT BY syntax and API to other RDBMS’s Common Table Expression syntax. That way, you can express hierarchical queries in any database supporting CTE’s using Oracle’s CONNECT BY syntax.