How to Find the Closest Subset Sum with SQL

I’ve stumbled upon this very interesting question on Stack Overflow, recently. Its title is:
[How to] compare a number with sum of subset of numbers
In this article, we’ll compare the user’s imperative approach to the extremely elegant (Oracle) SQL approach. We’ll be making use of any combination of these awesome SQL features:

The problem

The user alhashmiya who had asked this question, was looking for a solution to the problem of finding the “closest” sum of elements in a subset of numbers A to a set of “expected” sums B. More concretely, alhasmiya had the following two tables:
ID  ASSIGN_AMT
--------------
1        25150
2        19800
3        27511
And…
ID  WORK_AMT
------------
1       7120
2       8150
3       8255
4       9051
5       1220
6      12515
7      13555
8       5221
9        812
10      6562
The ASSIGN_AMT value is the “expected” sum. What alhashmiya was looking for is the sum of a subset of WORK_AMT values A, such that this sum is as close as possible to any of the “expected” sums. There are two ways to understand this problem:
  1. The possible “closest” sums are restricted to be the sums obtained in a strictly defined order, e.g. ordered by ID. An application of this understanding is to find out the exact moment when a well-defined, ordered running total (e.g. bank account balance) has exceeded a certain threshold
  2. The possible “closest” sums are unrestricted. Any unordered subset qualifies to calculate such a sum. An application of this understanding is to find a combination of discrete values to reach a target value as closely as possible, e.g. to optimise a trade.
The second understanding is called the “subset sum problem”, for which there are only exponential algorithms if you’re looking for an exact solution. It is important to notice that this algorithm will NOT scale well at all, regardless of the solution technique! But let’s look at the simpler understanding first:

1. Calculating a sum of an ordered subset of values

By strictly ordered we mean (in the sense of the question) that we want to order all WORK_AMT values, e.g. by ID, and allow only for sums that can appear in this particular order. I.e.
ID  WORK_AMT  SUBSET_SUM
------------------------
1       7120        7120 (= WORK_AMT)
2       8150       15270 (= previous SUBSET_SUM + WORK_AMT)      
3       8255       23525 (= previous SUBSET_SUM + WORK_AMT)
4       9051       32576 (= ...)
5       1220       33796
6      12515       46311
7      13555       59866
8       5221       65087
9        812       65899
10      6562       72461
The above SUBSET_SUM value is the sum of all WORK_AMT values with ID <= "current" ID. We’ve seen this before on this blog, this is called a running total, and it is best calculated using window functions as follows:

WITH
    VALS (ID, WORK_AMT) AS (
                  SELECT 1 , 7120  FROM DUAL 
        UNION ALL SELECT 2 , 8150  FROM DUAL
        UNION ALL SELECT 3 , 8255  FROM DUAL
        UNION ALL SELECT 4 , 9051  FROM DUAL
        UNION ALL SELECT 5 , 1220  FROM DUAL
        UNION ALL SELECT 6 , 12515 FROM DUAL
        UNION ALL SELECT 7 , 13555 FROM DUAL
        UNION ALL SELECT 8 , 5221  FROM DUAL
        UNION ALL SELECT 9 , 812   FROM DUAL
        UNION ALL SELECT 10, 6562  FROM DUAL
    )
SELECT
    ID,
    WORK_AMT,
    SUM (WORK_AMT) OVER (ORDER BY ID) AS SUBSET_SUM
FROM
    VALS
ORDER BY
    ID

The above window function calculates the sum of all WORK_AMT values that are in the “window” of values where the ID is smaller or equal to the current ID.

Finding the “closest” of these sums with quantified comparison predicates

Now, the task at hand is to find for each value ASSIGN_AMT in 25150, 19800, and 27511 the closest value of SUBSET_SUM. In a way, what we are trying to do is we want to minimise the expression ABS(SUBSET_SUM - ASSIGN_AMT). However, the MIN() aggregate function won’t help us here, because that will simply return the minimum value of this difference. We want the value of SUBSET_SUM that produces this difference in the first place. One solution would be to use a quantified comparison predicate, a rarely used and not well-known comparison operator that works in all SQL databases:

-- The common table expressions are the same as
-- in the previous examples
WITH
    ASSIGN(ID, ASSIGN_AMT) AS (
                  SELECT 1, 25150 FROM DUAL 
        UNION ALL SELECT 2, 19800 FROM DUAL
        UNION ALL SELECT 3, 27511 FROM DUAL
    ),
    VALS (ID, WORK_AMT) AS (
                  SELECT 1 , 7120  FROM DUAL 
        UNION ALL SELECT 2 , 8150  FROM DUAL
        UNION ALL SELECT 3 , 8255  FROM DUAL
        UNION ALL SELECT 4 , 9051  FROM DUAL
        UNION ALL SELECT 5 , 1220  FROM DUAL
        UNION ALL SELECT 6 , 12515 FROM DUAL
        UNION ALL SELECT 7 , 13555 FROM DUAL
        UNION ALL SELECT 8 , 5221  FROM DUAL
        UNION ALL SELECT 9 , 812   FROM DUAL
        UNION ALL SELECT 10, 6562  FROM DUAL
    ),
    SUMS (ID, WORK_AMT, SUBSET_SUM) AS (
        SELECT 
            VALS.*, 
            SUM (WORK_AMT) OVER (ORDER BY ID)
        FROM 
            VALS
    )

-- This is now the interesting part, where we
-- calculate the closest sum
SELECT 
    ASSIGN.ID, 
    ASSIGN.ASSIGN_AMT, 
    SUBSET_SUM
FROM
    ASSIGN
JOIN
    SUMS 
ON 
    ABS (ASSIGN_AMT - SUBSET_SUM) <= ALL (
        SELECT
            ABS (ASSIGN_AMT - SUBSET_SUM) 
        FROM
            SUMS
)

The quantified comparison predicate reads intuitively. We’re looking for that specific SUBSET_SUM whose difference to the “expected” ASSIGN_AMT is smaller or equal to all the other possible differences. The above query yields:
ID  ASSIGN_AMT  SUBSET_SUM
--------------------------
1        25150       23525
2        19800       23525
3        27511       23525
In this case, it’s always the same. You may have noticed that the solution is not entirely correct in the event where WORK_AMT is allowed to be zero (let’s ignore the possibility of negative values) – in case of which we’ll produce duplicate values in the JOIN. This can be achieved when replacing:

        UNION ALL SELECT 4 , 0     FROM DUAL

Now, the result is:
ID  ASSIGN_AMT  SUBSET_SUM
--------------------------
1        25150       23525
2        19800       23525
2        19800       23525
3        27511       23525
One solution is to remove those duplicates using DISTINCT (which is an anti-pattern. See #6 in this list). A better solution is to make the JOIN predicate unambiguous by comparing ID values as well, i.e.:

SELECT 
    ASSIGN.ID, 
    ASSIGN.ASSIGN_AMT, 
    SUBSET_SUM
FROM
    ASSIGN
JOIN
    SUMS 
ON 
    (ABS (ASSIGN_AMT - SUBSET_SUM), SUMS.ID) <= ALL (
        SELECT
            ABS (ASSIGN_AMT - SUBSET_SUM),
            ID
        FROM
            SUMS
)

The above unfortunately doesn’t work in Oracle (yet), which will report an error:
ORA-01796: this operator cannot be used with lists
Oracle supports comparing tuples / row value expressions only with equal comparators, not with less than / greater than comparators – which is a shame. The same query runs smoothlessly in PostgreSQL.

Finding the “closest” of these sums with Oracle’s FIRST function

Oracle has a very interesting function to keep the first (or last) values in a set of aggregate values of a group given any particular ordering, and calculating an aggregate function only on those values within the group. The following SQL statement will illustrate this:

SELECT 
    ASSIGN.ID, 
    ASSIGN.ASSIGN_AMT, 
    MIN (SUBSET_SUM) KEEP (
        DENSE_RANK FIRST 
        ORDER BY ABS (ASSIGN_AMT - SUBSET_SUM)
    ) AS CLOSEST_SUM
FROM
    ASSIGN
CROSS JOIN 
    SUMS
GROUP BY
    ASSIGN.ID, ASSIGN.ASSIGN_AMT

Essentially, we’re grouping all values from the SUMS table for each ASSIGN_AMT. For each of those groups, we’ll look for the "FIRST" row that appears when ordering rows within the group by our criteria ABS(ASSIGN_AMT - SUBSET_SUM). We "KEEP" only those rows in the group, and retain from those rows the minimum SUBSET_SUM. This query will again yield:
ID  ASSIGN_AMT  SUBSET_SUM
--------------------------
1        25150       23525
2        19800       23525
3        27511       23525
This is an extremely nice functionality that can come in handy every once in a while. Remember that we’ve seen a similar feature recently on this blog, when we were looking for FIRST_VALUE (or LAST_VALUE) within the PARTITION of a window. In standard SQL, a similar thing can be achieved using window functions as such:

SELECT DISTINCT
    ASSIGN.ID, 
    ASSIGN.ASSIGN_AMT, 
    FIRST_VALUE (SUBSET_SUM) OVER (
        PARTITION BY ASSIGN.ID
        ORDER BY ABS (ASSIGN_AMT - SUBSET_SUM)
    ) AS CLOSEST_SUM
FROM
    ASSIGN
CROSS JOIN 
    SUMS

Unfortunately, these solutions all produce duplicates, which we have to remove either via GROUP BY (KEEP solution), or via DISTINCT (FIRST_VALUE solution).

Finding the “closest” of these sums with LATERAL JOIN

A cleaner solution that doesn’t rely on the removal of duplicates is using Oracle 12c’s new FETCH FIRST clause along with CROSS JOIN LATERAL (or CROSS APPLY, which is the same)

SELECT 
    ASSIGN.ID, 
    ASSIGN.ASSIGN_AMT, 
    CLOSEST_SUM
FROM
    ASSIGN
CROSS JOIN LATERAL (
    SELECT
        SUBSET_SUM AS CLOSEST_SUM
    FROM
        SUMS
    ORDER BY
        ABS (ASSIGN.ASSIGN_AMT - SUBSET_SUM)
    FETCH FIRST 1 ROW ONLY
) SUMS

What does this mean? We’re essentially joining for each value in ASSIGN only the FIRST 1 ROW in SUMS, ordered by the usual criteria. We need LATERAL (or APPLY), because this allows us to reference columns from the left side of the JOIN expression also in the right side, which wouldn’t be possible otherwise. The same functionality is supported in SQL Server (only using CROSS APPLY), or in PostgreSQL (only using CROSS JOIN LATERAL). Lateral can be very useful whenever the right hand side of a JOIN depends on the left hand side. Unlike ordinary joins, this means that the JOIN order will be set in stone from left to right, and the optimiser has a reduced set of join algorithm options. This is useful in examples like these (with ORDER BY and FETCH FIRST), or when joining unnested table-valued functions, which we’ll cover in a future blog post.

2. Calculating a sum of any subset of values

So far, we’ve worked on a simplified version of the problem. This is probably not what alhashmiya meant on their Stack Overflow question. They probably wanted to solve the Subset sum problem, finding the “closest” sum of any subset of WORK_AMT values. We’ll use recursive SQL to calculate all the possible sums. First off, let’s remember how recursive SQL works:
Slide taken from the jOOQ advanced SQL Training. Contact us to learn more.
Slide taken from the jOOQ advanced SQL Training. Contact us to learn more.
In recursive SQL, we need a UNION ALL query in a common table expression (WITH clause in Oracle, or WITH RECURSIVE clause in PostgreSQL). The first subquery of UNION ALL generates the “seed row(s)” of the recursion, and the second subqeury of UNION ALL generates the recursion based on a SELECT that selects from the table being declared, recursively. So, a naive solution to this subset sum problem can be seen here:

-- Repetition of the previous data
WITH 
    ASSIGN (ID, ASSIGN_AMT) AS (
                  SELECT 1, 25150 FROM DUAL 
        UNION ALL SELECT 2, 19800 FROM DUAL
        UNION ALL SELECT 3, 27511 FROM DUAL
    ),
    WORK (ID, WORK_AMT) AS (
                  SELECT 1 , 7120  FROM DUAL 
        UNION ALL SELECT 2 , 8150  FROM DUAL
        UNION ALL SELECT 3 , 8255  FROM DUAL
        UNION ALL SELECT 4 , 9051  FROM DUAL
        UNION ALL SELECT 5 , 1220  FROM DUAL
        UNION ALL SELECT 6 , 12515 FROM DUAL
        UNION ALL SELECT 7 , 13555 FROM DUAL
        UNION ALL SELECT 8 , 5221  FROM DUAL
        UNION ALL SELECT 9 , 812   FROM DUAL
        UNION ALL SELECT 10, 6562  FROM DUAL
    ),

-- A new way of calculating all possible sums by
-- Recursively adding up all the sums
    SUMS (SUBSET_SUM, MAX_ID) AS (
        SELECT 
            WORK_AMT, 
            ID
        FROM 
            WORK
        
        UNION ALL
        
        SELECT 
            WORK_AMT + SUBSET_SUM, 
            WORK.ID
        FROM 
            SUMS
        JOIN 
            WORK
        ON 
            SUMS.MAX_ID < WORK.ID
    )

-- The same technique to match the "closest" sum
-- As before
SELECT 
    ASSIGN.ID, 
    ASSIGN.ASSIGN_AMT, 
    MIN (SUBSET_SUM) KEEP (
        DENSE_RANK FIRST 
        ORDER BY ABS (ASSIGN_AMT - SUBSET_SUM)
    ) AS CLOSEST_SUM
FROM 
    SUMS
CROSS JOIN 
    ASSIGN
GROUP BY 
    ASSIGN.ID, ASSIGN.ASSIGN_AMT

The recursion is simple. In the first subquery of the recursion (the “seed row”), we select each individual row in WORK:

        SELECT 
            WORK_AMT, 
            ID
        FROM 
            WORK

In the second subquery of the recursion (the “recusion rows”), we join the value of the previous recursion step (SUMS) with all the remaining values (WORK), i.e. all the values that have a higher ID:

        SELECT 
            WORK_AMT + SUBSET_SUM, 
            WORK.ID
        FROM 
            SUMS
        JOIN 
            WORK
        ON 
            SUMS.MAX_ID < WORK.ID

In this combination, we calculate the intermediate sum (which is also a running total, by the way) and we calculate the highest summed-up ID thus far, to reduce the number of combinations. The latter, we can do because summing is commutative. The main difference in this solution compared to previous approaches is the fact that we’re now generating a lot (a huge lot) more different values in the SUMS table. After a still acceptable 0.112s for 10 different WORK_AMT values, the database calculated:
ID  ASSIGN_AMT  CLOSEST_SUM
---------------------------
1        25150        25133
2        19800        19768
3        27511        27488
But beware, as soon as you start adding values to the VALS table, this algorithm starts exploding in time and space complexity. Running the same query with the following, only slightly bigger WORK table already requires 16.3 seconds to yield a result:

    WORK(ID, WORK_AMT) AS (
                  SELECT 1 , 7120  FROM DUAL 
        UNION ALL SELECT 2 , 8150  FROM DUAL
        UNION ALL SELECT 3 , 8255  FROM DUAL
        UNION ALL SELECT 4 , 9051  FROM DUAL
        UNION ALL SELECT 5 , 1220  FROM DUAL
        UNION ALL SELECT 6 , 12515 FROM DUAL
        UNION ALL SELECT 7 , 13555 FROM DUAL
        UNION ALL SELECT 8 , 5221  FROM DUAL
        UNION ALL SELECT 9 , 812   FROM DUAL
        UNION ALL SELECT 10, 6562  FROM DUAL
        UNION ALL SELECT 11, 1234  FROM DUAL
        UNION ALL SELECT 12, 61    FROM DUAL
        UNION ALL SELECT 13, 616   FROM DUAL
        UNION ALL SELECT 14, 2456  FROM DUAL
        UNION ALL SELECT 15, 5161  FROM DUAL
        UNION ALL SELECT 16, 414   FROM DUAL
        UNION ALL SELECT 17, 516   FROM DUAL
        UNION ALL SELECT 18, 617   FROM DUAL
        UNION ALL SELECT 19, 146   FROM DUAL
    ),

And the result would be:
ID  ASSIGN_AMT  CLOSEST_SUM
---------------------------
1        25150        25150
2        19800        19800
3        27511        27511
Want proof about the actual sum? That’s easy as well with recursive SQL:

-- Repetition of the previous data
WITH 
    ASSIGN (ID, ASSIGN_AMT) AS (
                  SELECT 1, 25150 FROM DUAL 
        UNION ALL SELECT 2, 19800 FROM DUAL
        UNION ALL SELECT 3, 27511 FROM DUAL
    ),
    WORK (ID, WORK_AMT) AS (
                  SELECT 1 , 7120  FROM DUAL 
        UNION ALL SELECT 2 , 8150  FROM DUAL
        UNION ALL SELECT 3 , 8255  FROM DUAL
        UNION ALL SELECT 4 , 9051  FROM DUAL
        UNION ALL SELECT 5 , 1220  FROM DUAL
        UNION ALL SELECT 6 , 12515 FROM DUAL
        UNION ALL SELECT 7 , 13555 FROM DUAL
        UNION ALL SELECT 8 , 5221  FROM DUAL
        UNION ALL SELECT 9 , 812   FROM DUAL
        UNION ALL SELECT 10, 6562  FROM DUAL
    ),

-- A new way of calculating all possible sums by
-- Recursively adding up all the sums
    SUMS (SUBSET_SUM, MAX_ID, CALC) AS (
        SELECT 
            WORK_AMT, 
            ID, 
            TO_CHAR(WORK_AMT)
        FROM WORK
        
        UNION ALL
        
        SELECT 
            WORK_AMT + SUBSET_SUM, 
            WORK.ID,
            CALC || '+' || WORK_AMT
        FROM 
            SUMS
        JOIN 
            WORK
        ON 
            SUMS.MAX_ID < WORK.ID
    )

-- The same technique to match the "closest" sum
-- As before
SELECT 
    ASSIGN.ID, 
    ASSIGN.ASSIGN_AMT, 
    MIN (SUBSET_SUM) KEEP (
        DENSE_RANK FIRST 
        ORDER BY ABS (ASSIGN_AMT - SUBSET_SUM)
    ) AS CLOSEST_SUM,
    MIN (CALC) KEEP (
        DENSE_RANK FIRST 
        ORDER BY ABS (ASSIGN_AMT - SUBSET_SUM)
    ) AS CALCULATION
FROM 
    SUMS
CROSS JOIN 
    ASSIGN
GROUP BY 
    ASSIGN.ID, ASSIGN.ASSIGN_AMT

The above now yields:
ID  ASSIGN_AMT  CLOSEST_SUM  CALCULATION
------------------------------------------------------------
1        25150        25133  7120 + 8150 + 9051 + 812
2        19800        19768  1220 + 12515 + 5221 + 812
3        27511        27488  8150 + 8255 + 9051 + 1220 + 812

Conclusion

SQL is powerful. Extremely powerful. In this article, we’ve used various techniques to calculate the subset sum problem, or a simplification thereof. We’ve shown how to solve this problem in Oracle or PostgreSQL using a combination of these awesome SQL features:
  • Window functions
  • KEEP FIRST (in Oracle only)
  • LATERAL JOIN (or APPLY
  • Recursive SQL
Did you like this article? Would you like to learn more about advanced SQL? Contact us to enquire about our advanced SQL training sessions, where we help you understand the simplicity of set-oriented thinking and calculations with SQL. We’d like to point out that all of these solutions can be written in Java using jOOQ in a type safe manner as well. jOOQ: The best way to write SQL in Java Finally, a lot of this grounds is covered in more detail in any of the following articles. Enjoy learning, and enjoy SQL!

8 thoughts on “How to Find the Closest Subset Sum with SQL

  1. My advice for all people reading this article will be following: While provided solution is very elegant, there are however many ways to skin a cat. Using pl/sql or java to solve you problem will be a valid solution.

  2. Thanks for the outstanding tutorial and explanation. I needed to do exactly this and so far, I’ve learned a lot. Especially helpful was the calculation column that showed how the SUM was obtained. I have a further requirement however. What if instead of the amount of the assignment, I wanted a list of the work_id’s used to calculate the sum? I’ve figured out how to add-in the
    work_ids || ‘+’ || work_id
    part, but the MIN / KEEP / DENSE_RANK part is throwing me off. When I try to add that, it just gives me a much larger resultset (each assign_id having many many rows).

    Thoughts? I need the ID because i’ll need to update the “work” records to tie them together to (to persist the aggregation).

    Thanks!!!

    1. Hey, great to hear that this extremely rare problem / solution found its way your way, and that you could figure out the issue you were having :)

  3. nevermind, figured it out. i had the pk column added into the group by at the very bottom. realized i didn’t need it cause we had an aggregate function applied to it in the select list. all the extra rows are gone now. thanks once again!

  4. hey, I’m back! this time with thoughts on performance.

    my understanding (with approach #2) is that the number of possible permutations is 2^n where n represents the number of records you are summing (in this case, it would be 10 for the 10 assignments).

    i ran approach #2 across our dataset. in simple examples, it runs fast/very fast. in the example where n was 16, it ran in 1 min and 49 seconds. (not too bad but definitely the upper limit on what is acceptable because we would have to do this multiple times per day). for that effort, it had to try 65,536 possibilities. Unfortunately, in one of our use cases, i’m seeing where n would be 68. That’s 295,147,905,179,352,825,856 possibilities. I let the query run for 30 mins before killing the session. long ops seemed to indicate it was going to take over 3 hours. No way we can accept that for one occurrence.

    in our situation, we are looking for “perfect matches”. we are dealing with transactions and bank deposits, so the sums will always be equal if the data exists (or it wont match perfectly if we haven’t yet received transactions that need to be counted). is there a way to modify our approach so that if we find a perfect match, we’ll drop out and not calculate the remaining possibilities? i’m assuming i’d be able to do that if i was looping through results in pl/sql instead of letting the recursive query do the work, but i’d rather not take that approach since figuring out a way to ensure all combos are tried seems difficult (1 query seems much cleaner).

    additionally, it might be helpful to know how the execution of this query occurs. In our use-case, we actually ave a SUM()/group by in one of our WITH tables. We actually need to do a preliminary sum that we then sum again together to equal the total amount. the tables that contain this data are going to be arbitrarily large. i have them indexed which makes things quick for now, but i can’t tell if those “early” calculations are being done tons of times (one time for each recursive call). if it is constantly recalculating them, it might be easier for me to do the calc once and store it in a table and have our WITH query reference that table instead.

    lastly, oracle’s explain plan seems to have trouble predicting the cost of the total query. it seems to understand the basics of what’s going on in the WITH queries (indexes to use/etc), but can’t predict how much strain the recursion part will introduce. i was hoping to be able to fine tune this, but unfortunately, it looks like your warnings about exponential growth and scalability were correct.

    i guess the main question is, knowing that we have perfect matches and not closest sum, is there someway we can cheat around the performance limits? any direction (links/etc) would be helpful. obviously, i realize i asked a lot of questions and i don’t expect you to do our work for me!

    thanks!

  5. Thanks for a great blog post!
    I think you have a typo.
    In the text “You may have noticed that the solution is not entirely correct in the event where ASSIGN_AMT is allowed to be zero (let’s ignore the possibility of negative values) – in case of which we’ll produce duplicate values in the JOIN.” you really mean WORK_AMT, not ASSIGN_AMT.

Leave a Reply