How to Find the Closest Subset Sum with SQL

I recently stumbled upon this very interesting question on Stack Overflow. Its title is:

[How to] compare a number with sum of subset of numbers

In this article, we’ll compare the user’s imperative approach to the extremely elegant (Oracle) SQL approach, making use of a combination of these awesome SQL features:

  • Window functions
  • KEEP FIRST (in Oracle only)
  • LATERAL JOIN (or APPLY)
  • Recursive SQL

The problem

The user alhashmiya, who had asked this question, was looking for a solution to the problem of finding the “closest” sum of elements in a subset of numbers A to a set of “expected” sums B. More concretely, alhashmiya had the following two tables:

ID  ASSIGN_AMT
--------------
1        25150
2        19800
3        27511

And…

ID  WORK_AMT
------------
1       7120
2       8150
3       8255
4       9051
5       1220
6      12515
7      13555
8       5221
9        812
10      6562

The ASSIGN_AMT value is the “expected” sum. What alhashmiya was looking for is the sum of a subset of WORK_AMT values A, such that this sum is as close as possible to any of the “expected” sums. There are two ways to understand this problem:

  1. The possible “closest” sums are restricted to be the sums obtained in a strictly defined order, e.g. ordered by ID. An application of this understanding is to find out the exact moment when a well-defined, ordered running total (e.g. a bank account balance) has exceeded a certain threshold.
  2. The possible “closest” sums are unrestricted. Any unordered subset qualifies to calculate such a sum. An application of this understanding is to find a combination of discrete values to reach a target value as closely as possible, e.g. to optimise a trade.

The second understanding is called the “subset sum problem”, for which only exponential-time algorithms are known if you’re looking for an exact solution in the general case. It is important to notice that this problem will NOT scale well at all, regardless of the solution technique!

But let’s look at the simpler understanding first:

1. Calculating a sum of an ordered subset of values

By strictly ordered we mean (in the sense of the question) that we want to order all WORK_AMT values, e.g. by ID, and allow only for sums that can appear in this particular order, i.e.

ID  WORK_AMT  SUBSET_SUM
------------------------
1       7120        7120 (= WORK_AMT)
2       8150       15270 (= previous SUBSET_SUM + WORK_AMT)      
3       8255       23525 (= previous SUBSET_SUM + WORK_AMT)
4       9051       32576 (= ...)
5       1220       33796
6      12515       46311
7      13555       59866
8       5221       65087
9        812       65899
10      6562       72461

The above SUBSET_SUM value is the sum of all WORK_AMT values with ID <= "current" ID. We’ve seen this before on this blog: this is called a running total, and it is best calculated using window functions as follows:

WITH
    VALS (ID, WORK_AMT) AS (
                  SELECT 1 , 7120  FROM DUAL 
        UNION ALL SELECT 2 , 8150  FROM DUAL
        UNION ALL SELECT 3 , 8255  FROM DUAL
        UNION ALL SELECT 4 , 9051  FROM DUAL
        UNION ALL SELECT 5 , 1220  FROM DUAL
        UNION ALL SELECT 6 , 12515 FROM DUAL
        UNION ALL SELECT 7 , 13555 FROM DUAL
        UNION ALL SELECT 8 , 5221  FROM DUAL
        UNION ALL SELECT 9 , 812   FROM DUAL
        UNION ALL SELECT 10, 6562  FROM DUAL
    )
SELECT
    ID,
    WORK_AMT,
    SUM (WORK_AMT) OVER (ORDER BY ID) AS SUBSET_SUM
FROM
    VALS
ORDER BY
    ID

The above window function calculates the sum of all WORK_AMT values that are in the “window” of values where the ID is smaller than or equal to the current ID.
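
For the record, the above is equivalent to spelling out the default frame clause explicitly (with an ORDER BY clause, the implicit frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW; since ID is unique, RANGE and ROWS behave the same here):

SUM (WORK_AMT) OVER (
    ORDER BY ID
    RANGE BETWEEN UNBOUNDED PRECEDING 
          AND CURRENT ROW
) AS SUBSET_SUM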

Finding the “closest” of these sums with quantified comparison predicates

Now, the task at hand is to find, for each value ASSIGN_AMT in (25150, 19800, 27511), the closest value of SUBSET_SUM. In other words, we want to minimise the expression ABS(SUBSET_SUM - ASSIGN_AMT).

However, the MIN() aggregate function won’t help us here, because that will simply return the minimum value of this difference. We want the value of SUBSET_SUM that produces this difference in the first place.

One solution would be to use a quantified comparison predicate, a rarely used and not well-known comparison operator that works in all SQL databases:

-- The common table expressions are the same as
-- in the previous examples
WITH
    ASSIGN(ID, ASSIGN_AMT) AS (
                  SELECT 1, 25150 FROM DUAL 
        UNION ALL SELECT 2, 19800 FROM DUAL
        UNION ALL SELECT 3, 27511 FROM DUAL
    ),
    VALS (ID, WORK_AMT) AS (
                  SELECT 1 , 7120  FROM DUAL 
        UNION ALL SELECT 2 , 8150  FROM DUAL
        UNION ALL SELECT 3 , 8255  FROM DUAL
        UNION ALL SELECT 4 , 9051  FROM DUAL
        UNION ALL SELECT 5 , 1220  FROM DUAL
        UNION ALL SELECT 6 , 12515 FROM DUAL
        UNION ALL SELECT 7 , 13555 FROM DUAL
        UNION ALL SELECT 8 , 5221  FROM DUAL
        UNION ALL SELECT 9 , 812   FROM DUAL
        UNION ALL SELECT 10, 6562  FROM DUAL
    ),
    SUMS (ID, WORK_AMT, SUBSET_SUM) AS (
        SELECT 
            VALS.*, 
            SUM (WORK_AMT) OVER (ORDER BY ID)
        FROM 
            VALS
    )

-- This is now the interesting part, where we
-- calculate the closest sum
SELECT 
    ASSIGN.ID, 
    ASSIGN.ASSIGN_AMT, 
    SUBSET_SUM
FROM
    ASSIGN
JOIN
    SUMS 
ON 
    ABS (ASSIGN_AMT - SUBSET_SUM) <= ALL (
        SELECT
            ABS (ASSIGN_AMT - SUBSET_SUM) 
        FROM
            SUMS
)

The quantified comparison predicate reads intuitively. We’re looking for that specific SUBSET_SUM whose difference from the “expected” ASSIGN_AMT is smaller than or equal to all the other possible differences.

The above query yields:

ID  ASSIGN_AMT  SUBSET_SUM
--------------------------
1        25150       23525
2        19800       23525
3        27511       23525

In this case, the closest running total happens to be the same value, 23525, for all three “expected” sums.

You may have noticed that the solution is not entirely correct in the event where WORK_AMT is allowed to be zero (let’s ignore the possibility of negative values), in which case we’ll produce duplicate rows in the JOIN. This can be seen by replacing the row with ID 4 as follows:

        UNION ALL SELECT 4 , 0     FROM DUAL

Now, the result is:

ID  ASSIGN_AMT  SUBSET_SUM
--------------------------
1        25150       23525
2        19800       23525
2        19800       23525
3        27511       23525

One solution is to remove those duplicates using DISTINCT (which is an anti-pattern; see #6 in this list). A better solution is to make the JOIN predicate unambiguous by comparing ID values as well, i.e.:

SELECT 
    ASSIGN.ID, 
    ASSIGN.ASSIGN_AMT, 
    SUBSET_SUM
FROM
    ASSIGN
JOIN
    SUMS 
ON 
    (ABS (ASSIGN_AMT - SUBSET_SUM), SUMS.ID) <= ALL (
        SELECT
            ABS (ASSIGN_AMT - SUBSET_SUM),
            ID
        FROM
            SUMS
)

The above unfortunately doesn’t work in Oracle (yet), which will report an error:

ORA-01796: this operator cannot be used with lists

Oracle supports comparing tuples / row value expressions only with equality operators, not with less than / greater than operators – which is a shame. The same query runs smoothly in PostgreSQL.

Finding the “closest” of these sums with Oracle’s FIRST function

Oracle has a very interesting function to keep only the first (or last) values in a group of aggregated values, given any particular ordering, and to calculate an aggregate function only on those values within the group. The following SQL statement will illustrate this:

SELECT 
    ASSIGN.ID, 
    ASSIGN.ASSIGN_AMT, 
    MIN (SUBSET_SUM) KEEP (
        DENSE_RANK FIRST 
        ORDER BY ABS (ASSIGN_AMT - SUBSET_SUM)
    ) AS CLOSEST_SUM
FROM
    ASSIGN
CROSS JOIN 
    SUMS
GROUP BY
    ASSIGN.ID, ASSIGN.ASSIGN_AMT

Essentially, we’re grouping all values from the SUMS table for each ASSIGN_AMT. For each of those groups, we’ll look for the "FIRST" row that appears when ordering rows within the group by our criteria ABS(ASSIGN_AMT - SUBSET_SUM). We "KEEP" only those rows in the group, and retain from those rows the minimum SUBSET_SUM.

This query will again yield:

ID  ASSIGN_AMT  SUBSET_SUM
--------------------------
1        25150       23525
2        19800       23525
3        27511       23525

This is an extremely nice functionality that can come in handy every once in a while.

Remember that we’ve seen a similar feature recently on this blog, when we were looking for FIRST_VALUE (or LAST_VALUE) within the PARTITION of a window. In standard SQL, a similar thing can be achieved using window functions, as follows:

SELECT DISTINCT
    ASSIGN.ID, 
    ASSIGN.ASSIGN_AMT, 
    FIRST_VALUE (SUBSET_SUM) OVER (
        PARTITION BY ASSIGN.ID
        ORDER BY ABS (ASSIGN_AMT - SUBSET_SUM)
    ) AS CLOSEST_SUM
FROM
    ASSIGN
CROSS JOIN 
    SUMS

Unfortunately, these solutions all produce duplicates, which we have to remove either via GROUP BY (KEEP solution), or via DISTINCT (FIRST_VALUE solution).

Finding the “closest” of these sums with LATERAL JOIN

A cleaner solution that doesn’t rely on the removal of duplicates is to use Oracle 12c’s new FETCH FIRST clause along with CROSS JOIN LATERAL (or CROSS APPLY, which is the same):

SELECT 
    ASSIGN.ID, 
    ASSIGN.ASSIGN_AMT, 
    CLOSEST_SUM
FROM
    ASSIGN
CROSS JOIN LATERAL (
    SELECT
        SUBSET_SUM AS CLOSEST_SUM
    FROM
        SUMS
    ORDER BY
        ABS (ASSIGN.ASSIGN_AMT - SUBSET_SUM)
    FETCH FIRST 1 ROW ONLY
) SUMS

What does this mean? We’re essentially joining, for each value in ASSIGN, only the FIRST 1 ROW in SUMS, ordered by the usual criteria. We need LATERAL (or APPLY) because it allows us to reference columns from the left side of the JOIN expression also on the right side, which wouldn’t be possible otherwise.

The same functionality is supported in SQL Server (only using CROSS APPLY), or in PostgreSQL (only using CROSS JOIN LATERAL).
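
In SQL Server, a sketch of the same query might look like this (using TOP instead of the FETCH FIRST clause; the common table expressions from above adapted accordingly, e.g. without FROM DUAL):

SELECT 
    ASSIGN.ID, 
    ASSIGN.ASSIGN_AMT, 
    CLOSEST_SUM
FROM
    ASSIGN
CROSS APPLY (
    SELECT TOP 1
        SUBSET_SUM AS CLOSEST_SUM
    FROM
        SUMS
    ORDER BY
        ABS (ASSIGN.ASSIGN_AMT - SUBSET_SUM)
) SUMS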

LATERAL can be very useful whenever the right hand side of a JOIN depends on the left hand side. Unlike ordinary joins, this means that the JOIN order will be set in stone from left to right, and the optimiser has a reduced set of join algorithm options. This is useful in examples like these (with ORDER BY and FETCH FIRST), or when joining unnested table-valued functions, which we’ll cover in a future blog post.

2. Calculating a sum of any subset of values

So far, we’ve worked on a simplified version of the problem. This is probably not what alhashmiya meant in their Stack Overflow question. They probably wanted to solve the actual subset sum problem, finding the “closest” sum of any subset of WORK_AMT values.

We’ll use recursive SQL to calculate all the possible sums. First off, let’s remember how recursive SQL works:

In recursive SQL, we need a UNION ALL query in a common table expression (the WITH clause in Oracle, or the WITH RECURSIVE clause in PostgreSQL). The first subquery of the UNION ALL generates the “seed row(s)” of the recursion, and the second subquery generates the recursion based on a SELECT that selects from the table being declared, recursively.
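
As a minimal sketch of this structure (PostgreSQL syntax; in Oracle, omit the RECURSIVE keyword and select the seed row FROM DUAL), the following query generates the numbers 1 through 5:

WITH RECURSIVE NUMBERS (N) AS (
    -- The "seed row" of the recursion
    SELECT 1

    UNION ALL

    -- The recursion, selecting from the table being declared
    SELECT N + 1
    FROM NUMBERS
    WHERE N < 5
)
SELECT N
FROM NUMBERS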

So, a naive solution to this subset sum problem can be seen here:

-- Repetition of the previous data
WITH 
    ASSIGN (ID, ASSIGN_AMT) AS (
                  SELECT 1, 25150 FROM DUAL 
        UNION ALL SELECT 2, 19800 FROM DUAL
        UNION ALL SELECT 3, 27511 FROM DUAL
    ),
    WORK (ID, WORK_AMT) AS (
                  SELECT 1 , 7120  FROM DUAL 
        UNION ALL SELECT 2 , 8150  FROM DUAL
        UNION ALL SELECT 3 , 8255  FROM DUAL
        UNION ALL SELECT 4 , 9051  FROM DUAL
        UNION ALL SELECT 5 , 1220  FROM DUAL
        UNION ALL SELECT 6 , 12515 FROM DUAL
        UNION ALL SELECT 7 , 13555 FROM DUAL
        UNION ALL SELECT 8 , 5221  FROM DUAL
        UNION ALL SELECT 9 , 812   FROM DUAL
        UNION ALL SELECT 10, 6562  FROM DUAL
    ),

-- A new way of calculating all possible sums by
-- Recursively adding up all the sums
    SUMS (SUBSET_SUM, MAX_ID) AS (
        SELECT 
            WORK_AMT, 
            ID
        FROM 
            WORK
        
        UNION ALL
        
        SELECT 
            WORK_AMT + SUBSET_SUM, 
            WORK.ID
        FROM 
            SUMS
        JOIN 
            WORK
        ON 
            SUMS.MAX_ID < WORK.ID
    )

-- The same technique to match the "closest" sum
-- As before
SELECT 
    ASSIGN.ID, 
    ASSIGN.ASSIGN_AMT, 
    MIN (SUBSET_SUM) KEEP (
        DENSE_RANK FIRST 
        ORDER BY ABS (ASSIGN_AMT - SUBSET_SUM)
    ) AS CLOSEST_SUM
FROM 
    SUMS
CROSS JOIN 
    ASSIGN
GROUP BY 
    ASSIGN.ID, ASSIGN.ASSIGN_AMT

The recursion is simple. In the first subquery of the recursion (the “seed rows”), we select each individual row in WORK:

        SELECT 
            WORK_AMT, 
            ID
        FROM 
            WORK

In the second subquery of the recursion (the “recursion rows”), we join the value of the previous recursion step (SUMS) with all the remaining values (WORK), i.e. all the values that have a higher ID:

        SELECT 
            WORK_AMT + SUBSET_SUM, 
            WORK.ID
        FROM 
            SUMS
        JOIN 
            WORK
        ON 
            SUMS.MAX_ID < WORK.ID

In this combination, we calculate the intermediate sum (which is also a running total, by the way), and we keep track of the highest summed-up ID thus far, to reduce the number of combinations. We can do the latter because addition is commutative: it suffices to generate each subset only once, in ascending order of ID.

The main difference in this solution compared to the previous approaches is the fact that we’re now generating a lot (a huge lot) more values in the SUMS table.
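
To see just how many, we could count them, reusing the common table expressions from above (with 10 values, there are 2^10 - 1 = 1023 non-empty subsets, and the recursion generates each of them exactly once):

SELECT COUNT(*) AS SUBSET_COUNT
FROM SUMS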

After a still acceptable 0.112s for 10 different WORK_AMT values, the database calculated:

ID  ASSIGN_AMT  CLOSEST_SUM
---------------------------
1        25150        25133
2        19800        19768
3        27511        27488

But beware, as soon as you start adding values to the WORK table, this algorithm starts exploding in time and space complexity: with 19 values, there are already 2^19 - 1 = 524287 possible subset sums. Running the same query with the following, only slightly bigger WORK table already requires 16.3 seconds to yield a result:

    WORK(ID, WORK_AMT) AS (
                  SELECT 1 , 7120  FROM DUAL 
        UNION ALL SELECT 2 , 8150  FROM DUAL
        UNION ALL SELECT 3 , 8255  FROM DUAL
        UNION ALL SELECT 4 , 9051  FROM DUAL
        UNION ALL SELECT 5 , 1220  FROM DUAL
        UNION ALL SELECT 6 , 12515 FROM DUAL
        UNION ALL SELECT 7 , 13555 FROM DUAL
        UNION ALL SELECT 8 , 5221  FROM DUAL
        UNION ALL SELECT 9 , 812   FROM DUAL
        UNION ALL SELECT 10, 6562  FROM DUAL
        UNION ALL SELECT 11, 1234  FROM DUAL
        UNION ALL SELECT 12, 61    FROM DUAL
        UNION ALL SELECT 13, 616   FROM DUAL
        UNION ALL SELECT 14, 2456  FROM DUAL
        UNION ALL SELECT 15, 5161  FROM DUAL
        UNION ALL SELECT 16, 414   FROM DUAL
        UNION ALL SELECT 17, 516   FROM DUAL
        UNION ALL SELECT 18, 617   FROM DUAL
        UNION ALL SELECT 19, 146   FROM DUAL
    ),

And the result would be:

ID  ASSIGN_AMT  CLOSEST_SUM
---------------------------
1        25150        25150
2        19800        19800
3        27511        27511

Want proof of the actual sums? That’s easy as well with recursive SQL:

-- Repetition of the previous data
WITH 
    ASSIGN (ID, ASSIGN_AMT) AS (
                  SELECT 1, 25150 FROM DUAL 
        UNION ALL SELECT 2, 19800 FROM DUAL
        UNION ALL SELECT 3, 27511 FROM DUAL
    ),
    WORK (ID, WORK_AMT) AS (
                  SELECT 1 , 7120  FROM DUAL 
        UNION ALL SELECT 2 , 8150  FROM DUAL
        UNION ALL SELECT 3 , 8255  FROM DUAL
        UNION ALL SELECT 4 , 9051  FROM DUAL
        UNION ALL SELECT 5 , 1220  FROM DUAL
        UNION ALL SELECT 6 , 12515 FROM DUAL
        UNION ALL SELECT 7 , 13555 FROM DUAL
        UNION ALL SELECT 8 , 5221  FROM DUAL
        UNION ALL SELECT 9 , 812   FROM DUAL
        UNION ALL SELECT 10, 6562  FROM DUAL
    ),

-- A new way of calculating all possible sums by
-- Recursively adding up all the sums
    SUMS (SUBSET_SUM, MAX_ID, CALC) AS (
        SELECT 
            WORK_AMT, 
            ID, 
            TO_CHAR(WORK_AMT)
        FROM WORK
        
        UNION ALL
        
        SELECT 
            WORK_AMT + SUBSET_SUM, 
            WORK.ID,
            CALC || '+' || WORK_AMT
        FROM 
            SUMS
        JOIN 
            WORK
        ON 
            SUMS.MAX_ID < WORK.ID
    )

-- The same technique to match the "closest" sum
-- As before
SELECT 
    ASSIGN.ID, 
    ASSIGN.ASSIGN_AMT, 
    MIN (SUBSET_SUM) KEEP (
        DENSE_RANK FIRST 
        ORDER BY ABS (ASSIGN_AMT - SUBSET_SUM)
    ) AS CLOSEST_SUM,
    MIN (CALC) KEEP (
        DENSE_RANK FIRST 
        ORDER BY ABS (ASSIGN_AMT - SUBSET_SUM)
    ) AS CALCULATION
FROM 
    SUMS
CROSS JOIN 
    ASSIGN
GROUP BY 
    ASSIGN.ID, ASSIGN.ASSIGN_AMT

The above now yields:

ID  ASSIGN_AMT  CLOSEST_SUM  CALCULATION
------------------------------------------------------------
1        25150        25133  7120+8150+9051+812
2        19800        19768  1220+12515+5221+812
3        27511        27488  8150+8255+9051+1220+812

Conclusion

SQL is powerful. Extremely powerful. In this article, we’ve used various techniques to solve the subset sum problem, or a simplification thereof. We’ve shown how to solve this problem in Oracle or PostgreSQL using a combination of these awesome SQL features:

  • Window functions
  • KEEP FIRST (in Oracle only)
  • LATERAL JOIN (or APPLY)
  • Recursive SQL

Did you like this article? Would you like to learn more about advanced SQL? Contact us to enquire about our advanced SQL training sessions, where we help you understand the simplicity of set-oriented thinking and calculations with SQL.

We’d like to point out that all of these solutions can be written in Java using jOOQ in a type safe manner as well.

jOOQ: The best way to write SQL in Java

Finally, a lot of this ground is covered in more detail in the following articles. Enjoy learning, and enjoy SQL!

The Danger of Subtype Polymorphism Applied to Tuples

Java 8 has lambdas and streams, but no tuples, which is a shame. This is why we have implemented tuples in jOOλ – Java 8’s missing parts. Tuples are really boring value type containers. Essentially, they’re just an enumeration of types like these:

public class Tuple2<T1, T2> {
    public final T1 v1;
    public final T2 v2;

    public Tuple2(T1 v1, T2 v2) {
        this.v1 = v1;
        this.v2 = v2;
    }

    // [...]
}


public class Tuple3<T1, T2, T3> {
    public final T1 v1;
    public final T2 v2;
    public final T3 v3;

    public Tuple3(T1 v1, T2 v2, T3 v3) {
        this.v1 = v1;
        this.v2 = v2;
        this.v3 = v3;
    }

    // [...]
}

Writing tuple classes is a very boring task, and it’s best done using a source code generator.

Tuples in other languages and APIs

jOOλ‘s current version features tuples of degrees 0 – 16. C# and other .NET languages have tuple types of degrees 1 – 8. There’s a special library just for tuples called Javatuples with tuples of degrees 1 to 10, and the authors went the extra mile and gave the tuples individual English names:

Unit<A> // (1 element)
Pair<A,B> // (2 elements)
Triplet<A,B,C> // (3 elements)
Quartet<A,B,C,D> // (4 elements)
Quintet<A,B,C,D,E> // (5 elements)
Sextet<A,B,C,D,E,F> // (6 elements)
Septet<A,B,C,D,E,F,G> // (7 elements)
Octet<A,B,C,D,E,F,G,H> // (8 elements)
Ennead<A,B,C,D,E,F,G,H,I> // (9 elements)
Decade<A,B,C,D,E,F,G,H,I,J> // (10 elements)

Why?

because Ennead really rings that sweet bell when I see it

Last, but not least, jOOQ also has a built-in tuple-like type, the org.jooq.Record, which serves as a base type for nice subtypes like Record7<T1, T2, T3, T4, T5, T6, T7>. jOOQ follows Scala and defines records up to a degree of 22.

Watch out when defining tuple type hierarchies

As we have seen in the previous example, Tuple3 has much code in common with Tuple2.

As we’re all massively brain-damaged by decades of object orientation and polymorphic design anti-patterns, we might think that it would be a good idea to let Tuple3<T1, T2, T3> extend Tuple2<T1, T2>, as Tuple3 just adds one more attribute to the right of Tuple2, right? So…

public class Tuple3<T1, T2, T3> extends Tuple2<T1, T2> {
    public final T3 v3;

    public Tuple3(T1 v1, T2 v2, T3 v3) {
        super(v1, v2);
        this.v3 = v3;
    }

    // [...]
}

The truth is: That’s about the worst thing you could do, for several reasons. First off, yes. Both Tuple2 and Tuple3 are tuples, so they do have some common features. It’s not a bad idea to group those features in a common super type, such as:

public class Tuple2<T1, T2> implements Tuple {
    // [...]
}

But the degree is not one of those things. Here’s why:

Permutations

Think about all the possible tuples that you can form. If you let tuples extend each other, then a Tuple5 would also be assignment-compatible with a Tuple2, for instance. The following would compile perfectly:

Tuple2<String, Integer> t2 = tuple("A", 1, 2, 3, "B");

When letting Tuple3 extend Tuple2, it may have seemed like a good default choice to just drop the right-most attribute from the tuple in the extension chain.

But in the above example, why wouldn’t I want to re-assign (v2, v4), such that the result is (1, 3), or maybe (v1, v3), such that the result is ("A", 2)?

There is a tremendous number of permutations of possible attributes that could be of interest when “reducing” a higher-degree tuple to a lower-degree one. There’s no way a default of dropping the right-most attribute will be sufficiently general for all use-cases.

Type systems

Much worse than the above, there would be drastic implications for the type system if Tuple3 extended Tuple2. Check out the jOOQ API, for instance. In jOOQ, you can safely assume the following:

// Compiles:
TABLE1.COL1.in(select(TABLE2.COL1).from(TABLE2))

// Must not compile:
TABLE1.COL1.in(select(TABLE2.COL1, TABLE2.COL2).from(TABLE2))

The first IN predicate is correct. The left hand side of the predicate has a single column (as opposed to being a row value expression). This means that the right hand side of the predicate must also operate on single-column expressions, e.g. a SELECT subquery that selects a single column (of the same type).

The second example selects too many columns, and the jOOQ API will tell the Java compiler that this is wrong.

This is guaranteed by jOOQ via the Field.in(Select) method, whose signature reads:

public interface Field<T> {
    ...
    Condition in(Select<? extends Record1<T>> select);
    ...
}

So, you can provide a SELECT statement that produces any subtype of the Record1<T> type.

Luckily, Record2 does not extend Record1

If now Record2 extended Record1, which might have seemed like a good idea at first, the second query would suddenly compile:

// This would now compile
TABLE1.COL1.in(select(TABLE2.COL1, TABLE2.COL2).from(TABLE2))

… even if it forms an invalid SQL statement. It would compile because it would generate a Select<Record2<Type1, Type2>> type, which would be a subtype of the expected Select<Record1<Type1>> from the Field.in(Select) method.

Conclusion

Tuple2 and Tuple5 types are fundamentally incompatible types. In strong type systems, you mustn’t be lured into thinking that similar types, or related types should also be compatible types.

Type hierarchies are something very object-oriented, and by object-oriented, I mean the flawed and over-engineered notion of object orientation that we’ve been suffering from since the 90s. Even in “the Enterprise”, most people have learned to favour Composition over Inheritance. Composition, in the case of tuples, means that you can easily transform a Tuple5 into a Tuple2. But you cannot assign it.

In jOOλ, such a transformation can be done very easily as follows:

// Produces (1, 3)
Tuple2<String, Integer> t2_4 = 
    tuple("A", 1, 2, 3, "B")
    .map((v1, v2, v3, v4, v5) -> tuple(v2, v4));

// Produces ("A", 2)
Tuple2<String, Integer> t1_3 = 
    tuple("A", 1, 2, 3, "B")
    .map((v1, v2, v3, v4, v5) -> tuple(v1, v3));

The idea is that you operate on immutable values, and that you can easily extract parts of those values and map / recombine them to new values.

Read more

If you’ve enjoyed reading this article, you might also like to learn why recursive generics are a terrible idea (in many situations).

jOOQ Tuesdays: Markus Winand is on a Modern SQL Mission

Welcome to the jOOQ Tuesdays series. In this series, we’ll publish an article on the third Tuesday every other month where we interview someone we find exciting in our industry from a jOOQ perspective. This includes people who work with SQL, Java, Open Source, and a variety of other related topics.

We are excited to talk with Markus Winand in this sixth edition. Markus is the author of the popular book SQL Performance Explained and the even more popular website Use The Index, Luke, and we’re thrilled to see that he’s pulling off another stunt:

Hi Markus – You have recently launched modern-sql.com. What is your goal with this website?

My goal for modern-sql.com is to create a textbook and reference about the SQL goodies you didn’t learn in school or university. Interestingly, online manuals about these features are pretty sparse. They come in two fashions: blog posts and vendor documentation. Blog posts are usually one-off events covering a particular feature or use case. There are many great blogs out there – the jOOQ blog being one of them – but there is no one I could recommend to learn all about recent SQL features. Vendor documentation, on the other hand, is mostly a reference about syntax—quite often even a bad one: they often don’t mention standard compliance at all and tend to follow a “proprietary features first” approach.

The consequence is that the SQL market is very fragmented: besides SQL-92, there is no obvious base that is common to all databases. This becomes particularly evident on the job market: job offers either require just SQL—meaning good old relational SQL—or they require experience with a specific product. That’s pretty much the norm nowadays and nobody questions it. However, how would you think about this job opening: “Google Chrome Web Developer.” Web developers can’t choose the client’s browser. Many tried, but failed. Remember “optimized for XYZ”? That’s why web developers demanded standard conformance from browsers over the past decades. Just having launched a new website I can say that CSS conformance has improved drastically over the past five years. Ultimately, I’d like the same thing to happen for SQL. I hope that modern-sql.com sparks interest in standard conforming SQL so that developers also start to demand standard conformance from the database vendors. Quite an ambitious goal.

Last year, you’ve gone into battle against the SQL OFFSET clause. Want to shed some light on the background of that campaign?

The most striking problem with OFFSET is that it is generally used for an invalid use case: pagination. In this case, OFFSET is used to skip over a number of rows with the intention of finding the rows following the previously selected ones. However, OFFSET does, by definition, not return the rows following one that was selected earlier; it just discards the first N rows of the result. Coincidentally, OFFSET yields the expected result if the data has not changed in the meantime—a case that is pretty common during development. But as soon as rows are added or deleted, discarding a fixed number of rows just doesn’t give the right result. The correct approach is to remember the last row fetched and use this data in a WHERE clause to select the rows following it. This approach is explained in detail at http://use-the-index-luke.com/no-offset.
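
To illustrate the difference with a sketch (using a hypothetical SALES table; the row value comparison is supported e.g. by PostgreSQL):

-- OFFSET pagination: blindly discards the first 10 rows,
-- drifting when rows are inserted or deleted in the meantime
SELECT *
FROM SALES
ORDER BY SALE_DATE DESC, ID DESC
OFFSET 10 ROWS FETCH NEXT 10 ROWS ONLY

-- "Seek" pagination: remembers the last row fetched and
-- selects only the rows following it
SELECT *
FROM SALES
WHERE (SALE_DATE, ID) < (:last_date, :last_id)
ORDER BY SALE_DATE DESC, ID DESC
FETCH FIRST 10 ROWS ONLY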

Besides the fact that OFFSET cannot be used to implement correct pagination, OFFSET is also bad for performance. OFFSET is wrong and slow. What else do you need? As a matter of fact, the only valid use case I know for OFFSET is to implement SLEEP in SQL—not that I ever need that. Unfortunately, OFFSET made it into the SQL standard in 2011. I consider this the worst mistake in the recent history of SQL because it can’t be corrected. The only good part is that it is an optional feature—vendors don’t need to implement it for standard conformance. Nevertheless, Oracle and Microsoft just recently added OFFSET to their SQL databases.

You’ve written a very popular book on SQL Performance called SQL Performance explained. How does it compare to other SQL books and why should our readers buy it?

I’ll start with the second question. First of all, you must know that the full content of SQL Performance Explained is available for free at http://use-the-index-luke.com/. Most people I ask why they bought the book did so because they like the website. They bought the book either to support my work on Use The Index, Luke (greatly appreciated!) or, more importantly, to finally read the book from cover to cover. A typical answer I get goes along these lines: “I knew Use The Index, Luke for years and have read many articles there, but I finally wanted to read everything from beginning to end.”

Now coming to the first question, why the world needed another SQL performance tome: it didn’t. Therefore, I wrote a very small book that can be read in less than a day. I focused on the basic concepts, which are the same in most databases, and boldly skipped less common special cases. Its shortness is also most appreciated in the reviews. On the other hand, the book has occasionally been criticized as being incomplete—probably because the sub-title reads “Everything developers need to know about SQL performance”. Personally, I think these critiques somehow prove my point: obviously, Java, PHP or .NET developers don’t need to know as much about SQL performance as database performance consultants. When writing for such an audience, you must skip a lot.

Where do you see SQL in 10 years from today?

I hope that the temporal features of SQL:2011 (see here) become commonly available—also in free open source databases. At the moment, they are only available in commercial databases—even there the completeness and standard conformance varies. I would also hope that the SQL standard finds a way to cope with the current trend that every database vendor adds its own proprietary set of JSON functions. Unfortunately, it might be too late for that already.

However, my greatest hope is that developers realize that SQL is not stuck in 1992. The standard has added many useful features since then. Most databases offer a good part of these features. It’s really just our perception of SQL that got stuck in 1992.

Learn more about Markus’s work

… Markus is giving his Modern SQL talk at conferences. Learn more about it here:

The 10 Most Popular DB Engines (SQL and NoSQL) in 2015

About two years ago, we published this post about the 10 most popular DB engines, in which we analyzed the data published by Solid IT on their DB-Engines Ranking website.

In the meantime, the Solid IT measurement system has been established as a credible source, such that the website has also been cited by Gartner, InfoWorld, and many other sources. For more details about how this is measured, check out the relevant page on db-engines.com:
http://db-engines.com/en/ranking_definition

Comparing the top 10 list, we can see that the players have shifted heavily:

Reproduced with permission of DB-Engines.com

The top 3 elephants are still Oracle, MySQL and Microsoft SQL Server, but the runner-ups have changed. While PostgreSQL is still gaining traction, it has lost ground compared to MongoDB.

Also, Cassandra and Redis have pushed out Sybase and Teradata from the top 10!

Where two years ago there had been only a single non-RDBMS in this top 10 list, there are now three, all of which are “schema-less”, and they’re gaining additional momentum.

Clearly, vendors of RDBMS will need to move quickly to accommodate the needs of document storage and key-value storage the way their new competitors do.

For us on the jOOQ blog, a much more interesting perspective is to see where our supported databases currently are in this ranking:

Reproduced with permission of DB-Engines.com

With the recent release of jOOQ 3.7, we’ve added another three databases to our list of now 21 supported RDBMS. Compared to last year’s ranking, almost all of these RDBMS are gaining traction as well, apart from SQL Server.

One thing is certain: We still have very exciting times ahead. Stay tuned for more updates, and check out the current ranking here:

http://db-engines.com/en/ranking

Semi Join and Anti Join Should Have Their Own Syntax in SQL

Relational algebra nicely describes, from a more abstract, formal perspective, the various operations that we also know in SQL. One of the most common relational JOIN operations is the “equi-join”, or SQL INNER JOIN.

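In SQL, such an equi-join might read like this (a sketch using the standard Sakila column names):

SELECT ACTOR.FIRST_NAME, ACTOR.LAST_NAME, FILM.TITLE
FROM ACTOR
JOIN FILM_ACTOR
  ON ACTOR.ACTOR_ID = FILM_ACTOR.ACTOR_ID
JOIN FILM
  ON FILM_ACTOR.FILM_ID = FILM.FILM_ID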

The above example “equi-joins” the ACTOR, FILM_ACTOR, and FILM tables from the Sakila database, in order to produce a new relation consisting of all the actors and all their associated films.

Relational operators without equivalent SQL syntax

In most cases, SQL is much, much more powerful than relational algebra. However, there are three operators in relational algebra that have no exact representation in SQL, and can only be expressed through “workarounds”. These operators are:

  • Semi join
  • Anti join
  • Relational division

We’ll be looking only at the first two in this article.

The Wikipedia article on relational algebra nicely explains semi join and anti join visually:

Semi join

(Figure from the Wikipedia article: the semi join Employee ⋉ Dept)

As you can see, the semi join relation Employee ⋉ Dept only contains attributes from the Employee relation, not from the Dept relation. “Semi” means that we don’t really join the right hand side, we only check if a join would yield results for any given tuple.

In SQL, we would write the same relation using IN or EXISTS:

-- IN
SELECT *
FROM Employee
WHERE DeptName IN (
  SELECT DeptName
  FROM Dept
)

-- EXISTS
SELECT *
FROM Employee
WHERE EXISTS (
  SELECT 1
  FROM Dept
  WHERE Employee.DeptName = Dept.DeptName
)

Anti join

(Figure from the Wikipedia article: the anti join Employee ▷ Dept)

As you can see, the anti join relation Employee ▷ Dept only contains attributes from the Employee relation, not from the Dept relation. “Anti” means that we don’t really join the right hand side, we only check if a join would NOT yield results for any given tuple.

In SQL, we would write the same relation using NOT IN or NOT EXISTS (although, in the case of NOT IN, we need to be extra careful with NULLs: if the subquery returns even a single NULL value, the NOT IN predicate never evaluates to TRUE, and the query returns no rows):

-- NOT IN
SELECT *
FROM Employee
WHERE DeptName NOT IN (
  SELECT DeptName
  FROM Dept
)

-- NOT EXISTS
SELECT *
FROM Employee
WHERE NOT EXISTS (
  SELECT 1
  FROM Dept
  WHERE Employee.DeptName = Dept.DeptName
)

A better SQL with native SEMI JOIN / ANTI JOIN

While the above IN / NOT IN and EXISTS / NOT EXISTS predicates are useful, they are not nearly as expressive as native SEMI JOIN or ANTI JOIN support would be. Imagine we could write the above statements like this instead:

Semi join

-- Natural semi join
SELECT *
FROM Employee
NATURAL LEFT SEMI JOIN Dept

-- Semi join with USING clause
SELECT *
FROM Employee
LEFT SEMI JOIN Dept USING (DeptName)

-- Semi join with ON clause
SELECT *
FROM Employee e
LEFT SEMI JOIN Dept d ON e.DeptName = d.DeptName

Anti join

-- Natural anti join
SELECT *
FROM Employee
NATURAL LEFT ANTI JOIN Dept

-- Anti join with USING clause
SELECT *
FROM Employee
LEFT ANTI JOIN Dept USING (DeptName)

-- Anti join with ON clause
SELECT *
FROM Employee e
LEFT ANTI JOIN Dept d ON e.DeptName = d.DeptName

With all of the above options, SQL would be a much more concise language for those cases where we’d like to quickly semi/anti join two relations. In fact, many developers accidentally use INNER JOIN instead, because an INNER JOIN can implement a SEMI JOIN when joining a 1:1 or a M:1 relationship. But once they get used to abusing INNER JOIN, they’ll do so for 1:N and M:N relationships as well, ending up with duplicates and removing those again with DISTINCT (see item #6 on this list of 10 common SQL mistakes), as the sketch below shows.
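
A sketch of that anti-pattern, using the Sakila tables from before to find all actors appearing in at least one film:

-- Anti-pattern: the INNER JOIN produces one row per film,
-- and DISTINCT removes the duplicates again
SELECT DISTINCT ACTOR.*
FROM ACTOR
JOIN FILM_ACTOR
  ON ACTOR.ACTOR_ID = FILM_ACTOR.ACTOR_ID

-- The intended semi join, e.g. via EXISTS
SELECT *
FROM ACTOR
WHERE EXISTS (
  SELECT 1
  FROM FILM_ACTOR
  WHERE ACTOR.ACTOR_ID = FILM_ACTOR.ACTOR_ID
)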

Interestingly enough, Cloudera Impala’s SQL dialect supports these JOIN syntaxes:

SELECT select_list FROM
  table_or_subquery1 [INNER] JOIN table_or_subquery2 |
  table_or_subquery1 
    {LEFT [OUTER] | RIGHT [OUTER] | FULL [OUTER]} 
      JOIN table_or_subquery2 |
  table_or_subquery1 
    {LEFT | RIGHT} SEMI JOIN table_or_subquery2 |
  table_or_subquery1 
    {LEFT | RIGHT} ANTI JOIN table_or_subquery2 |
    [ ON col1 = col2 [AND col3 = col4 ...] |
      USING (col1 [, col2 ...]) ]
  [other_join_clause ...]
[ WHERE where_clauses ]

And so will jOOQ 3.7

jOOQ, the best way to write SQL in Java

With jOOQ 3.7, you can now write exactly this useful short form:

Semi join

ctx.select()
   .from(Employee)
   .leftSemiJoin(Dept)
   .on(Employee.DeptName.eq(Dept.DeptName))
   .fetch();

Anti join

ctx.select()
   .from(Employee)
   .leftAntiJoin(Dept)
   .on(Employee.DeptName.eq(Dept.DeptName))
   .fetch();

jOOQ will make sure that the generated SQL correctly renders an equivalent [ NOT ] EXISTS predicate, regardless of how many JOIN expressions you choose to write.

Conclusion

SQL is still a moving target. Many, many years after relational algebra was made usefully accessible to our industry via SQL, we still do not have native support for all relational operators. Semi join and anti join are two of them; division is a third.

Cloudera Impala has shown how easy this syntax could be in an actual DBMS. We followed suit and added support as well.

Dear RDBMS vendors: Please add native SEMI JOIN and ANTI JOIN to your databases. Thank you.

A Beginner’s Guide to Using Java EE with jOOQ

Java EE ships with its own persistence API: JPA. JPA is most powerful when you want to map your RDBMS entities (tables / relations) to Java entities (classes), mostly following a 1:1 mapping strategy. The idea behind this is that business logic often isn’t really set-oriented, like relational algebra and SQL are, but record-oriented, meaning that business rules and business logic are applied to individual records.

In other words, where SQL and relational algebra are about values (tuples), JPA is about identity and state (of individual records). And this is where JPA shines, because:

Life is too short to write CRUD with SQL

But as Gavin King always said:

RDBMS are not just about CRUD

Gavin King was well aware of the OLAP hype that was going on at the time he started working on Hibernate, the most popular JPA implementation. Business intelligence, or data science as it is called nowadays, relies on much more advanced functionality than simple CRUD – functionality that has never been targeted by the JPA specification, or by its implementations.

In fact, you don’t necessarily have to do OLAP to benefit from native SQL, simpler use-cases in more ordinary OLTP environments can appear as well, such as

  • Reporting
  • Batch and bulk data processing
  • Query with complex business rules

While JPA offers JPQL and Criteria API, which will help you express some amount of complexity in your queries, you will eventually be limited by the features offered in those languages and APIs, as Michael Simons has recently documented in an interesting Criteria API to jOOQ comparison.

For this reason, all JPA implementations offer a way to query the database using “native SQL”. In a previous blog post, we’ve shown how you can leverage jOOQ’s type safe DSL API to run SQL queries via JPA’s native query API, and then fetch results…

In the above cases, jOOQ is used only as a SQL query builder, while query execution is left to JPA.

Do all database querying with jOOQ, in Java EE

Remember jOOQ’s philosophy:

jOOQ is essentially type safe JDBC. Nothing more.

Even if you can use JPA to execute native SQL, you don’t have to. You can operate directly on a JDBC level, something that is often required with JPA, e.g. when working…

  • … with vendor-specific data types
  • … with non-trivial stored procedures
  • … with statement batches
  • … with updatable cursors

When you run your application on an application server, you can pick the features that you want and need, and use proprietary APIs (such as jOOQ, which runs on top of JDBC) for the rest. For instance, you can use:

  • EJB for session and scope management
  • CDI for dependency injection
  • jOOQ for your database interaction

(you could also add JTA to the stack – for simplicity reasons, we’ll skip that for now)

The procedure is simple: Just inject a javax.sql.DataSource into your session bean using CDI:

@Stateless
public class LibraryEJB {

    @Resource(lookup="java:data-source-configuration")
    private DataSource ds;
}

… and start working with it using JDBC:

public List<Author> fetchAuthors() 
throws SQLException {
    List<Author> result = new ArrayList<>();

    // Get a Connection from the injected DataSource
    try(Connection con = ds.getConnection();
        PreparedStatement stmt = con.prepareStatement(
            "SELECT * FROM AUTHOR ORDER BY ID");
        ResultSet rs = stmt.executeQuery()
    ) {
        // Iterate over all rows in the ResultSet
        while (rs.next()) {
            result.add(new Author(
                rs.getInt("ID"),
                rs.getString("FIRST_NAME"),
                rs.getString("LAST_NAME")
            ));
        }
    }

    return result;
}

… or using jOOQ:

public Result<AuthorRecord> fetchAuthors() {

    // Pass the injected DataSource to jOOQ
    return DSL.using(ds, H2)
              .selectFrom(AUTHOR)
              .orderBy(AUTHOR.ID)
              .fetch();
}

Notice how jOOQ – by default – fetches all results eagerly into memory, closing resources like the JDBC Connection, PreparedStatement, and ResultSet eagerly, such that you’re not required to deal with the hassle of resource management yourself.

Again:

jOOQ is essentially type safe JDBC. Nothing more.

JDBC has always been an important part of Java EE applications, for all sorts of reasons, including access to vendor-specific features. jOOQ adds compile-time type safety on top of JDBC. Nothing more. Whatever works with JDBC will work with jOOQ.

In particular, jOOQ will never interfere with your transaction or session model, regardless of the choice you make. All that is needed by jOOQ is a JDBC Connection or DataSource.

Running an example in JBoss WildFly

The above example can be checked out from GitHub and run directly in WildFly, for example – or with only little adaptations in any other Java EE application server:

https://github.com/jOOQ/jOOQ/tree/master/jOOQ-examples/jOOQ-javaee-example

The example was created for WildFly in the context of a Webinar with Arun Gupta. The webinar answers the following questions:

  • What is jOOQ ?
  • Why JOOQ when there is JDBC and JPA ?
  • How does it fit with Java EE apps ? Does it uses underlying JPA persistence provider or some other connection ?
  • Pros/cons over JPA ? Pure Hibernate ?
  • How well does it scale ?
  • Show code sample in a Java EE application
  • jOOQ for CRUD-based or domain-rich application ?
  • How can eventually all the work in jOOQ be integrated in JPA and be standardized ? Or would it be more of JDBC ?

The full webinar can be seen on YouTube, here:

How to Quickly Enumerate Indexes in Oracle 11gR2

Do you want to know real quick what kind of indexes there are on any given table in your Oracle schema? Nothing simpler than that. Just run the following query:

SELECT
  i.index_name,
  listagg(c.column_name, ', ') 
    WITHIN GROUP (ORDER BY c.column_position)
    AS columns
FROM all_indexes i
JOIN all_ind_columns c 
  ON i.index_name = c.index_name
WHERE i.table_name = 'FILM_ACTOR'
GROUP BY i.index_name

The above query is run against the Sakila database. Just replace the “FILM_ACTOR” table with your table and you’re all set. The result looks like this:

INDEX_NAME                COLUMNS
-------------------------------------------
IDX_FK_FILM_ACTOR_ACTOR   ACTOR_ID
IDX_FK_FILM_ACTOR_FILM    FILM_ID
SYS_C007155               ACTOR_ID, FILM_ID

Sometimes, it’s the simple things…

Explanation about LISTAGG

LISTAGG is a so-called ordered-set aggregate function, added in 11gR2, i.e. an aggregate function that aggregates the data “WITHIN a GROUP”, given a specific “ORDER”. This is obviously useful for string concatenation, but the same mechanism can also be very useful for a variety of other functions, such as inverse distribution functions like MEDIAN, which we’ve blogged about here.

The syntax is:

function(...) WITHIN GROUP (ORDER BY ...)
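
For instance, Oracle’s MEDIAN can equivalently be expressed via the PERCENTILE_CONT inverse distribution function, which uses the same syntax (a sketch against Sakila’s FILM table):

SELECT
  percentile_cont(0.5)
    WITHIN GROUP (ORDER BY length)
    AS median_length
FROM film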