jOOQ Tuesdays: Chris Saxon Explains the 3 Things Every Developer Should Know About SQL

Welcome to the jOOQ Tuesdays series. In this series, we’ll publish an article on the third Tuesday every other month (today, exceptionally on a Wednesday because of technical issues) where we interview someone we find exciting in our industry from a jOOQ perspective. This includes people who work with SQL, Java, Open Source, and a variety of other related topics.

chris-saxon-headshot[1]

I’m very excited to feature today Chris Saxon who has worked with Oracle forever, and who is one of the brains behind the famous Ask Tom website.

Chris, you’re part of the famous Ask Tom team. Everyone working with Oracle has wound up on Ask Tom’s website at least once. You’ve answered an incredible amount of questions already. What’s it like to work for such a big community as Oracle’s?

It’s an amazing experience! My first real job was as a PL/SQL developer. My only knowledge of SQL was a couple of vaguely remembered lectures at university. Ask Tom was the main place I learned about SQL and Oracle Database. So it’s a huge honor to be on the other side, helping others get the best out of the technology.

The best part has to be the positive comments when you help someone solve a problem that’s been troubling them for days. That’s why we’re here. To help developers learn more about Oracle and improve their SQL skills. When you use the database to its full extent, you can write better, faster applications with less code!

What were your three most interesting questions, so far?

Any question that has a clear definition and a complete test case is interesting! ;) Personally I enjoy using SQL to solve complex problems the best. So the first two do just that:

1. Finding the previous row in a different group

The poster had a series of transactions. These were grouped into two types. For each row, they wanted to show the id of the previous transaction from the other group.

At first this sounds like a problem you can solve using LAG or LEAD. But these only search for values within the same group. So you need a different method.

I provided a solution using the model clause. Using this, you can generate columns based on complex, spreadsheet-like formulas. Rows in your table are effectively cells in the sheet. You identify them by defining dimensions which can be other columns or expressions. By setting the transaction type as a dimension, you can then easily reference – and assign – values from one type to the other.

This worked well. But commenters were quick to provide solutions using window functions and 12c’s match_recognize clause. Both of which were faster than my approach!

I like this because it shows the flexibility of SQL. And it shows the value of an engaged community. No one knows everything. By sharing our knowledge and workin together we can all become better developers.

2. Improving SQL that deliberately generates cartesian product

The poster had a set of abbreviations for words. For example, Saint could also be “St.” or “St”. They wanted to take text containing these words. Then generate all combinations of strings using these abbreviations.

The “obvious” solution the user had is to split the text into words. Then for each word, join the abbreviation table, replacing the string as needed. So for a five word string, you have five joins.

There are a couple of problems with this method. The number of joins limits the number of words. So if you have a string with seven words, but only six table joins you won’t abbreviate the final word.

The other issue is performance. Joining the same table N times increases the work you do. If you have long sentences and/or a large number of abbreviations, the query could take a long time to run.

To overcome these you need to ask: “how can I join to the abbreviation table just once?”

The solution to do this starts the same as the original. Split the sentence into a table of words. Then join this to the abbreviations to give a row for each replacement needed.

You can then recursively walk down these rows using CTEs. These build up the sentence again, replacing words with their abbreviations as needed. A scalable solution that only needs a single pass of each table!

The final question relates to performance. Tom Kyte’s mantra was always “if you can do it in SQL, do it in SQL”. The reason is because a pure SQL solution is normally faster than one which combines SQL and other code. Yet a question came in that cast doubt on this:

3. Difference in performance SQL vs PL/SQL

The poster was updating a table. The new values came from another table. He was surprised that PL/SQL using bulk processing came out faster than the pure SQL approach.

The query in question was in the form:

update table1
set col1 = (select col2 from table2 where t1.code = t2.code);

It turned out the reason was due to “missing” indexes. Oracle executes the subquery once for every row in table1. Unless there’s an index on table2 (code), this will full scan table2 once for every row in table1!

The PL/SQL only approach avoided this problem by reading the whole of table2 into an array. So there was only one full scan of table2.
 

The problem here is there was no index on the join condition (t1.code = t2.code). With this in place Oracle does an index lookup of table2 for each row in table1. A massive performance improvement!
 

The moral being if your SQL is “slow”, particularly in compared to a combined SQL + other language method, it’s likely you have a missing index (or two).

This question again showed the strength and value of the Oracle community. Shortly after I posted the explanation, a reviewer was quick to point out the following SQL solution:

merge into table1
using  table2
on   (t1.code = t2.code)
when matched
  then update set t1.col = t2.col;

This came out significantly faster than both the original update and PL/SQL – without needing any extra indexes!

You’re running a YouTube channel called “The Magic of SQL”. Are SQL developers magicians?

Of course they are! In fact, I’d say that all developers are magicians. As Arthur C Clarke said:

“Any sufficiently advanced technology is indistinguishable from magic”

The amount of computing power you carry around in your phone today is mind blowing. Just ask your grandparents!

I think SQL developers have a special kind of magic though :). The ability to answer hard questions with a few lines of SQL is amazing. And for it to adapt to changes in the underlying data to give great performance without you changing it is astounding.

Your Twitter account has a pinned tweet about window functions. I frequently talk to Java developers at conferences, and few of them know about window functions, even if they’ve been in databases like Oracle for a very long time. Why do you think they’re still so “obscure”?

Oracle Database has had window functions has had them since the nineties. But many other RDBMSes have only fully supported them recently. So a combination of writing “database independent” code and people using other databases is certainly a factor.

Use of tools which hide SQL from developers is also a problem. If you’re not actively using SQL, it’s easy to overlook many of its features.

Fortunately I think this is changing. As more and more developers are realizing, SQL is a powerful language. Learning how to use it effectively is a key skill for all developers. Window functions and other SQL features mean you can get write better performing applications with less code. Who doesn’t want that? ;)

What are three things that every developer should know about SQL?

1. Understand set based processing

If you find yourself writing a cursor loop (select … from … loop), and inside that loop you run more SQL, you’re doing it wrong.

Think about it. Do you eat your cornflakes by placing one flake in your bowl, adding the milk, and eating that one piece? Then doing the same for the next. And the next. And so on? Or do you pour yourself a big bowl and eat all the flakes at once?

If you have a cursor loop with more SQL within the loop, you’re effectively doing this. There’s a lot of overhead in executing each SQL statement. This will slow you down if you have a large number of statements that each process one row. Instead you want few statements that process lots of rows where possible.

It’s also possible to do this by accident. As developers we’re taught that code reuse is A Good Thing. So if there’s an API available we’ll often use it. For example, say you’re building a batch process. This finds the unshipped orders, places them on shipments and marks them as sent.

If a ship_order function exists, you could write something like:

select order_id from unshipped_orders loop
  ship_order ( order_id );
end loop;

The problem here is ship_order almost certainly contains SQL. SQL you’ll be executing once for every order awaiting postage. If it’s only a few this may be fine. But if there’s hundreds or thousands this process could take a long time to run.

The way to make this faster is to process all the orders in one go. You can do this with SQL like:

insert into shipments
  select … from unshipped_orders;

update unshipped_orders
set  shipment_date = sysdate;

You may counter there’s other, non-SQL, processing you need to do such as sending emails. So you still need a query to find the order ids.

But you can overcome this! With update’s returning clause, you can get values from all the changed rows:

update unshipped_orders
set  shipment_date = sysdate
returning order_id bulk collect into order_array;

This gives you all the order ids to use as you need.

2. Learn what an execution plan is and how to create and read one

“How can I make my SQL faster” is one of the most common classes of questions posted on Ask Tom. The trouble is there’s scant one-size-fits-all advice when it comes to SQL performance. To help we need to know what your query is, what the tables and indexes are and details about the data. But most importantly we need to know what the query is actually doing!

For example, say you want me to help you figure out a faster route to work. To do this I need to know which route you currently use and how long each part of it takes. We can then compare this against other routes, checking how far they are, expected traffic and predicted speeds. But we need the data to do this!

So when answering performance questions, the first thing we look for is an execution plan. Many confuse this with an explain plan. An explain plan is just a prediction. Often it’s wrong. And even when it’s right, we still don’t know how much work each step does.

An execution plan shows exactly what the database did. It also gives stats about how much work, how often and how long it took to process each step. Combine this with a basic understanding of indexes and join methods and you can often solve your own performance problems.

3. Use bind variables

Sadly data breaches are all too common. There hardly seems to be a week that goes by without news of a major company leaking sensitive data. And the root cause of these attacks is often SQL injection.

This is a simple, well known attack vector. If you write vulnerable SQL on a web enabled application, eventually you’ll be attacked.

And this isn’t something you can avoid by using NoSQL databases. SQL injection like attacks are possible there too!

Fortunately the solution is easy: use bind variables. Not only do these secure your application, they can improve performance too.

Make sure your company is not tomorrow’s data leak headline. Use bind variables!

Last but not least: When will Oracle have a BOOLEAN type? :)

We have a BOOLEAN type! It’s just only in PL/SQL ;P

There’s currently a push in the community to for us to add a SQL BOOLEAN type. If this is a feature you’d like to see, you can vote for it on the Database Ideas forum. The more support there is, the more likely we are to implement it! But no promises ;)

Three-State Booleans in Java

Every now and then, I miss SQL’s three-valued BOOLEAN semantics in Java. In SQL, we have:

  • TRUE
  • FALSE
  • UNKNOWN (also known as NULL)

Every now and then, I find myself in a situation where I wish I could also express this UNKNOWN or UNINITIALISED semantics in Java, when plain true and false aren’t enough.

Implementing a ResultSetIterator

For instance, when implementing a ResultSetIterator for jOOλ, a simple library modelling SQL streams for Java 8:

SQL.stream(stmt, Unchecked.function(r ->
    new SQLGoodies.Schema(
        r.getString("FIELD_1"),
        r.getBoolean("FIELD_2")
    )
))
.forEach(System.out::println);

In order to implement a Java 8 Stream, we need to construct an Iterator, which we can then pass to the new Spliterators.spliteratorUnknownSize() method:

StreamSupport.stream(
  Spliterators.spliteratorUnknownSize(iterator, 0), 
  false
);

Another example for this can be seen here on Stack Overflow.

When implementing the Iterator interface, we must implement hasNext() and next(). Note that with Java 8, remove() now has a default implementation, so we don’t need to implement it any longer.

While most of the time, a call to next() is preceded by a call to hasNext() exactly once, nothing in the Iterator contract requires this. It is perfectly fine to write:

if (it.hasNext()) {
    // Some stuff

    // Double-check again to be sure
    if (it.hasNext() && it.hasNext()) {

        // Yes, we're paranoid
        if (it.hasNext())
            it.next();
    }
}

How to translate the Iterator calls to backing calls on the JDBC ResultSet? We need to call ResultSet.next().

We could make the following translation:

  • Iterator.hasNext() == !ResultSet.isLast()
  • Iterator.next() == ResultSet.next()

But that translation is:

  • Expensive
  • Not dealing correctly with empty ResultSets
  • Not implemented in all JDBC drivers (Support for the isLast method is optional for ResultSets with a result set type of TYPE_FORWARD_ONLY)

So, we’ll have to maintain a flag, internally, that tells us:

  • If we had already called ResultSet.next()
  • What the result of that call was

Instead of creating a second variable, why not just use a three-valued java.lang.Boolean. Here’s a possible implementation from jOOλ:

class ResultSetIterator<T> implements Iterator<T> {

    final Supplier<? extends ResultSet>  supplier;
    final Function<ResultSet, T>         rowFunction;
    final Consumer<? super SQLException> translator;

    /**
     * Whether the underlying {@link ResultSet} has
     * a next row. This boolean has three states:
     * <ul>
     * <li>null:  it's not known whether there 
     *            is a next row</li>
     * <li>true:  there is a next row, and it
     *            has been pre-fetched</li>
     * <li>false: there aren't any next rows</li>
     * </ul>
     */
    Boolean hasNext;
    ResultSet rs;

    ResultSetIterator(
        Supplier<? extends ResultSet> supplier, 
        Function<ResultSet, T> rowFunction, 
        Consumer<? super SQLException> translator
    ) {
        this.supplier = supplier;
        this.rowFunction = rowFunction;
        this.translator = translator;
    }

    private ResultSet rs() {
        return (rs == null) 
             ? (rs = supplier.get()) 
             :  rs;
    }

    @Override
    public boolean hasNext() {
        try {
            if (hasNext == null) {
                hasNext = rs().next();
            }

            return hasNext;
        }
        catch (SQLException e) {
            translator.accept(e);
            throw new IllegalStateException(e);
        }
    }

    @Override
    public T next() {
        try {
            if (hasNext == null) {
                rs().next();
            }

            return rowFunction.apply(rs());
        }
        catch (SQLException e) {
            translator.accept(e);
            throw new IllegalStateException(e);
        }
        finally {
            hasNext = null;
        }
    }
}

As you can see, the hasNext() method locally caches the hasNext three-valued boolean state only if it was null before. This means that calling hasNext() several times will have no effect until you call next(), which resets the hasNext cached state.

Both hasNext() and next() advance the ResultSet cursor if needed.

Readability?

Some of you may argue that this doesn’t help readability. They’d introduce a new variable like:

boolean hasNext;
boolean hasHasNextBeenCalled;

The trouble with this is the fact that you’re still implementing three-valued boolean state, but distributed to two variables, which are very hard to name in a way that is truly more readable than the actual java.lang.Boolean solution. Besides, there are actually four state values for two boolean variables, so there is a slight increase in the risk of bugs.

Every rule has its exception. Using null for the above semantics is a very good exception to the null-is-bad histeria that has been going on ever since the introduction of Option / Optional

In other words: Which approach is best? There’s no TRUE or FALSE answer, only UNKNOWN ;-)

Be careful with this

However, as we’ve discussed in a previous blog post, you should avoid returning null from API methods if possible. In this case, using null explicitly as a means to model state is fine because this model is encapsulated in our ResultSetIterator. But try to avoid leaking such state to the outside of your API.

The JDBC Boolean Compatibility List

Interestingly, boolean types have been introduced only late in the SQL standard, namely in SQL:1999. Even today, not all databases natively support BOOLEAN or BIT types. Most importantly, we can still wait for them in Oracle for a while. Here’s “Ask Tom”‘s point of view from 2002 on the subject:

https://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUESTION_ID:6263249199595

Oracle. Why U No Boolean? tweet this

People have worked around this limitation by using numeric or string literals instead. For instance 1 / 0, Y / N, T / F or the SQL standard 'true' / 'false'.

Booleans in JDBC

From a JDBC API perspective, boolean values can be set as bind values through PreparedStatement.setBoolean() or fetched from result sets through ResultSet.getBoolean(), and similar methods. If your database supports booleans, the Java boolean type nicely maps to SQL BOOLEAN – even if Java’s Boolean wrapper type would have been a better fit to respect NULLs.

But if you’re storing boolean values in INTEGER, CHAR(1) or VARCHAR(1) columns, things look differently in various databases. Consider the following example:

CREATE TABLE booleans (
  val char(1)
);

And then, run this Java program (we’re using jOOQ to keep things concise)

try {
    DSL.using(configuration)
       .execute(
       "insert into boolean (val) values (?)", true);
}
catch (Exception e) {
    e.printStackTrace();
}

DSL.using(configuration)
   .fetch("select * from booleans");

Not all databases / JDBC drivers support the above. These databases will run the above program:

  • Firebird (inserts ‘Y’ or ‘N’)
  • HSQLDB (inserts ‘1’ or ‘0’)
  • IBM DB2 (inserts ‘1’ or ‘0’)
  • MariaDB (inserts ‘1’ or ‘0’)
  • Microsoft Access (inserts ‘1’ or ‘0’)
  • MySQL (inserts ‘1’ or ‘0’)
  • Oracle (inserts ‘1’ or ‘0’)
  • SQL Server (inserts ‘1’ or ‘0’)
  • Sybase (inserts ‘1’ or ‘0’)

… whereas these databases will throw an exception:

  • CUBRID
  • Derby
  • H2
  • Ingres
  • PostgreSQL
  • SQLite

Booleans in the SQL standard

It is worth mentioning, that the SQL standard specifies how to deal with boolean to string conversion in the specification of the CAST() function:

6.13 <cast specification>
[...]
10) If TD is fixed-length character string, 
   then let LTD be the length in characters
   of TD.
[...]
e) If SD is boolean, then
Case:
i) If SV is True and LTD is not less than 4, 
   then TV is 'TRUE' extended on the right by
   LTD–4
s.
ii) If SV is False and LTD is not less than 5,
   then TV is 'FALSE' extended on the right by
   LTD–5 <space>s.
iii) Otherwise, an exception condition is 
   raised: data exception — invalid character
   value for cast.

So, most Open Source databases show what could be interpreted as the “correct” behaviour, even if from a historic perspective, 1/0 should be accepted behaviours. Beware of this limitation when using an Open Source test database!

For more information about this and the H2 database, please refer to this thread on the H2 user group.

SQL and booleans, some trivia

Some trivia about SQL and booleans:

SQL 1992 defines three values for a boolean:

<truth value> ::=
        TRUE
      | FALSE
      | UNKNOWN

True booleans aren’t always supported, though. Here’s the truth table of boolean support:

SQL Dialect Supports booleans
DB2 0 (use 1/0 instead)
Derby true (you can safely use true/false)
H2 true
HSQLDB true
Ingres true
MySQL true
Oracle 0
Postgres true
SQL Server 0
SQLite 0
Sybase ASE 0
Sybase SQL Anywhere 0

Trivia… but nice to know.