Creating Tables Dum and Dee in PostgreSQL


I was nerd-sniped:

So, tables dee and dum are two theoretical tables in SQL, and they can be characterised as follows:

[Dee] is the relation that has no attributes and a single tuple. It plays the role of True.

[Dum] is the relation that has no attributes and no tuples. It plays the role of False.

Quite academic? Sure. But the awesome PostgreSQL database can model these beasts! Check this out:

-- Creating the tables:
CREATE TABLE dum();
CREATE TABLE dee();
INSERT INTO dee DEFAULT VALUES;

-- Making sure the tables stay this way:
CREATE FUNCTION dum_trg ()
RETURNS trigger
AS $$
BEGIN
  RAISE EXCEPTION 'Dum must be empty';
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER dum_trg
BEFORE INSERT OR UPDATE OR DELETE OR TRUNCATE
ON dum
FOR EACH STATEMENT
EXECUTE PROCEDURE dum_trg();

CREATE FUNCTION dee_trg ()
RETURNS trigger
AS $$
BEGIN
  RAISE EXCEPTION 'Dee must keep one tuple';
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER dee_trg
BEFORE INSERT OR UPDATE OR DELETE OR TRUNCATE
ON dee
FOR EACH STATEMENT
EXECUTE PROCEDURE dee_trg();

And we’re done!
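A quick sanity check that the triggers do their job (a sketch; the exact error output depends on your SQL client, but the messages come from the trigger functions above):

-- Any attempt to modify either table is now rejected by the statement triggers
INSERT INTO dum DEFAULT VALUES;
-- ERROR:  Dum must be empty

DELETE FROM dee;
-- ERROR:  Dee must keep one tuple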

Check this out:

SELECT * FROM dum;

Nothing!

SELECT * FROM dee;

One row with no columns!

Just to be sure:

SELECT 'dum' AS t, count(*) 
FROM dum 
UNION ALL
SELECT 'dee' AS t, count(*) 
FROM dee;

And we’ll get, nicely:
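  t  | count
-----+-------
 dum |     0
 dee |     1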

Caveat

Note, it is worth mentioning that there are some flaws / “bugs” (in my opinion). This query:

SELECT DISTINCT * FROM dee;

… yields an error:

ERROR: SELECT DISTINCT must have at least one column
SQL state: 42601

I suspect the author(s) of the DISTINCT operation have overlooked a nice feature here. UNION, on the other hand, doesn’t work correctly either: it doesn’t remove duplicates (but at least it doesn’t complain):

SELECT * FROM dee
UNION
SELECT * FROM dee

This yields two rows (the duplicate is not removed), which is surprising, because when we nest the record with the following useful PostgreSQL-specific syntax, we’ll get a single nested empty row:

SELECT dee FROM dee
UNION
SELECT dee FROM dee

When we use EXCEPT or INTERSECT, however, we’re clearly running into a bug. Both of these queries return the same result:

SELECT * FROM dee
EXCEPT
SELECT * FROM dee

SELECT * FROM dee
INTERSECT
SELECT * FROM dee

Namely, the same result as the UNION produced.

In any case: Every database schema should have these. Much more powerful than Oracle’s DUAL table. With this, have a nice weekend!

Faster SQL Through Occasionally Choosing Natural Keys Over Surrogate Keys


There are many, many opinions out there regarding the old surrogate key vs. natural key debate. Most of the time, surrogate keys (e.g. sequence-generated IDs) win because they’re much easier to design:

  • They’re easy to keep consistent across a schema (e.g. every table has an ID column, and that’s always the primary key)
  • They’re thus a no-brainer to add. When you create a new table, you don’t need to worry about any candidate keys
  • They’re guaranteed to be unique, because they have absolutely no business value, only a technical value

Great. So why bother using natural keys in the first place?

Well, there is a very compelling reason!

Performance!

Whenever you introduce surrogate keys, this means that your key data becomes completely meaningless. From a design perspective, that’s not too bad. You can easily join that other table to get the interesting, meaningful information that hides behind the surrogate foreign key value. For example, in our Sakila database

… we have a typical many-to-many relationship modelled with a relationship table between the FILM table and the CATEGORY table – state-of-the-art normalisation. But check out this interesting thing:

  • The FILM_CATEGORY relationship table doesn’t contain any interesting information at all. Just the relationships
  • The category table only contains a single useful column: The NAME column
  • The remaining columns (CATEGORY_ID and LAST_UPDATE) have absolutely no meaning

With this in mind, we could design a much simpler schema in which we use the category name as a natural key. In fact, we don’t even need the CATEGORY table anymore and could remove it (that’s optional here; to ensure data correctness, we could keep it around, containing only the NAME column as the primary key). Check this out:
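A minimal sketch of what that alternative relationship table could look like (only the FILM_CATEGORY_NATURAL name and its FILM_ID / NAME columns are referenced later in this article; the data types are assumptions):

-- Sketch of the denormalised relationship table (types are assumptions)
CREATE TABLE film_category_natural (
  film_id  NUMBER(18)        NOT NULL REFERENCES film (film_id),
  name     VARCHAR2(25 CHAR) NOT NULL,
  PRIMARY KEY (film_id, name)
);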

Now, if we run a query like the following one against our entire Sakila schema:

SELECT c.name, count(*)
FROM film_actor fa
JOIN film_category fc USING (film_id)
JOIN category c USING (category_id)
WHERE actor_id = 1
GROUP BY c.name
ORDER BY count(*) DESC

The query finds all categories a given actor played in, and the number of films that the given actor played in each category. For instance, this could be the result:

NAME       COUNT(*)
-------------------
Horror     3
Classics   2
Family     2
New        2
Games      2
Animation  1
Sports     1
Children   1
...

With an alternative schema where the category NAME has been moved to a new FILM_CATEGORY_NATURAL table, we could run this much simpler query:

SELECT fc.name, count(*)
FROM film_actor fa 
JOIN film_category_natural fc 
  USING (film_id)
WHERE actor_id = 1
GROUP BY fc.name
ORDER BY count(*) DESC

Notice how we can omit an entire JOIN.

The execution plans (here on Oracle) are quite different. Check this out:

Before (cost: 8):

After (cost: 5):

Unfortunately, the cost difference (8 vs 5) cannot be taken as a tool to compare actual costs between the two queries/plans. But the plans are otherwise very similar, except that we’re simply missing one table access (CATEGORY) and an entire JOIN. That’s a significant improvement for something this simple. Imagine the improvement if we rolled out this kind of better query throughout the system.

We could look into more execution plan measurements (especially from the actual plan results), but what if we simply benchmark the two queries using the same silly benchmark, as always, repeating each statement 100 times:

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT NUMBER := 100;
BEGIN
  v_ts := SYSTIMESTAMP;
    
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT c.name, count(*)
      FROM film_actor fa
      JOIN film_category fc USING (film_id)
      JOIN category c USING (category_id)
      WHERE actor_id = 1
      GROUP BY c.name
      ORDER BY count(*) DESC
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
    
  dbms_output.put_line('Statement 1 : ' || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;
    
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT fc.name, count(*)
      FROM film_actor fa 
      JOIN film_category_natural fc 
        USING (film_id)
      WHERE actor_id = 1
      GROUP BY fc.name
      ORDER BY count(*) DESC
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
    
  dbms_output.put_line('Statement 2 : ' || (SYSTIMESTAMP - v_ts));
END;
/

The results are drastic:

Statement 1 : 00:00:00.122070000
Statement 2 : 00:00:00.051179000

A factor of 2.5x faster!

(Disclaimer: Benchmarks are an inaccurate way of measuring things because they suffer (or benefit) from “unfair” side-effects like heavy caching, but they’re a helpful tool to assess the order of magnitude of a difference between two queries).

The JOIN is completely unnecessary

The only reason why we joined the CATEGORY table each time is because we needed to display the meaningful business value of a category to the user, the CATEGORY.NAME. We could have avoided the JOIN if that was not a requirement, but displaying surrogate keys to the user (the CATEGORY_ID) would be rather harsh, wouldn’t it?

And we’re doing this all the time with all sorts of tables. We have:

  • Category tables (category name is a good candidate key)
  • Translation tables (label is a good candidate key)
  • Country tables (ISO 3166 country codes are a good candidate key)
  • Language tables (ISO 639 language codes are a good candidate key)
  • Currency tables (ISO 4217 currency codes are a good candidate key)
  • Stock symbol tables (ISIN security codes are a good candidate key)
  • … and many more

Working with natural keys can be quite cumbersome. But for some entities, the internationally standardised codes are really good candidate keys, and most of the time, they’re sufficient. What does the LANGUAGE_ID 47 even mean? It means nothing. An experienced DBA will remember, after a while, that it means “English”. But wouldn’t EN be a much better value?

You would have EN as a primary key AND foreign key value, so chances are, because everyone (including frontend developers who probably hard-code some translations anyway) knows what language EN is (but no one knows what 47 means), you will almost never need to join the language table again – except in those rare cases where you want to work with secondary language columns, such as, for instance, DESCRIPTION.
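As a sketch of what that could look like (these table and column definitions are illustrative, not the actual Sakila schema):

-- Sketch: ISO 639-1 code as the natural primary key (illustrative names)
CREATE TABLE language (
  code        CHAR(2)            NOT NULL PRIMARY KEY, -- e.g. 'EN'
  description VARCHAR2(100 CHAR)
);

CREATE TABLE film (
  film_id       NUMBER(18)         NOT NULL PRIMARY KEY,
  title         VARCHAR2(255 CHAR) NOT NULL,
  language_code CHAR(2)            NOT NULL REFERENCES language (code)
  -- ... remaining columns
);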

Now, imagine what happens if we search for English entries, or as in our previous example, for films of category “Classics”? Our entire JOIN graph would be simplified.

(Another place where we don’t need additional surrogate keys is the relationship table. In this particular case, there’s no such key anyway)

Caveats

Our category strings are quite short. If natural keys become longer, then the duplication itself can become a problem on a lower storage level, as you might need more pages and blocks to store the same amount of rows.

Please, do take this advice in this article with a grain of salt. The essence here is to not always follow strict rules that were established only as a good default. There are always tradeoffs!

Conclusion

This should be quite a straightforward refactoring for many applications. If your tables are extremely obvious picks for a natural key (like the above), then do use natural keys. Your queries will immediately be faster – not necessarily much faster, but probably you can speed up a significant number of queries, i.e. take load off your entire system. Plus: your database will be more user-friendly.

And all of this at the price of not using the identical table design everywhere. I mean – when was having an identical table design a real business case anyway, right?

Side-note

In some cases, you could take this even one step further and denormalise your schema by putting categories as arrays or XML or JSON data structures directly inside your films. You’ll lose the normalisation benefits, but you could further win in terms of performance. For more details, read this very interesting article (about PostgreSQL) here:

http://www.databasesoup.com/2015/01/tag-all-things.html

Many SQL Performance Problems Stem from “Unnecessary, Mandatory Work”


Probably the most impactful thing you could learn about when writing efficient SQL is indexing. A very close runner-up, however, is the fact that a lot of SQL clients demand tons of “unnecessary, mandatory work” from the database.

Repeat this after me:

Unnecessary, Mandatory Work

What is “unnecessary, mandatory work”? It’s two things (duh):

Unnecessary

Let’s assume your client application needs to display a simple list of film titles and their ratings. Nothing out of the ordinary. We run a movie database (e.g. the Sakila database) and we want to display the title and rating of each film to the user.

This is the query that would produce the above result:

SELECT title, rating
FROM film

However, our application (or our ORM) runs this query instead:

SELECT *
FROM film

What are we getting? Guess what. We’re getting tons of useless information: every remaining column of the FILM table. There’s even some complex JSON in there, which is loaded:

  • From the disk
  • Into the caches
  • Over the wire
  • Into the client memory
  • And then discarded

Yes, we discard most of this information. The work that was performed to retrieve it was completely unnecessary. Right? Agreed.

Mandatory

That’s the worse part. While optimisers have become quite smart these days, this work is mandatory for the database. There’s no way the database can know that the client application actually didn’t need 95% of the data. And that’s just a simple example. Imagine if we joined more tables…

So what, you think? Databases are fast? Let me offer you some insight you may not have thought of, before:

Memory consumption

Sure, the individual execution time doesn’t really change much. Perhaps, it’ll be 1.5x slower, but we can handle that right? For the sake of convenience? Sometimes that’s true. But if you’re sacrificing performance for convenience every time, things add up. We’re no longer talking about performance (speed of individual queries), but throughput (system response time), and that’s when stuff gets really hairy and tough to fix. When you stop being able to scale.

Let’s look at execution plans, Oracle this time:

--------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes |
--------------------------------------------------
|   0 | SELECT STATEMENT  |      |  1000 |   166K|
|   1 |  TABLE ACCESS FULL| FILM |  1000 |   166K|
--------------------------------------------------

Versus

--------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes |
--------------------------------------------------
|   0 | SELECT STATEMENT  |      |  1000 | 20000 |
|   1 |  TABLE ACCESS FULL| FILM |  1000 | 20000 |
--------------------------------------------------

We’re using 8x as much memory in the database when doing SELECT * rather than SELECT title, rating. That’s not really surprising though, is it? We knew that. Yet we accepted it in many, many of our queries where we simply didn’t need all that data. We generated needless, mandatory work for the database, and it does sum up. We’re using 8x too much memory (the number will differ, of course).

Now, all the other steps (disk I/O, wire transfer, client memory consumption) are also affected in the same way, but I’m skipping those. Instead, I’d like to look at…

Index usage

Most databases these days have figured out the concept of covering indexes. A covering index is not a special index per se. But it can turn into a “special index” for a given query, either “accidentally,” or by design.

Check out this query:

SELECT * 
FROM actor
WHERE last_name LIKE 'A%'

There’s no extraordinary thing to be seen in the execution plan. It’s a simple query. Index range scan, table access, done:

-------------------------------------------------------------------
| Id  | Operation                   | Name                | Rows  |
-------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                     |     8 |
|   1 |  TABLE ACCESS BY INDEX ROWID| ACTOR               |     8 |
|*  2 |   INDEX RANGE SCAN          | IDX_ACTOR_LAST_NAME |     8 |
-------------------------------------------------------------------

Is it a good plan, though? Well, if all we really needed were the actors’ first and last names, then it’s not.

Sure, we’re wasting memory etc. But check out this alternative query:

SELECT first_name, last_name 
FROM actor
WHERE last_name LIKE 'A%'

Its plan is this:

----------------------------------------------------
| Id  | Operation        | Name            | Rows  |
----------------------------------------------------
|   0 | SELECT STATEMENT |                 |     8 |
|*  1 |  INDEX RANGE SCAN| IDX_ACTOR_NAMES |     8 |
----------------------------------------------------

We could now eliminate the table access entirely, because there’s an index that covers all the needs of our query… a covering index. Does it matter? Absolutely! This approach can speed up some of your queries by an order of magnitude (or slow them down by an order of magnitude when your index stops being covering after a change).
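For reference, the two indexes involved in these plans look roughly like this (IDX_ACTOR_LAST_NAME ships with the Sakila schema; the two-column IDX_ACTOR_NAMES is an addition whose definition is assumed to match the plan above):

-- The single-column index from the Sakila schema
CREATE INDEX idx_actor_last_name ON actor (last_name);

-- A covering index for queries that only need the two name columns
CREATE INDEX idx_actor_names ON actor (last_name, first_name);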

You cannot always profit from covering indexes. Indexes come with their own cost and you shouldn’t add too many of them, but in cases like these, it’s a no-brainer. Let’s run a benchmark:

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT NUMBER := 100000;
BEGIN
  v_ts := SYSTIMESTAMP;
    
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      -- Worst query: Memory overhead AND table access
      SELECT *
      FROM actor
      WHERE last_name LIKE 'A%'
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
    
  dbms_output.put_line('Statement 1 : ' || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;
    
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      -- Better query: Still table access
      SELECT /*+INDEX(actor(last_name))*/ 
        first_name, last_name
      FROM actor
      WHERE last_name LIKE 'A%'
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
    
  dbms_output.put_line('Statement 2 : ' || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;
    
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      -- Best query: Covering index
      SELECT /*+INDEX(actor(last_name, first_name))*/ 
        first_name, last_name
      FROM actor
      WHERE last_name LIKE 'A%'
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
    
  dbms_output.put_line('Statement 3 : ' || (SYSTIMESTAMP - v_ts));
END;
/

The result is:

Statement 1 : +000000000 00:00:02.479000000
Statement 2 : +000000000 00:00:02.261000000
Statement 3 : +000000000 00:00:01.857000000

Note, the actor table only has 4 columns, so the difference between statements 1 and 2 is not too impressive, but still significant. Note also I’m using Oracle’s hints to force the optimiser to pick one or the other index for the query. Statement 3 clearly wins in this case. It’s a much better query, and that’s just an extremely simple query.

Again, when we write SELECT *, we create needless, mandatory work for the database, which it cannot optimise. It won’t pick the covering index because that index has a bit more overhead than the LAST_NAME index that it did pick, and after all, it had to go to the table anyway to fetch the useless LAST_UPDATE column, for instance.

But things get worse with SELECT *. Consider…

SQL transformations

Optimisers work so well, because they transform your SQL queries (watch my recent talk at Voxxed Days Zurich about how this works). For instance, there’s a SQL transformation called “JOIN elimination”, and it is really powerful. Consider this auxiliary view, which we wrote because we grew so incredibly tired of joining all these tables all the time:

CREATE VIEW v_customer AS
SELECT 
  c.first_name, c.last_name, 
  a.address, ci.city, co.country
FROM customer c
JOIN address a USING (address_id)
JOIN city ci USING (city_id)
JOIN country co USING (country_id)

This view just connects all the “to-one” relationships between a CUSTOMER and their different ADDRESS parts. Thanks, normalisation.

Now, after a while working with this view, imagine, we’ve become so accustomed to this view, we forgot all about the underlying tables. And now, we’re running this query:

SELECT *
FROM v_customer

We’re getting quite some impressive plan:

----------------------------------------------------------------
| Id  | Operation            | Name     | Rows  | Bytes | Cost |
----------------------------------------------------------------
|   0 | SELECT STATEMENT     |          |   599 | 47920 |   14 |
|*  1 |  HASH JOIN           |          |   599 | 47920 |   14 |
|   2 |   TABLE ACCESS FULL  | COUNTRY  |   109 |  1526 |    2 |
|*  3 |   HASH JOIN          |          |   599 | 39534 |   11 |
|   4 |    TABLE ACCESS FULL | CITY     |   600 | 10800 |    3 |
|*  5 |    HASH JOIN         |          |   599 | 28752 |    8 |
|   6 |     TABLE ACCESS FULL| CUSTOMER |   599 | 11381 |    4 |
|   7 |     TABLE ACCESS FULL| ADDRESS  |   603 | 17487 |    3 |
----------------------------------------------------------------

Well, of course. We run all these joins and full table scans, because that’s what we told the database to do. Fetch all this data.

Now, again, imagine what we really wanted on one particular screen was just the customers’ first and last names.

Yeah, duh, right? By now you get my point. But imagine, we’ve learned from the previous mistakes and we’re now actually running the following, better query:

SELECT first_name, last_name
FROM v_customer

Now, check this out!

------------------------------------------------------------------
| Id  | Operation          | Name        | Rows  | Bytes | Cost  |
------------------------------------------------------------------
|   0 | SELECT STATEMENT   |             |   599 | 16173 |     4 |
|   1 |  NESTED LOOPS      |             |   599 | 16173 |     4 |
|   2 |   TABLE ACCESS FULL| CUSTOMER    |   599 | 11381 |     4 |
|*  3 |   INDEX UNIQUE SCAN| SYS_C007120 |     1 |     8 |     0 |
------------------------------------------------------------------

That’s a drastic improvement in the execution plan. Our joins were eliminated, because the optimiser could prove they were needless, so once it can prove this (and you don’t make the work mandatory by selecting *), it can remove the work and simply not do it. Why is that the case?

Each CUSTOMER.ADDRESS_ID foreign key guarantees that there is exactly one ADDRESS.ADDRESS_ID primary key value, so the JOIN operation is guaranteed to be a to-one join which does not add rows nor remove rows. If we don’t even select rows or query rows, well, we don’t need to actually load the rows at all. Removing the JOIN provably won’t change the outcome of the query.
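The metadata that lets the optimiser prove this is, roughly, a NOT NULL column plus a foreign key constraint (a sketch; the Sakila schema already declares such constraints, and the constraint name below is made up):

-- The child column must not be nullable for an inner to-one join to be removable
ALTER TABLE customer MODIFY (address_id NOT NULL);

-- The foreign key guarantees that exactly one matching parent row exists
ALTER TABLE customer ADD CONSTRAINT fk_customer_address
  FOREIGN KEY (address_id) REFERENCES address (address_id);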

Databases do these things all the time. You can try this on most databases:

-- Oracle
SELECT CASE WHEN EXISTS (
  SELECT 1 / 0 FROM dual
) THEN 1 ELSE 0 END
FROM dual

-- More reasonable SQL dialects, e.g. PostgreSQL
SELECT EXISTS (SELECT 1 / 0)

In this case, you might expect an arithmetic exception to be raised, as when you run this query:

SELECT 1 / 0 FROM dual

yielding

ORA-01476: divisor is equal to zero

But it doesn’t happen. The optimiser (or even the parser) can prove that any SELECT column expression in an EXISTS (SELECT ..) predicate will not change the outcome of the query, so there’s no need to evaluate it. Huh!

Meanwhile…

One of the most unfortunate problems with ORMs is the fact that they make SELECT * queries so easy to write. In fact, HQL / JPQL went as far as making it the default. You can even omit the SELECT clause entirely, because after all, you’re going to be fetching the entire entity, as declared, right?

For instance:

FROM v_customer

Vlad Mihalcea, for instance, a Hibernate expert and Hibernate Developer Advocate, recommends using queries almost every time you’re sure you don’t want to persist any modifications after fetching. ORMs make it easy to solve the object graph persistence problem. Note: Persistence. The idea of actually modifying the object graph and persisting the modifications is inherent.

But if you don’t intend to do that, why bother fetching the entity? Why not write a query? Let’s be very clear: From a performance perspective, writing a query tailored to the exact use-case you’re solving is always going to outperform any other option. You may not care because your data set is small and it doesn’t matter. Fine. But eventually, you’ll need to scale and re-designing your applications to favour a query language over imperative entity graph traversal will be quite hard. You’ll have other things to do.

Counting for existence

One of the worst wastes of resources is when people run COUNT(*) queries when they simply want to check for existence. E.g.

Did this user have any orders at all?

And we’ll run:

SELECT count(*)
FROM orders
WHERE user_id = :user_id

Easy. If COUNT = 0: No orders. Otherwise: Yes, orders.

The performance will not be horrible, because we probably have an index on the ORDERS.USER_ID column. But what do you think will be the performance of the above compared to this alternative here:

-- Oracle
SELECT CASE WHEN EXISTS (
  SELECT *
  FROM orders
  WHERE user_id = :user_id
) THEN 1 ELSE 0 END
FROM dual

-- Reasonable SQL dialects, like PostgreSQL
SELECT EXISTS (
  SELECT *
  FROM orders
  WHERE user_id = :user_id
)

It doesn’t take a rocket scientist to figure out that an actual existence predicate can stop looking for additional rows as soon as it has found one. So, if the answer is “no orders”, the speed will be comparable. If, however, the answer is “yes, orders”, then the EXISTS version might be drastically faster than the one that calculates the exact count.

Because we don’t care about the exact count. Yet, we told the database to calculate it (needless), and the database doesn’t know we’re discarding all results bigger than 1 (mandatory).

Of course, things get much worse if you call list.size() on a JPA-backed collection to do the same…

I’ve blogged about this recently, and benchmarked the alternatives on different databases. Do check it out.

Conclusion

This article stated the “obvious”. Don’t tell the database to perform needless, mandatory work.

It’s needless because given your requirements, you knew that some specific piece of work did not need to be done. Yet, you tell the database to do it.

It’s mandatory because the database has no way to prove it’s needless. This information is contained only in the client, which is inaccessible to the server. So, the database has to do it.

This article talked about SELECT *, mostly, because that’s such an easy target. But this isn’t about databases only. This is about any distributed algorithm where a client instructs a server to perform needless, mandatory work. How many N+1 problems does your average AngularJS application have, where the UI loops over service result A, calling service B many times, instead of batching all calls to B into a single call? It’s a recurrent pattern.

The solution is always the same. The more information you give to the entity executing your command, the faster it can (in principle) execute such command. Write a better query. Every time. Your entire system will thank you for it.

If you liked this article…

… do also check out my recent talk at Voxxed Days Zurich, where I show some hyperbolic examples of why SQL will beat Java at data processing algorithms every time:

Impress Your Coworkers With the Incredible NATURAL FULL OUTER JOIN!


There are only very few real-world use-cases for FULL [ OUTER ] JOIN, but maybe you have run into this beast in the past. And when was the last time you saw a NATURAL JOIN? Right.

A quick reminder from our article about JOINs:

FULL JOIN

A FULL JOIN is a type of OUTER JOIN that retains data from both sides of the JOIN operation, regardless if the JOIN predicate matches or not. For example, querying the Sakila database:

SELECT first_name, last_name, title
FROM actor
FULL JOIN film_actor USING (actor_id)
FULL JOIN film USING (film_id)

The result will look something like this:

first_name   last_name   title
--[ LEFT JOIN ] -------------------------
Jon          Doe                          <-- Actors that didn't play in any film
Claire       Miller                      
--[ INNER JOIN ] ------------------------
Samuel       Jackson     Django Unchained <-- Actors played in many films
Samuel       Jackson     Pulp Fiction
John         Travolta    Pulp Fiction     <-- Films featured several actors
--[ RIGHT JOIN ]-------------------------
                         Meh              <-- Films that don't have actors

Note the special “markers” that delimit the parts that would have also been contributed by LEFT JOIN, INNER JOIN, RIGHT JOIN. In a way, a FULL JOIN combines them all.

Handy, occasionally

NATURAL JOIN

This one is less obviously useful. It looks useful at first, because if we carefully design our schemas as in the Sakila database used above, then we can simplify our joins as such:

SELECT first_name, last_name, title
FROM actor
NATURAL JOIN film_actor
NATURAL JOIN film

Which is just syntax sugar for this:

SELECT first_name, last_name, title
FROM actor
JOIN film_actor USING (<all common columns>)
JOIN film USING (<all common columns>)

So, a NATURAL JOIN will join two tables using all the columns they have in common. You’d think that this would be the useful outcome:

SELECT first_name, last_name, title
FROM actor
JOIN film_actor USING (actor_id)
JOIN film USING (film_id)

Unfortunately, the outcome looks more like this:

SELECT first_name, last_name, title
FROM actor
JOIN film_actor USING (actor_id, last_update)
JOIN film USING (film_id, last_update)

In other words: It makes no sense at all. In real-world situations (and the Sakila database is already close enough to the real world), we’ll always have accidentally conflicting column names that will be considered for the NATURAL JOIN when they should not be.

OK, so why NATURAL FULL JOIN?

… you’re asking. One is very rarely needed, and the other is pretty useless in the real world. Not so fast!

I recently helped a customer who had two very similar tables that both contained payments. Let’s call them: PAYMENT and STANDING_ORDER, the latter being a PAYMENT template which generates actual PAYMENTs periodically.

Now, there was some PL/SQL logic that operated on payments, which after some change request, needed to operate both on payments and/or on standing orders. If this was Java, we’d just (ab)use subtype polymorphism for the quick win, or factor out common logic into common component types for the thorough solution. For instance, these kinds of columns certainly appear in both cases:

“System” columns

  • ID – The surrogate key
  • CREATED_AT – Audit info
  • MODIFIED_AT – Audit info
  • DELETED – Soft deletion

“Business” columns

  • AMOUNT
  • CURRENCY

There are many more common columns. There are other columns that are specific to one or the other type:

Payment only

  • STANDING_ORDER_ID – Only payments reference the standing order that may have created them.

Standing order only

  • PERIODICITY – This is a standing order’s core feature.
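To make that concrete, here is a minimal sketch of the two tables built only from the columns listed above (all data types, defaults, and constraint details are assumptions):

CREATE TABLE standing_order (
  id          NUMBER(18)        NOT NULL PRIMARY KEY,
  created_at  TIMESTAMP         NOT NULL,
  modified_at TIMESTAMP         NOT NULL,
  deleted     NUMBER(1)         DEFAULT 0 NOT NULL,
  amount      NUMBER(18, 2)     NOT NULL,
  currency    CHAR(3)           NOT NULL,
  periodicity VARCHAR2(20 CHAR) NOT NULL
);

CREATE TABLE payment (
  id                NUMBER(18)    NOT NULL PRIMARY KEY,
  created_at        TIMESTAMP     NOT NULL,
  modified_at       TIMESTAMP     NOT NULL,
  deleted           NUMBER(1)     DEFAULT 0 NOT NULL,
  amount            NUMBER(18, 2) NOT NULL,
  currency          CHAR(3)       NOT NULL,
  standing_order_id NUMBER(18)    REFERENCES standing_order (id)
);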

Now, in order to refactor that PL/SQL logic, I needed a row type that combines all these columns easily. I didn’t want to manually track the exact columns of both tables, I wanted this to always work, even if the maintainers of the tables add/remove/rename columns.

NATURAL FULL JOIN to the rescue

I created a view like this:

CREATE VIEW like_a_payment AS
SELECT *
FROM payment
NATURAL FULL JOIN standing_order
WHERE 1 = 0

This view itself doesn’t do anything. It just sits there and produces the row type I need. Why? Because the NATURAL JOIN will join by ALL the common columns, namely ID, CREATED_AT, MODIFIED_AT, DELETED, AMOUNT, CURRENCY and then adds the columns from the left table, namely STANDING_ORDER_ID, and then from the right table, namely PERIODICITY.

I can now use this type in PL/SQL:

DECLARE
  paym like_a_payment%ROWTYPE;
BEGIN
  -- Do things...
END;

But why FULL?

True, I didn’t need FULL yet, because I’m not producing any rows anyway. Correct. But observe how I will populate the PAYM local variable in the PL/SQL block:

DECLARE
  l_paym_id like_a_payment.id%TYPE;
  l_paym    like_a_payment%ROWTYPE;
BEGIN
  l_paym_id := 1;

  SELECT *
  INTO l_paym
  FROM (
    SELECT * FROM payment WHERE id = l_paym_id
  )
  NATURAL FULL JOIN (
    SELECT * FROM standing_order WHERE id = l_paym_id
  );
END;

And I’m done! This could not have been done any more easily!

  • I’m using NATURAL because now I have to use it all the time to populate the record (or select individual columns, but meh)
  • I’m using FULL because only one of the two queries will produce a result, and the remaining columns will be NULL

Excellent!

Some observations

Of course, this is kind of a hack. Please don’t abuse this too often. Here’s some hints to follow:

  • The refactoring would have been substantial. I needed a quick win.
  • The logic actually needs almost all the columns from the payments. Otherwise, you might want to review that SELECT * usage.
  • The derived tables are necessary in Oracle for performance reasons.
  • The actual code contained two distinct queries, each one joining “one side” with an empty substitute query, as we know that the ID is either a payment or a standing order.

Having said so, now you know one of the very few real world use-cases of NATURAL FULL JOIN. Put this in your code base and impress your coworkers!

jOOQ Tuesdays: Brett Wooldridge Shows What it Takes to Write the Fastest Java Connection Pool


Welcome to the jOOQ Tuesdays series. In this series, we’ll publish an article on the third Tuesday every other month where we interview someone we find exciting in our industry from a jOOQ perspective. This includes people who work with SQL, Java, Open Source, and a variety of other related topics.


I’m very excited to feature today Brett Wooldridge, creator of HikariCP, the fastest connection pool available for Java.

Brett, you’ve created one of the most popular connection pools for Java: HikariCP. What made your library so popular?

I’ll provide some backstory on HikariCP before I answer that, but I’ll tease the answer by saying “marketing”.

A few years ago I was creating a product prototype for the company I work for, and I needed a connection pool. Like most developers I just wanted to drop in a pool and move on, so I took to the web to find the most popular and actively maintained library. Unfortunately, while load testing the prototype we started encountering deadlocks, and exceptions indicating connection state bleed over between threads.

Because the pool was open source, I thought I’d just pull down the code, find and fix the problems, and contribute back. But when I opened the code, I found thousands of lines more code than I was expecting.  Added to the mix were many locks, nested, sometimes acquired in one method and released in some distant place. There was simply no way to reason about where potential deadlocks lurked, even if we found and fixed the ones we encountered.

I picked up another pool and inspected its code. The lock semantics were clearer, but the volume of code was still more than 2x what I expected, especially given that it was delegating the core pooling logic to a separate library.

In addition, all of the pools I studied violated JDBC contracts in multiple ways. In as much as it is possible, a pool should return a Connection that is indistinguishable from one received in the absence of the pool. But these pools didn’t automatically close Statements when a connection was “closed” (returned), or clear warnings, or rollback uncommitted transactions, and they didn’t reset properties altered by the user such as auto-commit or transaction isolation level, and more; resulting in the next consumer getting a “dirty” connection.

I thought, “Really? This is the state of connection pools in the ecosystem after 20 years of Java?” Out of necessity and frustration, I created HikariCP.

To be fair, since I started HikariCP some pools have made some of these “correctness” behaviors configurable, but none of them do so by default and I suspect most users are running with the safety off.  At least two popular pools fail to complete our benchmark with OutOfMemory exceptions when they are enabled.  Conversely, HikariCP doesn’t support an unsafe mode of operation.

Returning to your question, as noted above there were many established pools available, so how did HikariCP become popular?  “Correctness” and reliability are a tough sell, so I focused on promoting performance, and started with a simple tweet. One follower led to another.  Some users tweeted about big performance gains, and improved reliability, and at some point in 2015 the Wix engineering team wrote a blog about switching to HikariCP.

In essence, simple word of mouth has led to HikariCP’s rising popularity, with an initial “marketing” push based on performance.  I do hope that over time more users will give equal weight to correctness and reliability, without which performance is meaningless, and for my part I plan to write more about those aspects of HikariCP.  

You quoted Edsger Dijkstra: “Simplicity is prerequisite for reliability.” – That reminds me of Antoine de Saint-Exupery’s “Perfection is Achieved Not When There Is Nothing More to Add, But When There Is Nothing Left to Take Away”. How do you manage to keep things simple when this world only ever gets more complicated?

Resisting complexity through feature-creep can be challenging.  I get a lot of requests for this or that feature, and while each may be simple in and of itself, if taken in totality would significantly increase complexity and code size.  Of course, that is not to say that I don’t add features.

For example, initial versions of HikariCP only supported a fixed size pool.  HikariCP was designed for systems with fairly constant load, and in that environment pools tend to stay at their maximum size, so I saw little need to complicate the code to support dynamic sizing.  Can you imagine a server at Google falling idle for several minutes?  Additionally, I feel like the more axes of configuration there are, the more difficult it is for users to optimally configure a pool.  However, eventually there were enough users who needed dynamic sizing, and its absence was a barrier to adoption, so support was added.  Principally, I did not want lack of dynamic sizing support to deprive users of the reliability and correctness benefits of HikariCP.

Still, I probably reject the vast majority of feature requests. As the custodian of HikariCP keeping it simple and true to that core philosophy is in the best interest of the community.  I always try to minimize the “surface area”, both in terms of code and configuration.  The larger the surface area of an API, the more difficult it is to comprehend.  Our brains have a limit for the amount of contextual information that can be held “in view” at one time; this is true in a lot of contexts.  For example, when reading code, methods larger than a certain size, or conditionals of more than a certain number of terms, are difficult to follow or reason about.  Generally, for users of HikariCP, the “surface area” is manifest in the number of configuration parameters.  While I can hardly say that “Perfection [has been] achieved”, I do feel like there is not much left to take away without cutting into functionality.

Few libraries go to the byte code level to optimise their code. While this helps in benchmarks, did it also help your users in production? What were the biggest caveats you found while micro-optimising?

Definitely.  Maybe some developers are dismissive of the potential gains, because in their minds they think, “What does it matter if connection acquisition takes 100ns or 100μs, the query is going to take 10ms anyway?”  However, pools intercept dozens of methods, and the “close()” path is typically slower than acquisition, so it’s not that simple.  I often get reports from users providing confirmation of real world performance improvements.  It’s anecdotal but one user initially commented in a bug report, “We’re testing HikariCP at the client and have had great initial success – an application loading 1 million records over multiple HTTP threads and putting them in the DB had its run time cut by 70% after moving from Tomcat CP to HikariCP!”  The follow-up comment on the bug was, “This was a bug in our side, using some unrelated non-threadsafe code.  No issue.  After fixing the bug, the code runs about 2x faster using HikariCP than Tomcat CP.”  That’s pretty good; and yet some reports surprise even me.

Regarding optimisation, and as long as we’re quoting famous thinkers, I would be remiss if I didn’t cite Knuth: “We should forget about small efficiencies, say about 97% of the time: premature optimisation is the root of all evil.”  I think the key word here is “premature”.  It is definitely better to write the code as it naturally comes and then, based on detailed profiling and benchmarking, perform “peephole optimisations” (to hijack a word from compiler theory).  At the same time, I would estimate that half of the performance gains in HikariCP have come as the result of algorithmic changes, rather than low-level optimisations.

Regarding caveats to micro-optimising, it would be hard to convey how much I have learned, and am still learning.  I’d like to give a shout-out to Aleksey Shipilëv for his excellent JMH micro-benchmark framework.  Aleksey has become somewhat of a JVM performance oracle (no pun intended, he used to work for Oracle).  The JVM performs an amazing array of optimisations, and if one is not careful then what appears to be a clever optimisation in the code simply confuses the JIT’s pattern-based optimiser and the result is slower rather than faster.

In order to effectively optimise on the JVM you sometimes end up reading the JIT source code, and you must become familiar with concepts such as dead code elimination, loop invariant hoisting, constant propagation, virtual call inlining, and many more.  Even with a good grip on these concepts I am sometimes surprised by the JVM in my attempts at optimisation.  In addition to the JIT, you really must understand the Java Memory Model (JMM) and how it maps onto CPU architectures like x86.

Lastly, after the design of algorithms, contention for shared state is the source of most bottlenecks (see the aforementioned JMM), so recently the biggest gains (for example, in v2.6.0) have come from tricks that simply avoid it; the fastest code is code that is never executed.

If there is a main takeaway, it is “trust the benchmarks”, your assumptions and intuitions are wrong more often than you imagine.

Your fellow jOOQ Tuesdays interviewee Vlad Mihalcea talked to us about queueing theory. How does this compare to what you wrote about connection pool sizing?

I have great respect for Vlad, I think we’re both members of the Mutual Admiration Society.  His FlexyPool is trying to solve a difficult problem; that being how to automatically tune optimal pool settings for varying loads.  Ultimately, the upper-bound is constrained by the database’s optimal concurrent query capacity, which is where my write-up on pool sizing comes into play.  However, there is a large amount of configuration space in-between a minimally sized pool and that upper-bound, which is where FlexyPool is trying to add value, by ensuring that the pool is “right sized”, dynamically, for the load it is servicing.

I say it is a difficult problem, because connection pools on modern multi-core servers likely present as a M/G/k queue in queueing theory; arrivals have a Markovian distribution, service times have a General distribution, and there are k servers (where “server” is defined as an abstract single-threaded processor).  Quoting wikipedia, “Most performance metrics for this queueing system are not known and remain an open problem.”  Modeling connection pools as a M/M/c queue might provide a decent approximation for the purposes of predicting queue lengths, but service times are not likely to have a Markovian distribution.  Of course, there are also non-Markovian stochastic models in queueing theory that could be applied.  Complicating everything is the fact that queued waiters (threads) can abandon the queue before service, for example when a timeout is reached.  That adds an additional twist when trying to predict queue lengths and wait times.  Hats off to Vlad for taking on this problem!

Anyway, what I wrote about setting the upper-bound on pool sizing translates to pinning the k (or c) value in those respective Markovian queueing theory models.

You chose a Japanese word in your product: 光 (Hikari, “Light”). What’s your connection to Japan?

I’ve lived and worked in Tokyo since 2008, though I think my Japanese is far behind where it should be given my time here.  I chalk that up to preferring time at the keyboard to language study.

As you mentioned, Hikari (pronounced Hi-ka-lee) translates to “Light” (as in sunlight).  In English, it is a double entendre in the context of HikariCP; though in Japanese it would not be.  “Light” in the sense of “the speed of…”, and “light” in the sense of being light in terms of code weight.

Beautiful SQL: Lateral Unnesting of Array Columns


Sometimes, SQL can just be so beautiful. One of the less mainstream features in SQL is the array type (or nested collections). In fact, it’s so not mainstream that only 2 major databases actually support it: Oracle and PostgreSQL (and HSQLDB and H2 in the Java ecosystem).

In PostgreSQL, you can write:

CREATE TABLE blogs (
  id    SERIAL NOT NULL PRIMARY KEY,
  title text   NOT NULL,
  tags  text[]
)

Or in Oracle:

-- Oracle only knows nominal array types, so we have to declare
-- them in advance
CREATE TYPE tag_t AS VARRAY(100) OF VARCHAR2(100 CHAR);

CREATE TABLE blogs (
  id    NUMBER(18) GENERATED BY DEFAULT AS IDENTITY 
                   NOT NULL PRIMARY KEY,
  title VARCHAR2(100 CHAR) NOT NULL,
  tags  tag_t
)

So, roughly the same thing. Now, let’s insert some data. How about the 3 most recent posts on the jOOQ blog, prior to this one:

In PostgreSQL:

INSERT INTO blogs (title, tags)
VALUES (
  'How to Fetch Oracle 12c Implicit Cursors with JDBC and jOOQ',
  ARRAY[
    'implicit cursor',
    'batch',
    'oracle',
    'jooq',
    'jdbc',
    'resultset'
  ]
), (
  'How to Execute SQL Batches With JDBC and jOOQ',
  ARRAY[
    'batch',
    'batch statement',
    'mysql',
    'jooq',
    'jdbc',
    'sql server',
    'sql'
  ]
), (
  'How to Emulate Partial Indexes in Oracle',
  ARRAY[
    'optimisation',
    'index',
    'partial index',
    'oracle',
    'sql',
    'postgresql',
    't-sql',
    'sql server'
  ]
)

Or in Oracle:

INSERT INTO blogs (title, tags)
VALUES (
  'How to Fetch Oracle 12c Implicit Cursors with JDBC and jOOQ',
  tag_t(
    'implicit cursor',
    'batch',
    'oracle',
    'jooq',
    'jdbc',
    'resultset'
  ));
INSERT INTO blogs (title, tags)
VALUES (
  'How to Execute SQL Batches With JDBC and jOOQ',
  tag_t(
    'batch',
    'batch statement',
    'mysql',
    'jooq',
    'jdbc',
    'sql server',
    'sql'
  ));
INSERT INTO blogs (title, tags)
VALUES (
  'How to Emulate Partial Indexes in Oracle',
  tag_t(
    'optimisation',
    'index',
    'partial index',
    'oracle',
    'sql',
    'postgresql',
    't-sql',
    'sql server'
  ));

Now, the array type by itself is not very useful. When it gets really interesting is when we unnest it again into a table. For instance in PostgreSQL:

SELECT title, tag
FROM blogs, LATERAL unnest(tags) AS tags(tag);

Or in Oracle:

-- Classic style
SELECT title, tags.*
FROM blogs, TABLE(tags) tags;

-- Since Oracle 12c
SELECT title, tags.*
FROM blogs, LATERAL (SELECT * FROM TABLE(tags)) tags;

Note that we’re using the keyword LATERAL in some of the above queries. For those of you who are used to T-SQL syntax, it’s almost the same thing as APPLY. Both LATERAL and APPLY are also very useful with table valued functions (stay tuned for a blog post on those).

The idea behind LATERAL is that the table (derived table, subquery, function call, array unnesting) on the right side of LATERAL can “laterally” access stuff from the left side of LATERAL in order to produce new tables. In the above query, we’re producing a new table of tags for each blog post, and then we cross join the two tables.
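To make the “access stuff from the left side” point concrete, the PostgreSQL query above can equivalently be written with an explicit derived table that references the outer row (a sketch producing the same rows):

SELECT b.title, t.tag
FROM blogs b
CROSS JOIN LATERAL (
  -- This derived table may reference b.tags from the left side of LATERAL
  SELECT unnest(b.tags) AS tag
) AS t;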

Here’s what the above queries result in:

title                                                         tag
-----------------------------------------------------------------------------
How to Fetch Oracle 12c Implicit Cursors with JDBC and jOOQ   implicit cursor
How to Fetch Oracle 12c Implicit Cursors with JDBC and jOOQ   batch
How to Fetch Oracle 12c Implicit Cursors with JDBC and jOOQ   oracle
How to Fetch Oracle 12c Implicit Cursors with JDBC and jOOQ   jooq
How to Fetch Oracle 12c Implicit Cursors with JDBC and jOOQ   jdbc
How to Fetch Oracle 12c Implicit Cursors with JDBC and jOOQ   resultset
How to Execute SQL Batches With JDBC and jOOQ                 batch
How to Execute SQL Batches With JDBC and jOOQ                 batch statement
How to Execute SQL Batches With JDBC and jOOQ                 mysql
How to Execute SQL Batches With JDBC and jOOQ                 jooq
How to Execute SQL Batches With JDBC and jOOQ                 jdbc
How to Execute SQL Batches With JDBC and jOOQ                 sql server
How to Execute SQL Batches With JDBC and jOOQ                 sql
How to Emulate Partial Indexes in Oracle                      optimisation
How to Emulate Partial Indexes in Oracle                      index
How to Emulate Partial Indexes in Oracle                      partial index
How to Emulate Partial Indexes in Oracle                      oracle
How to Emulate Partial Indexes in Oracle                      sql
How to Emulate Partial Indexes in Oracle                      postgresql
How to Emulate Partial Indexes in Oracle                      t-sql
How to Emulate Partial Indexes in Oracle                      sql server

You can immediately see the cross join semantics here, as we’re combining each tag (per post) with its post.

Looking for ordinals (i.e. the tag number inside of the array) along with the array? Easy!

Just add the powerful WITH ORDINALITY clause after the UNNEST() call in PostgreSQL:

SELECT title, tag, ordinal
FROM blogs, LATERAL unnest(tags) WITH ORDINALITY AS tags(tag, ordinal);

A bit more complicated to emulate in Oracle:

-- Fancy, with a window function
SELECT title, tags.*
FROM blogs, LATERAL (
  SELECT tags.*, ROW_NUMBER() OVER (ORDER BY NULL)
  FROM TABLE(tags) tags
) tags;

-- Classic, with ROWNUM
SELECT title, tags.*
FROM blogs, LATERAL (
  SELECT tags.*, ROWNUM
  FROM TABLE(tags) tags
) tags;

The result now contains the tag “ID”, i.e. the ordinal of the tag inside of the array:

title                                           tag               ordinal
-------------------------------------------------------------------------
How to Fetch ... Cursors with JDBC and jOOQ     implicit cursor   1
How to Fetch ... Cursors with JDBC and jOOQ     batch             2
How to Fetch ... Cursors with JDBC and jOOQ     oracle            3
How to Fetch ... Cursors with JDBC and jOOQ     jooq              4
How to Fetch ... Cursors with JDBC and jOOQ     jdbc              5
How to Fetch ... Cursors with JDBC and jOOQ     resultset         6
How to Execute SQL Batches With JDBC and jOOQ   batch             1
How to Execute SQL Batches With JDBC and jOOQ   batch statement   2
How to Execute SQL Batches With JDBC and jOOQ   mysql             3
How to Execute SQL Batches With JDBC and jOOQ   jooq              4
How to Execute SQL Batches With JDBC and jOOQ   jdbc              5
How to Execute SQL Batches With JDBC and jOOQ   sql server        6
How to Execute SQL Batches With JDBC and jOOQ   sql               7
How to Emulate Partial Indexes in Oracle        optimisation      1
How to Emulate Partial Indexes in Oracle        index             2
How to Emulate Partial Indexes in Oracle        partial index     3
How to Emulate Partial Indexes in Oracle        oracle            4
How to Emulate Partial Indexes in Oracle        sql               5
How to Emulate Partial Indexes in Oracle        postgresql        6
How to Emulate Partial Indexes in Oracle        t-sql             7
How to Emulate Partial Indexes in Oracle        sql server        8

Now, imagine looking for those blog posts that are tagged “jooq”. Easy!

PostgreSQL:

SELECT title
FROM blogs
WHERE 'jooq' = ANY(tags);

Oracle:

SELECT title
FROM blogs
WHERE 'jooq' IN (SELECT * FROM TABLE(tags));

Yielding:

title
-----------------------------------------------------------
How to Fetch Oracle 12c Implicit Cursors with JDBC and jOOQ
How to Execute SQL Batches With JDBC and jOOQ

Conclusion

These are just a few nice things we can do when we denormalise our data into nested collections / arrays, and then use features like UNNEST to bring them back to the table level. Both Oracle and PostgreSQL support a variety of really nice features building on top of arrays, so do check them out!

How to Execute SQL Batches With JDBC and jOOQ


Some databases (in particular MySQL and T-SQL databases like SQL Server and Sybase) support a very nice feature: They allow for running a “batch” of statements in a single statement. For instance, in SQL Server, you can do something like this:

-- Statement #1
DECLARE @table AS TABLE (id INT);

-- Statement #2
SELECT * FROM @table;

-- Statement #3
INSERT INTO @table VALUES (1),(2),(3);

-- Statement #4
SELECT * FROM @table;

This is a batch of 4 statements, and it can be executed as a single statement both with JDBC and with jOOQ. Let’s see how:

Executing a batch with JDBC

Unfortunately, the term “batch” has several meanings, and in this case, I don’t mean the JDBC Statement.addBatch() method, which is actually a bit clumsy as it doesn’t allow for fetching mixed update counts and result sets.

Instead, what I’ll be doing is this:

String sql =
    "\n  -- Statement #1                              "
  + "\n  DECLARE @table AS TABLE (id INT);            "
  + "\n                                               "
  + "\n  -- Statement #2                              "
  + "\n  SELECT * FROM @table;                        "
  + "\n                                               "
  + "\n  -- Statement #3                              "
  + "\n  INSERT INTO @table VALUES (1),(2),(3);       "
  + "\n                                               "
  + "\n  -- Statement #4                              "
  + "\n  SELECT * FROM @table;                        ";

try (PreparedStatement s = c.prepareStatement(sql)) {
    fetchLoop:
    for (int i = 0, updateCount = 0;; i++) {
        boolean result = (i == 0)
            ? s.execute()
            : s.getMoreResults();

        if (result)
            try (ResultSet rs = s.getResultSet()) {
                System.out.println("\nResult:");

                while (rs.next())
                    System.out.println("  " + rs.getInt(1));
            }
        else if ((updateCount = s.getUpdateCount()) != -1)
            System.out.println("\nUpdate Count: " + updateCount);
        else
            break fetchLoop;
    }
}

The output of the above program being:

Result:

Update Count: 3

Result:
  1
  2
  3

The above is a somewhat “hidden” – or at least not everyday – usage of the JDBC API. Mostly, you’ll be using Statement.executeQuery() when you’re expecting a ResultSet, or Statement.executeUpdate() otherwise.

But in our case, we don’t really know what’s happening. We’re going to discover the result types on the fly, when executing the statement. Here are the main JDBC API features that we’re using, along with an explanation:

  • Statement.execute(): This method should be used if we don’t know the result type. The method returns a boolean, which is true when the first statement in the batch produced a ResultSet and false otherwise.
  • Statement.getMoreResults(): This method returns the same kind of boolean value as the previous Statement.execute() method, but it does so for the next statement in the batch (i.e. for every statement except the first).
  • If the current result is a ResultSet (the boolean was true), then we’ll obtain that ResultSet through Statement.getResultSet() (we can obviously no longer call the usual Statement.executeQuery() to obtain the ResultSet).
  • If the current result is not a ResultSet (the boolean was false), then we’ll check the update count value through Statement.getUpdateCount().
  • If the update count is -1, then we’ve reached the end of the batch.

What a nice state machine!

The nice thing about this is that a batch may be completely nondeterministic. E.g. there may be triggers, T-SQL blocks (e.g. an IF statement), stored procedures, and many other things that contribute result sets and/or update counts. In some cases, we simply don’t know what we’ll get.
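For instance, a batch like the following (a T-SQL sketch against a hypothetical film table) yields either a result set or an update count, depending on the data at the time it runs:

-- Whether this batch produces a result set or an update count
-- depends on the table contents at run time
IF EXISTS (SELECT * FROM film WHERE rating = 'R')
BEGIN
    SELECT title FROM film WHERE rating = 'R';
END
ELSE
BEGIN
    UPDATE film SET rating = 'G' WHERE rating IS NULL;
END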

Executing a batch with jOOQ

It’s great that the JDBC API designers have thought of this exotic API usage on a rather low level. This makes JDBC extremely powerful. But who remembers the exact algorithm all the time? After all, the above minimalistic version required around 20 lines of code for something as simple as that.

Compare this to the following jOOQ API usage:

System.out.println(
    DSL.using(c).fetchMany(sql)
);

The result being:

Result set:
+----+
|  id|
+----+
Update count: 3
Result set:
+----+
|  id|
+----+
|   1|
|   2|
|   3|
+----+

Huh! Couldn’t get much simpler than that! Let’s walk through what happens:

The DSLContext.fetchMany() method is intended for use when users know there will be many result sets and/or update counts. Unlike JDBC, which reuses the ordinary JDBC API, jOOQ has a different API here to clearly distinguish between behaviours. The method then eagerly fetches all the results and update counts in one go (lazy fetching is on the roadmap with issue #4503).

The resulting type is org.jooq.Results, a type that extends List<Result>, which allows for iterating over the results only, for convenience. If a mix of results or update counts need to be consumed, the Results.resultsOrRows() method can be used.