Creating Tables Dum and Dee in PostgreSQL


I was nerd-sniped:

So, tables dee and dum are two theoretical tables in SQL, and they can be characterised as follows:

[Dee] is the relation that has no attributes and a single tuple. It plays the role of True.

[Dum] is the relation that has no attributes and no tuples. It plays the role of False.

Quite academic? Sure. But the awesome PostgreSQL database can model these beasts! Check this out:

-- Creating the tables:
CREATE TABLE dum();
CREATE TABLE dee();
INSERT INTO dee DEFAULT VALUES;

-- Making sure the tables stay this way:
CREATE FUNCTION dum_trg ()
RETURNS trigger
AS $$
BEGIN
  RAISE EXCEPTION 'Dum must be empty';
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER dum_trg
BEFORE INSERT OR UPDATE OR DELETE OR TRUNCATE
ON dum
FOR EACH STATEMENT
EXECUTE PROCEDURE dum_trg();

CREATE FUNCTION dee_trg ()
RETURNS trigger
AS $$
BEGIN
  RAISE EXCEPTION 'Dee must keep one tuple';
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER dee_trg
BEFORE INSERT OR UPDATE OR DELETE OR TRUNCATE
ON dee
FOR EACH STATEMENT
EXECUTE PROCEDURE dee_trg();

And we’re done!

Check this out:

SELECT * FROM dum;

Nothing!

SELECT * FROM dee;

One row with no columns!
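
Since Dee plays the role of True and Dum the role of False, we can see this (as a quick sketch) with EXISTS predicates:

SELECT EXISTS (SELECT * FROM dee); -- true
SELECT EXISTS (SELECT * FROM dum); -- false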

Just to be sure:

SELECT 'dum' AS t, count(*) 
FROM dum 
UNION ALL
SELECT 'dee' AS t, count(*) 
FROM dee;

And we’ll get, nicely:

t    count
----------
dum      0
dee      1

Caveat

Note, it is worth mentioning that there are some flaws / “bugs” (in my opinion). This query:

SELECT DISTINCT * FROM dee;

… yields an error:

ERROR: SELECT DISTINCT must have at least one column
SQL state: 42601

I suspect the author(s) of the DISTINCT operation have overlooked a nice feature here. UNION, on the other hand, doesn’t work correctly either. It doesn’t remove duplicates (but at least it doesn’t complain):

SELECT * FROM dee
UNION
SELECT * FROM dee

This yields two empty rows, i.e. the duplicate is not removed, which is surprising, because when we nest the record with the following useful PostgreSQL-specific syntax, we’ll get a single nested empty row:

SELECT dee FROM dee
UNION
SELECT dee FROM dee

When we use EXCEPT or INTERSECT, however, we’re clearly running into a bug. Both of these queries return the same result:

SELECT * FROM dee
EXCEPT
SELECT * FROM dee

SELECT * FROM dee
INTERSECT
SELECT * FROM dee

The same result as the UNION: two empty rows.

In any case: Every database schema should have these. Much more powerful than Oracle’s DUAL table. With this, have a nice weekend!

Faster SQL Through Occasionally Choosing Natural Keys Over Surrogate Keys


There are many many opinions out there regarding the old surrogate key vs. natural key debate. Most of the time, surrogate keys (e.g. sequence generated IDs) win because they’re much easier to design:

  • They’re easy to keep consistent across a schema (e.g. every table has an ID column, and that’s always the primary key)
  • They’re thus a no-brainer to add. When you create a new table, you don’t need to worry about any candidate keys
  • They’re guaranteed to be unique, because they have absolutely no business value, only a technical value

Great. So why bother using natural keys in the first place?

Well, there is a very compelling reason!

Performance!

Whenever you introduce surrogate keys, this means that your key data becomes completely meaningless. From a design perspective, that’s not too bad. You can easily join that other table to get the interesting, meaningful information that hides behind the surrogate foreign key value. For example, in our Sakila database

… we have a typical many-to-many relationship modelled with a relationship table between the FILM table and the CATEGORY table – state-of-the-art normalisation. But check out this interesting thing:

  • The FILM_CATEGORY relationship table doesn’t contain any interesting information at all. Just the relationships
  • The category table only contains a single useful column: The NAME column
  • The remaining columns (CATEGORY_ID and LAST_UPDATE) have absolutely no meaning

With this in mind, we could design a much simpler schema, where we use the category name as a natural key. In fact, we don’t even need the CATEGORY table anymore, so we can remove it (that’s optional here; to ensure data correctness, we could keep it around, containing only the NAME column as a primary key). Check this out:
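
As a rough sketch (the column types are assumptions; the table name is taken from the queries below), the new table might look like this:

-- Sketch: FILM_CATEGORY_NATURAL replaces both FILM_CATEGORY and
-- CATEGORY by storing the category name directly.
CREATE TABLE film_category_natural (
  film_id NUMBER(10)        NOT NULL REFERENCES film (film_id),
  name    VARCHAR2(25 CHAR) NOT NULL,
  CONSTRAINT pk_film_category_natural PRIMARY KEY (film_id, name)
);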

Now, if we run a query like the following one against our entire Sakila schema:

SELECT c.name, count(*)
FROM film_actor fa
JOIN film_category fc USING (film_id)
JOIN category c USING (category_id)
WHERE actor_id = 1
GROUP BY c.name
ORDER BY count(*) DESC

The query finds all categories a given actor played in, and the number of films that the given actor played in each category. For instance, this could be the result:

NAME       COUNT(*)
-------------------
Horror     3
Classics   2
Family     2
New        2
Games      2
Animation  1
Sports     1
Children   1
...

With an alternative schema where the category NAME has been moved to a new FILM_CATEGORY_NATURAL table, we could run this much simpler query:

SELECT fc.name, count(*)
FROM film_actor fa 
JOIN film_category_natural fc 
  USING (film_id)
WHERE actor_id = 1
GROUP BY fc.name
ORDER BY count(*) DESC

Notice how we can omit an entire JOIN.

The execution plans (here on Oracle) are quite different. Check this out:

Before:

After:

Unfortunately, the cost difference (8 vs 5) cannot be used to compare actual costs between the two queries/plans. But the plans are otherwise very similar, except that we’re simply missing one table access (CATEGORY) and an entire JOIN. That’s a significant improvement for something this simple. Imagine the improvement if we could roll out this kind of better query throughout the system!

We could look into more execution plan measurements (especially from the actual plan results), but what if we simply benchmark the two queries using the same silly benchmark, as always, repeating each statement 100 times:

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT NUMBER := 100;
BEGIN
  v_ts := SYSTIMESTAMP;
    
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT c.name, count(*)
      FROM film_actor fa
      JOIN film_category fc USING (film_id)
      JOIN category c USING (category_id)
      WHERE actor_id = 1
      GROUP BY c.name
      ORDER BY count(*) DESC
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
    
  dbms_output.put_line('Statement 1 : ' || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;
    
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT fc.name, count(*)
      FROM film_actor fa 
      JOIN film_category_natural fc 
        USING (film_id)
      WHERE actor_id = 1
      GROUP BY fc.name
      ORDER BY count(*) DESC
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
    
  dbms_output.put_line('Statement 2 : ' || (SYSTIMESTAMP - v_ts));
END;
/

The results are drastic:

Statement 1 : 00:00:00.122070000
Statement 2 : 00:00:00.051179000

A factor of roughly 2.4x faster!

(Disclaimer: Benchmarks are an inaccurate way of measuring things because they suffer (or benefit) from “unfair” side-effects like heavy caching, but they’re a helpful tool to assess the order of magnitude of a difference between two queries).

The JOIN is completely unnecessary

The only reason why we joined the CATEGORY table each time is because we needed to display the meaningful business value of a category to the user, the CATEGORY.NAME. We could have avoided the JOIN if that was not a requirement, but displaying surrogate keys to the user (the CATEGORY_ID) would be rather harsh, wouldn’t it?

And we’re doing this all the time with all sorts of tables. We have:

  • Category tables (category name is a good candidate key)
  • Translation tables (label is a good candidate key)
  • Country tables (ISO 3166 country codes are a good candidate key)
  • Language tables (ISO 639 language codes are a good candidate key)
  • Currency tables (ISO 4217 currency codes are a good candidate key)
  • Stock symbol tables (ISIN security codes are a good candidate key)
  • … and many more

Working with natural keys can be quite cumbersome. But for some entities, the internationally standardised codes are really good candidate keys, and most of the time, they’re sufficient. What does the LANGUAGE_ID 47 even mean? It means nothing. An experienced DBA will remember, after a while, that it means “English”. But wouldn’t EN be a much better value?

You would have EN as a primary key AND foreign key value, so chances are, because everyone (including frontend developers who probably hard-code some translations anyway) knows what language EN is (but no one knows what 47 means), you will almost never need to join the language table again – except in those rare cases where you want to work with secondary language columns, such as, for instance, DESCRIPTION.
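
To illustrate (a sketch; these table and column names are made up for the example), such a schema might look like this:

-- Sketch: the ISO 639-1 code is the primary key, so foreign key
-- values like 'EN' are self-explanatory without a join.
CREATE TABLE language (
  code        CHAR(2 CHAR)       NOT NULL PRIMARY KEY, -- 'EN', 'DE', ...
  description VARCHAR2(100 CHAR) NOT NULL
);

CREATE TABLE film_description (
  film_id       NUMBER(10)         NOT NULL,
  language_code CHAR(2 CHAR)       NOT NULL REFERENCES language (code),
  description   VARCHAR2(500 CHAR) NOT NULL,
  CONSTRAINT pk_film_description PRIMARY KEY (film_id, language_code)
);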

Now, imagine what happens if we search for English entries, or as in our previous example, for films of category “Classics”? Our entire JOIN graph would be simplified.

(Another place where we don’t need additional surrogate keys is the relationship table. In this particular case, there’s no such key anyway)

Caveats

Our category strings are quite short. If natural keys become longer, then the duplication itself can become a problem on a lower storage level, as you might need more pages and blocks to store the same amount of rows.

Please, do take this advice in this article with a grain of salt. The essence here is to not always follow strict rules that were established only as a good default. There are always tradeoffs!

Conclusion

This should be quite a straightforward refactoring for many applications. If your tables are extremely obvious picks for a natural key (like the above), then do use natural keys. Your queries will immediately be faster – not necessarily much faster, but probably you can speed up a significant number of queries, i.e. take load off your entire system. Plus: your database will be more user-friendly.

And all of this at the price of not using the identical table design everywhere. I mean – when was having an identical table design a real business case anyway, right?

Side-note

In some cases, you could take this even one step further and denormalise your schema by putting categories as arrays or XML or JSON data structures directly inside your films. You’ll lose the normalisation benefits, but you could further win in terms of performance. For more details, read this very interesting article (about PostgreSQL) here:

http://www.databasesoup.com/2015/01/tag-all-things.html

Many SQL Performance Problems Stem from “Unnecessary, Mandatory Work”


Probably the most impactful thing you could learn about when writing efficient SQL is indexing. A very close runner-up, however, is the fact that a lot of SQL clients demand tons of “unnecessary, mandatory work” from the database.

Repeat this after me:

Unnecessary, Mandatory Work

What is “unnecessary, mandatory work”? It’s two things (duh):

Unnecessary

Let’s assume your client application needs this information here:

Nothing out of the ordinary. We run a movie database (e.g. the Sakila database) and we want to display the title and rating of each film to the user.

This is the query that would produce the above result:

SELECT title, rating
FROM film

However, our application (or our ORM) runs this query instead:

SELECT *
FROM film

What are we getting? Guess what. We’re getting tons of useless information:

There’s even some complex JSON all the way to the right, which is loaded:

  • From the disk
  • Into the caches
  • Over the wire
  • Into the client memory
  • And then discarded

Yes, we discard most of this information. The work that was performed to retrieve it was completely unnecessary. Right? Agreed.

Mandatory

That’s the worst part. While optimisers have become quite smart these days, this work is mandatory for the database. There’s no way the database can know that the client application actually didn’t need 95% of the data. And that’s just a simple example. Imagine if we joined more tables…

So what, you think? Databases are fast? Let me offer you some insight you may not have thought of, before:

Memory consumption

Sure, the individual execution time doesn’t really change much. Perhaps, it’ll be 1.5x slower, but we can handle that right? For the sake of convenience? Sometimes that’s true. But if you’re sacrificing performance for convenience every time, things add up. We’re no longer talking about performance (speed of individual queries), but throughput (system response time), and that’s when stuff gets really hairy and tough to fix. When you stop being able to scale.

Let’s look at execution plans, Oracle this time:

--------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes |
--------------------------------------------------
|   0 | SELECT STATEMENT  |      |  1000 |   166K|
|   1 |  TABLE ACCESS FULL| FILM |  1000 |   166K|
--------------------------------------------------

Versus

--------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes |
--------------------------------------------------
|   0 | SELECT STATEMENT  |      |  1000 | 20000 |
|   1 |  TABLE ACCESS FULL| FILM |  1000 | 20000 |
--------------------------------------------------

We’re using 8x as much memory in the database when doing SELECT * rather than SELECT title, rating. That’s not really surprising though, is it? We knew that. Yet we accepted it in many many of our queries where we simply didn’t need all that data. We generated needless, mandatory work for the database, and it does sum up. We’re using 8x too much memory (the number will differ, of course).

Now, all the other steps (disk I/O, wire transfer, client memory consumption) are also affected in the same way, but I’m skipping those. Instead, I’d like to look at…

Index usage

Most databases these days have figured out the concept of covering indexes. A covering index is not a special index per se. But it can turn into a “special index” for a given query, either “accidentally,” or by design.
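
For reference, the two indexes appearing in the plans below might be defined as follows (a sketch: IDX_ACTOR_LAST_NAME ships with the Sakila database, while IDX_ACTOR_NAMES is assumed to have been added for this example):

-- An ordinary single-column index:
CREATE INDEX idx_actor_last_name ON actor (last_name);

-- A two-column index that happens to cover queries needing only
-- LAST_NAME and FIRST_NAME:
CREATE INDEX idx_actor_names ON actor (last_name, first_name);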

Check out this query:

SELECT * 
FROM actor
WHERE last_name LIKE 'A%'

There’s no extraordinary thing to be seen in the execution plan. It’s a simple query. Index range scan, table access, done:

-------------------------------------------------------------------
| Id  | Operation                   | Name                | Rows  |
-------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                     |     8 |
|   1 |  TABLE ACCESS BY INDEX ROWID| ACTOR               |     8 |
|*  2 |   INDEX RANGE SCAN          | IDX_ACTOR_LAST_NAME |     8 |
-------------------------------------------------------------------

Is it a good plan, though? Well, not if all we really needed was the actors’ first and last names.

Sure, we’re wasting memory etc. But check out this alternative query:

SELECT first_name, last_name 
FROM actor
WHERE last_name LIKE 'A%'

Its plan is this:

----------------------------------------------------
| Id  | Operation        | Name            | Rows  |
----------------------------------------------------
|   0 | SELECT STATEMENT |                 |     8 |
|*  1 |  INDEX RANGE SCAN| IDX_ACTOR_NAMES |     8 |
----------------------------------------------------

We could now eliminate the table access entirely, because there’s an index that covers all the needs of our query… a covering index. Does it matter? Absolutely! This approach can speed up some of your queries by an order of magnitude (or slow them down by an order of magnitude when your index stops being covering after a change).

You cannot always profit from covering indexes. Indexes come with their own cost and you shouldn’t add too many of them, but in cases like these, it’s a no-brainer. Let’s run a benchmark:

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT NUMBER := 100000;
BEGIN
  v_ts := SYSTIMESTAMP;
    
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      -- Worst query: Memory overhead AND table access
      SELECT *
      FROM actor
      WHERE last_name LIKE 'A%'
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
    
  dbms_output.put_line('Statement 1 : ' || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;
    
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      -- Better query: Still table access
      SELECT /*+INDEX(actor(last_name))*/ 
        first_name, last_name
      FROM actor
      WHERE last_name LIKE 'A%'
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
    
  dbms_output.put_line('Statement 2 : ' || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;
    
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      -- Best query: Covering index
      SELECT /*+INDEX(actor(last_name, first_name))*/ 
        first_name, last_name
      FROM actor
      WHERE last_name LIKE 'A%'
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
    
  dbms_output.put_line('Statement 3 : ' || (SYSTIMESTAMP - v_ts));
END;
/

The result is:

Statement 1 : +000000000 00:00:02.479000000
Statement 2 : +000000000 00:00:02.261000000
Statement 3 : +000000000 00:00:01.857000000

Note, the actor table only has 4 columns, so the difference between statements 1 and 2 is not too impressive, but still significant. Note also that I’m using Oracle’s hints to force the optimiser to pick one or the other index for the query. Statement 3 clearly wins in this case. It’s a much better query, and that’s just an extremely simple query.

Again, when we write SELECT *, we create needless, mandatory work for the database, which it cannot optimise. It won’t pick the covering index because that index has a bit more overhead than the LAST_NAME index that it did pick, and after all, it had to go to the table anyway to fetch the useless LAST_UPDATE column, for instance.

But things get worse with SELECT *. Consider…

SQL transformations

Optimisers work so well, because they transform your SQL queries (watch my recent talk at Voxxed Days Zurich about how this works). For instance, there’s a SQL transformation called “JOIN elimination”, and it is really powerful. Consider this auxiliary view, which we wrote because we grew so incredibly tired of joining all these tables all the time:

CREATE VIEW v_customer AS
SELECT 
  c.first_name, c.last_name, 
  a.address, ci.city, co.country
FROM customer c
JOIN address a USING (address_id)
JOIN city ci USING (city_id)
JOIN country co USING (country_id)

This view just connects all the “to-one” relationships between a CUSTOMER and their different ADDRESS parts. Thanks, normalisation.

Now, after a while working with this view, imagine, we’ve become so accustomed to this view, we forgot all about the underlying tables. And now, we’re running this query:

SELECT *
FROM v_customer

We’re getting quite an impressive plan:

----------------------------------------------------------------
| Id  | Operation            | Name     | Rows  | Bytes | Cost |
----------------------------------------------------------------
|   0 | SELECT STATEMENT     |          |   599 | 47920 |   14 |
|*  1 |  HASH JOIN           |          |   599 | 47920 |   14 |
|   2 |   TABLE ACCESS FULL  | COUNTRY  |   109 |  1526 |    2 |
|*  3 |   HASH JOIN          |          |   599 | 39534 |   11 |
|   4 |    TABLE ACCESS FULL | CITY     |   600 | 10800 |    3 |
|*  5 |    HASH JOIN         |          |   599 | 28752 |    8 |
|   6 |     TABLE ACCESS FULL| CUSTOMER |   599 | 11381 |    4 |
|   7 |     TABLE ACCESS FULL| ADDRESS  |   603 | 17487 |    3 |
----------------------------------------------------------------

Well, of course. We run all these joins and full table scans, because that’s what we told the database to do. Fetch all this data.

Now, again, imagine that all we really wanted on one particular screen was the customer’s first and last name.

Yeah, duh, right? By now you get my point. But imagine, we’ve learned from the previous mistakes and we’re now actually running the following, better query:

SELECT first_name, last_name
FROM v_customer

Now, check this out!

------------------------------------------------------------------
| Id  | Operation          | Name        | Rows  | Bytes | Cost  |
------------------------------------------------------------------
|   0 | SELECT STATEMENT   |             |   599 | 16173 |     4 |
|   1 |  NESTED LOOPS      |             |   599 | 16173 |     4 |
|   2 |   TABLE ACCESS FULL| CUSTOMER    |   599 | 11381 |     4 |
|*  3 |   INDEX UNIQUE SCAN| SYS_C007120 |     1 |     8 |     0 |
------------------------------------------------------------------

That’s a drastic improvement in the execution plan. Our joins were eliminated, because the optimiser could prove they were needless, so once it can prove this (and you don’t make the work mandatory by selecting *), it can remove the work and simply not do it. Why is that the case?

Each CUSTOMER.ADDRESS_ID foreign key guarantees that there is exactly one ADDRESS.ADDRESS_ID primary key value, so the JOIN operation is guaranteed to be a to-one join which neither adds nor removes rows. Since we don’t even select any columns from the joined tables, there’s no need to actually load their rows at all. Removing the JOIN provably won’t change the outcome of the query.
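
As a sketch, these are the kinds of constraints that make the proof possible (Sakila column names; the constraint name is made up):

-- The foreign key column is NOT NULL and references a primary key,
-- so every CUSTOMER row matches exactly one ADDRESS row:
ALTER TABLE customer MODIFY address_id NOT NULL;

ALTER TABLE customer ADD CONSTRAINT fk_customer_address
  FOREIGN KEY (address_id) REFERENCES address (address_id);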

Databases do these things all the time. You can try this on most databases:

-- Oracle
SELECT CASE WHEN EXISTS (
  SELECT 1 / 0 FROM dual
) THEN 1 ELSE 0 END
FROM dual

-- More reasonable SQL dialects, e.g. PostgreSQL
SELECT EXISTS (SELECT 1 / 0)

In this case, you might expect an arithmetic exception to be raised, as when you run this query:

SELECT 1 / 0 FROM dual

yielding

ORA-01476: divisor is equal to zero

But it doesn’t happen. The optimiser (or even the parser) can prove that any SELECT column expression in an EXISTS (SELECT ..) predicate will not change the outcome of the query, so there’s no need to evaluate it. Huh!

Meanwhile…

One of ORMs’ most unfortunate problems is the fact that they make SELECT * queries so easy to write. In fact, HQL / JPQL, for instance, went as far as making it the default. You can even omit the SELECT clause entirely, because after all, you’re going to be fetching the entire entity, as declared, right?

For instance:

FROM v_customer

Vlad Mihalcea, for instance, a Hibernate expert and Hibernate Developer Advocate, recommends using queries whenever you’re sure you don’t want to persist any modifications after fetching. ORMs make it easy to solve the object graph persistence problem. Note: persistence. The idea of actually modifying the object graph and persisting the modifications is inherent.

But if you don’t intend to do that, why bother fetching the entity? Why not write a query? Let’s be very clear: From a performance perspective, writing a query tailored to the exact use-case you’re solving is always going to outperform any other option. You may not care because your data set is small and it doesn’t matter. Fine. But eventually, you’ll need to scale and re-designing your applications to favour a query language over imperative entity graph traversal will be quite hard. You’ll have other things to do.

Counting for existence

One of the worst wastes of resources is running a COUNT(*) query when you simply want to check for existence. E.g.

Did this user have any orders at all?

And we’ll run:

SELECT count(*)
FROM orders
WHERE user_id = :user_id

Easy. If COUNT = 0: No orders. Otherwise: Yes, orders.

The performance will not be horrible, because we probably have an index on the ORDERS.USER_ID column. But what do you think will be the performance of the above compared to this alternative here:

-- Oracle
SELECT CASE WHEN EXISTS (
  SELECT *
  FROM orders
  WHERE user_id = :user_id
) THEN 1 ELSE 0 END
FROM dual

-- Reasonable SQL dialects, like PostgreSQL
SELECT EXISTS (
  SELECT *
  FROM orders
  WHERE user_id = :user_id
)

It doesn’t take a rocket scientist to figure out that an actual existence predicate can stop looking for additional rows as soon as it has found one. So, if the answer is “no orders”, then the speed will be comparable. If, however, the answer is “yes, orders”, then the EXISTS version can be drastically faster, because it doesn’t calculate the exact count.

Because we don’t care about the exact count. Yet, we told the database to calculate it (needless), and the database doesn’t know we’re discarding all results bigger than 1 (mandatory).
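
As an aside, if what you need is a row rather than a boolean, a similar short-circuiting effect can be had with the standard FETCH FIRST clause (a sketch; supported e.g. by Oracle 12c and PostgreSQL):

SELECT 1
FROM orders
WHERE user_id = :user_id
FETCH FIRST 1 ROW ONLY;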

Of course, things get much worse if you call list.size() on a JPA-backed collection to do the same…

I’ve blogged about this recently, and benchmarked the alternatives on different databases. Do check it out.

Conclusion

This article stated the “obvious”. Don’t tell the database to perform needless, mandatory work.

It’s needless because given your requirements, you knew that some specific piece of work did not need to be done. Yet, you tell the database to do it.

It’s mandatory because the database has no way to prove it’s needless. This information is contained only in the client, which is inaccessible to the server. So, the database has to do it.

This article talked about SELECT *, mostly, because that’s such an easy target. But this isn’t about databases only. This is about any distributed algorithm where a client instructs a server to perform needless, mandatory work. How many N+1 problems does your average AngularJS application have, where the UI loops over service result A, calling service B many times, instead of batching all calls to B into a single call? It’s a recurrent pattern.

The solution is always the same. The more information you give to the entity executing your command, the faster it can (in principle) execute such command. Write a better query. Every time. Your entire system will thank you for it.

If you liked this article…

… do also check out my recent talk at Voxxed Days Zurich, where I show some hyperbolic examples of why SQL will beat Java at data processing algorithms every time:

Impress Your Coworkers With the Incredible NATURAL FULL OUTER JOIN!


There are only very few real-world use-cases for FULL [ OUTER ] JOIN, but maybe you have run into this beast in the past. And when was the last time you saw a NATURAL JOIN? Right.

A quick reminder from our article about JOINs:

FULL JOIN

A FULL JOIN is a type of OUTER JOIN that retains data from both sides of the JOIN operation, regardless of whether the JOIN predicate matches. For example, querying the Sakila database:

SELECT first_name, last_name, title
FROM actor
FULL JOIN film_actor USING (actor_id)
FULL JOIN film USING (film_id)

The result will look something like this:

first_name   last_name   title
--[ LEFT JOIN ] -------------------------
Jon          Doe                          <-- Actors that didn't play in any film
Claire       Miller                      
--[ INNER JOIN ] ------------------------
Samuel       Jackson     Django Unchained <-- Actors played in many films
Samuel       Jackson     Pulp Fiction
John         Travolta    Pulp Fiction     <-- Films featured several actors
--[ RIGHT JOIN ]-------------------------
                         Meh              <-- Films that don't have actors

Note the special “markers” that delimit the parts that would have also been contributed by LEFT JOIN, INNER JOIN, RIGHT JOIN. In a way, a FULL JOIN combines them all.

Handy, occasionally

NATURAL JOIN

This one is less obviously useful. It looks useful at first, because if we carefully design our schemas as in the Sakila database used above, then we can simplify our joins as follows:

SELECT first_name, last_name, title
FROM actor
NATURAL JOIN film_actor
NATURAL JOIN film

Which is just syntactic sugar for this:

SELECT first_name, last_name, title
FROM actor
JOIN film_actor USING (<all common columns>)
JOIN film USING (<all common columns>)

So, a NATURAL JOIN will join two tables using all the columns they have in common. You’d think that this would be the useful outcome:

SELECT first_name, last_name, title
FROM actor
JOIN film_actor USING (actor_id)
JOIN film USING (film_id)

Unfortunately, the outcome looks more like this:

SELECT first_name, last_name, title
FROM actor
JOIN film_actor USING (actor_id, last_update)
JOIN film USING (film_id, last_update)

In other words: It makes no sense at all. In real-world situations (and the Sakila database is already close enough to the real world), we’ll always have accidentally conflicting column names that will be considered for a NATURAL JOIN when they should not be.
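
To see which columns a NATURAL JOIN would actually pick, we can list the common columns, e.g. with this sketch against the standard information schema (PostgreSQL syntax; in Sakila, ACTOR and FILM_ACTOR share both ACTOR_ID and the accidental LAST_UPDATE column):

SELECT column_name
FROM information_schema.columns
WHERE table_name = 'actor'
INTERSECT
SELECT column_name
FROM information_schema.columns
WHERE table_name = 'film_actor';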

OK, so why NATURAL FULL JOIN?

… you’re asking. One is very rarely needed, and the other is pretty useless in the real world. Not so fast!

I recently helped a customer who had two very similar tables that both contained payments. Let’s call them: PAYMENT and STANDING_ORDER, the latter being a PAYMENT template which generates actual PAYMENTs periodically.

Now, there was some PL/SQL logic that operated on payments, which after some change request, needed to operate both on payments and/or on standing orders. If this was Java, we’d just (ab)use subtype polymorphism for the quick win, or factor out common logic into common component types for the thorough solution. For instance, these kinds of columns certainly appear in both cases:

“System” columns

  • ID – The surrogate key
  • CREATED_AT – Audit info
  • MODIFIED_AT – Audit info
  • DELETED – Soft deletion

“Business” columns

  • AMOUNT
  • CURRENCY

There are many more common columns. There are other columns that are specific to one or the other type:

Payment only

  • STANDING_ORDER_ID – Only payments reference the standing order that may have created them.

Standing order only

  • PERIODICITY – This is a standing order’s core feature.

Now, in order to refactor that PL/SQL logic, I needed a row type that combines all these columns easily. I didn’t want to manually track the exact columns of both tables, I wanted this to always work, even if the maintainers of the tables add/remove/rename columns.

NATURAL FULL JOIN to the rescue

I created a view like this:

CREATE VIEW like_a_payment AS
SELECT *
FROM payment
NATURAL FULL JOIN standing_order
WHERE 1 = 0

This view itself doesn’t do anything. It just sits there and produces the row type I need. Why? Because the NATURAL JOIN will join by ALL the common columns, namely ID, CREATED_AT, MODIFIED_AT, DELETED, AMOUNT, CURRENCY, and then add the columns from the left table, namely STANDING_ORDER_ID, and then those from the right table, namely PERIODICITY.

I can now use this type in PL/SQL:

DECLARE
  paym like_a_payment%ROWTYPE;
BEGIN
  -- Do things...
END;

But why FULL?

True, I didn’t need FULL yet, because I’m not producing any rows anyway. Correct. But observe how I will populate the PAYM local variable in the PL/SQL block:

DECLARE
  l_paym_id like_a_payment.id%TYPE;
  l_paym    like_a_payment%ROWTYPE;
BEGIN
  l_paym_id := 1;

  SELECT *
  INTO l_paym
  FROM (
    SELECT * FROM payment WHERE id = l_paym_id
  )
  NATURAL FULL JOIN (
    SELECT * FROM standing_order WHERE id = l_paym_id
  );
END;

And I’m done! This could not have been done any more easily!

  • I’m using NATURAL because now I have to use it all the time to populate the record (or select individual columns, but meh)
  • I’m using FULL because only one of the two queries will produce a result, and the remaining columns will be NULL

Excellent!

Some observations

Of course, this is kind of a hack. Please don’t abuse this too often. Here are some hints to follow:

  • The refactoring would have been substantial. I needed a quick win.
  • The logic actually needs almost all the columns from the payments. Otherwise, you might want to review that SELECT * usage.
  • The derived tables are necessary in Oracle for performance reasons.
  • The actual code contained two distinct queries, each one joining “one side” with an empty substitute query, as we know that the ID is either a payment or a standing order.

Having said so, now you know one of the very few real world use-cases of NATURAL FULL JOIN. Put this in your code base and impress your coworkers!

Beautiful SQL: Lateral Unnesting of Array Columns


Sometimes, SQL can just be so beautiful. One of the less mainstream features in SQL is the array type (or nested collections). In fact, it’s so not mainstream that only 2 major databases actually support it: Oracle and PostgreSQL (and HSQLDB and H2 in the Java ecosystem).

In PostgreSQL, you can write:

CREATE TABLE blogs (
  id    SERIAL NOT NULL PRIMARY KEY,
  title text   NOT NULL,
  tags  text[]
)

Or in Oracle:

-- Oracle only knows nominal array types, so we have to declare
-- them in advance
CREATE TYPE tag_t AS VARRAY(100) OF VARCHAR2(100 CHAR);

CREATE TABLE blogs (
  id    NUMBER(18) GENERATED BY DEFAULT AS IDENTITY 
                   NOT NULL PRIMARY KEY,
  title VARCHAR2(100 CHAR) NOT NULL,
  tags  tag_t
)

So, roughly the same thing. Now, let’s insert some data. How about the 3 most recent posts on the jOOQ blog, prior to this one:

In PostgreSQL:

INSERT INTO blogs (title, tags)
VALUES (
  'How to Fetch Oracle 12c Implicit Cursors with JDBC and jOOQ',
  ARRAY[
    'implicit cursor',
    'batch',
    'oracle',
    'jooq',
    'jdbc',
    'resultset'
  ]
), (
  'How to Execute SQL Batches With JDBC and jOOQ',
  ARRAY[
    'batch',
    'batch statement',
    'mysql',
    'jooq',
    'jdbc',
    'sql server',
    'sql'
  ]
), (
  'How to Emulate Partial Indexes in Oracle',
  ARRAY[
    'optimisation',
    'index',
    'partial index',
    'oracle',
    'sql',
    'postgresql',
    't-sql',
    'sql server'
  ]
)

Or in Oracle:

INSERT INTO blogs (title, tags)
VALUES (
  'How to Fetch Oracle 12c Implicit Cursors with JDBC and jOOQ',
  tag_t(
    'implicit cursor',
    'batch',
    'oracle',
    'jooq',
    'jdbc',
    'resultset'
  ));
INSERT INTO blogs (title, tags)
VALUES (
  'How to Execute SQL Batches With JDBC and jOOQ',
  tag_t(
    'batch',
    'batch statement',
    'mysql',
    'jooq',
    'jdbc',
    'sql server',
    'sql'
  ));
INSERT INTO blogs (title, tags)
VALUES (
  'How to Emulate Partial Indexes in Oracle',
  tag_t(
    'optimisation',
    'index',
    'partial index',
    'oracle',
    'sql',
    'postgresql',
    't-sql',
    'sql server'
  ));

Now, the array type by itself is not very useful. Where it gets really interesting is when we unnest it again into a table. For instance in PostgreSQL:

SELECT title, tag
FROM blogs, LATERAL unnest(tags) AS tags(tag);

Or in Oracle:

-- Classic style
SELECT title, tags.*
FROM blogs, TABLE(tags) tags;

-- Since Oracle 12c
SELECT title, tags.*
FROM blogs, LATERAL (SELECT * FROM TABLE(tags)) tags;

Note that we’re using the keyword LATERAL in some of the above queries. For those of you who are used to T-SQL syntax, it’s almost the same thing as APPLY. Both LATERAL and APPLY are also very useful with table valued functions (stay tuned for a blog post on those).

The idea behind LATERAL is that the table (derived table, subquery, function call, array unnesting) on the right side of LATERAL can “laterally” access stuff from the left side of LATERAL in order to produce new tables. In the above query, we’re producing a new table of tags for each blog post, and then we cross join the two tables.

Here’s what the above queries result in:

title                                                         tag
-----------------------------------------------------------------------------
How to Fetch Oracle 12c Implicit Cursors with JDBC and jOOQ   implicit cursor
How to Fetch Oracle 12c Implicit Cursors with JDBC and jOOQ   batch
How to Fetch Oracle 12c Implicit Cursors with JDBC and jOOQ   oracle
How to Fetch Oracle 12c Implicit Cursors with JDBC and jOOQ   jooq
How to Fetch Oracle 12c Implicit Cursors with JDBC and jOOQ   jdbc
How to Fetch Oracle 12c Implicit Cursors with JDBC and jOOQ   resultset
How to Execute SQL Batches With JDBC and jOOQ                 batch
How to Execute SQL Batches With JDBC and jOOQ                 batch statement
How to Execute SQL Batches With JDBC and jOOQ                 mysql
How to Execute SQL Batches With JDBC and jOOQ                 jooq
How to Execute SQL Batches With JDBC and jOOQ                 jdbc
How to Execute SQL Batches With JDBC and jOOQ                 sql server
How to Execute SQL Batches With JDBC and jOOQ                 sql
How to Emulate Partial Indexes in Oracle                      optimisation
How to Emulate Partial Indexes in Oracle                      index
How to Emulate Partial Indexes in Oracle                      partial index
How to Emulate Partial Indexes in Oracle                      oracle
How to Emulate Partial Indexes in Oracle                      sql
How to Emulate Partial Indexes in Oracle                      postgresql
How to Emulate Partial Indexes in Oracle                      t-sql
How to Emulate Partial Indexes in Oracle                      sql server

You can immediately see the cross join semantics here, as we’re combining each tag (per post) with its post.
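
As an aside, for this simple case, PostgreSQL also accepts a set-returning function directly in the SELECT clause, which is (as a sketch) equivalent to the LATERAL version above:

SELECT title, unnest(tags) AS tag
FROM blogs;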

Looking for ordinals (i.e. the tag number inside of the array) along with the array? Easy!

Just add the powerful WITH ORDINALITY clause after the UNNEST() call in PostgreSQL:

SELECT title, tag, ordinal
FROM blogs, LATERAL unnest(tags) WITH ORDINALITY AS tags(tag, ordinal);

A bit more complicated to emulate in Oracle:

-- Fancy, with a window function
SELECT title, tags.*
FROM blogs, LATERAL (
  SELECT tags.*, ROW_NUMBER() OVER (ORDER BY NULL) AS ordinal
  FROM TABLE(tags) tags
) tags;

-- Classic, with ROWNUM
SELECT title, tags.*
FROM blogs, LATERAL (
  SELECT tags.*, ROWNUM AS ordinal
  FROM TABLE(tags) tags
) tags;

The result now contains the tag “ID”, i.e. the ordinal of the tag inside of the array:

title                                           tag               ordinal
-------------------------------------------------------------------------
How to Fetch ... Cursors with JDBC and jOOQ     implicit cursor   1
How to Fetch ... Cursors with JDBC and jOOQ     batch             2
How to Fetch ... Cursors with JDBC and jOOQ     oracle            3
How to Fetch ... Cursors with JDBC and jOOQ     jooq              4
How to Fetch ... Cursors with JDBC and jOOQ     jdbc              5
How to Fetch ... Cursors with JDBC and jOOQ     resultset         6
How to Execute SQL Batches With JDBC and jOOQ   batch             1
How to Execute SQL Batches With JDBC and jOOQ   batch statement   2
How to Execute SQL Batches With JDBC and jOOQ   mysql             3
How to Execute SQL Batches With JDBC and jOOQ   jooq              4
How to Execute SQL Batches With JDBC and jOOQ   jdbc              5
How to Execute SQL Batches With JDBC and jOOQ   sql server        6
How to Execute SQL Batches With JDBC and jOOQ   sql               7
How to Emulate Partial Indexes in Oracle        optimisation      1
How to Emulate Partial Indexes in Oracle        index             2
How to Emulate Partial Indexes in Oracle        partial index     3
How to Emulate Partial Indexes in Oracle        oracle            4
How to Emulate Partial Indexes in Oracle        sql               5
How to Emulate Partial Indexes in Oracle        postgresql        6
How to Emulate Partial Indexes in Oracle        t-sql             7
How to Emulate Partial Indexes in Oracle        sql server        8

Now, imagine looking for those blog posts that are tagged “jooq”. Easy!

PostgreSQL:

SELECT title
FROM blogs
WHERE 'jooq' = ANY(tags);

Oracle:

SELECT title
FROM blogs
WHERE 'jooq' IN (SELECT * FROM TABLE(tags));

Yielding:

title
-----------------------------------------------------------
How to Fetch Oracle 12c Implicit Cursors with JDBC and jOOQ
How to Execute SQL Batches With JDBC and jOOQ

Conclusion

These are just a few nice things we can do when we denormalise our data into nested collections / arrays, and then use features like UNNEST to bring them back to the table level. Both Oracle and PostgreSQL support a variety of really nice features building on top of arrays, so do check them out!

How to Fetch Oracle 12c Implicit Cursors with JDBC and jOOQ


Earlier this week, I blogged about how to execute SQL batches with JDBC and jOOQ. This was useful for the MySQL, SQL Server, and Sybase users among you.

Today, we’ll discuss a slightly more difficult task, how to fetch Oracle 12c implicit cursors – which are essentially the same thing.

What’s an implicit cursor?

Oracle 12c added new procedures to its dynamic SQL API DBMS_SQL. Just run the following block in SQL Developer to see the results:

DECLARE
  c1 sys_refcursor;
  c2 sys_refcursor;
BEGIN
  OPEN c1 FOR SELECT 1 AS a FROM dual;
  dbms_sql.return_result(c1);
  OPEN c2 FOR SELECT 2 AS b FROM dual;
  dbms_sql.return_result(c2);
END;

The anonymous PL/SQL block contains two cursors that are opened and returned to whoever calls this block using DBMS_SQL.RETURN_RESULT. This is kind of magic, as we’re calling a procedure, passing a cursor to it, and somehow, this has a side effect on the client of this program after the program ends.

Not only can you do this in anonymous PL/SQL blocks, you can of course nest these calls in any procedure. So, in other words, from Oracle 12c onwards, when you call a procedure, you don’t know for sure whether it will return more results than what you can see. For instance:

BEGIN
  any_procedure();
END;

The above call might just as well yield some implicit cursors. You can’t know for sure.
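
For illustration, such a procedure might look like this (a hypothetical sketch; ANY_PROCEDURE is just the made-up name from the block above):

CREATE OR REPLACE PROCEDURE any_procedure AS
  c sys_refcursor;
BEGIN
  -- Do some work, then hand an implicit result set back to the client:
  OPEN c FOR SELECT 1 AS a FROM dual;
  dbms_sql.return_result(c);
END;
/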

How to discover implicit cursors with JDBC

With JDBC, if you don’t know for sure what your query will yield as a result, you use the Statement.execute(String), or the PreparedStatement.execute() method to find out. As mentioned in the previous post, this is what you would do:

try (PreparedStatement s = c.prepareStatement(sql)) {
    fetchLoop:
    for (int i = 0, updateCount = 0;; i++) {
        boolean result = (i == 0)
            ? s.execute()
            : s.getMoreResults();
 
        if (result)
            try (ResultSet rs = s.getResultSet()) {
                System.out.println("\nResult:");
 
                while (rs.next())
                    System.out.println("  " + rs.getInt(1));
            }
        else if ((updateCount = s.getUpdateCount()) != -1)
            System.out.println("\nUpdate Count: " + updateCount);
        else
            break fetchLoop;
    }
}

Unfortunately, that won’t work on Oracle, as Oracle’s JDBC driver doesn’t implement the JDBC spec correctly. I’ve documented this flaw at length in this Stack Overflow question here.

Using ojdbc, the following “improved” loop needs to be written:

/* Alternatively, use this for non-PreparedStatements:
try (Statement s = cn.createStatement()) {
    Boolean result = s.execute(sql); */
try (PreparedStatement s = cn.prepareStatement(sql)) {
    // Use good old three-valued boolean logic
    Boolean result = s.execute();

    fetchLoop:
    for (int i = 0;; i++) {

        // Check for more results if not already done in 
        // this iteration
        if (i > 0 && result == null)
            result = s.getMoreResults();
        System.out.println(result);

        if (result) {
            result = null;

            try (ResultSet rs = s.getResultSet()) {
                System.out.println("Fetching result " + i);
            }
            catch (SQLException e) {
                // Ignore ORA-17283: No resultset available (1)
                if (e.getErrorCode() == 17283)
                    continue fetchLoop;
                else
                    throw e;
            }
        }
        else if (s.getUpdateCount() == -1)
            // Ignore -1 value if there is one more result! (2)
            if (result = s.getMoreResults())
                continue fetchLoop;
            else
                break fetchLoop;
    }
}

Two elements of the above logic need more explanation:

  1. There’s a possibility of an ORA-17283: No resultset available error being raised when accessing the Statement.getResultSet() despite the previous call to Statement.execute() yielding true. If that happens, we’ll just ignore the error and try fetching another result set
  2. In case we’re using PreparedStatement, the original call to PreparedStatement.execute() will yield false (!) and the Statement.getUpdateCount() value is -1, which would normally mean that we should stop. Not in this case. Let’s just try one more time to get a result set, and tah-dah, here are our implicit result sets.

Note that the algorithm now works with both static Statement and PreparedStatement, which (very unfortunately) behave differently when calling execute().

The above will now work with any SQL statement. In case you’re using the previous SQL statement returning implicit cursors:

String sql =
    "\nDECLARE"
  + "\n  c1 sys_refcursor;"
  + "\n  c2 sys_refcursor;"
  + "\nBEGIN"
  + "\n  OPEN c1 FOR SELECT 1 AS a FROM dual;"
  + "\n  dbms_sql.return_result(c1);"
  + "\n  OPEN c2 FOR SELECT 2 AS a FROM dual;"
  + "\n  dbms_sql.return_result(c2);"
  + "\nEND;";

… you will now be able to fetch all the results:

true
true
Fetching result 1
true
Fetching result 2
false

How to get those cursors with jOOQ?

With jOOQ 3.10 (as always), you don’t need to worry about those low level JDBC details. Just call the following code:

System.out.println(
    DSL.using(cn).fetchMany(sql)
);

And you’ll get a convenient, object oriented representation of your multiple result sets in the form of an org.jooq.Results:

Result set:
+----+
|   A|
+----+
|   1|
+----+
Result set:
+----+
|   A|
+----+
|   2|
+----+

Even better, when you use a code generator to return multiple implicit cursors like this in a stored procedure, just call the generated stored procedure object like this, to get all the cursors automatically:

MyProcedure p = new MyProcedure();
p.setParameter1(x);
p.setParameter2(y);
p.execute(configuration);
Results results = p.getResults();

for (Result<?> result : results)
  for (Record record : result)
    System.out.println(record);

Done!

How to Execute SQL Batches With JDBC and jOOQ


Some databases (in particular MySQL and T-SQL databases like SQL Server and Sybase) support a very nice feature: They allow for running a “batch” of statements in a single statement. For instance, in SQL Server, you can do something like this:

-- Statement #1
DECLARE @table AS TABLE (id INT);

-- Statement #2
SELECT * FROM @table;

-- Statement #3
INSERT INTO @table VALUES (1),(2),(3);

-- Statement #4
SELECT * FROM @table;

This is a batch of 4 statements, and it can be executed as a single statement both with JDBC and with jOOQ. Let’s see how:

Executing a batch with JDBC

Unfortunately, the term “batch” has several meanings, and in this case, I don’t mean the JDBC Statement.addBatch() method, which is actually a bit clumsy as it doesn’t allow for fetching mixed update counts and result sets.

Instead, what I’ll be doing is this:

String sql =
    "\n  -- Statement #1                              "
  + "\n  DECLARE @table AS TABLE (id INT);            "
  + "\n                                               "
  + "\n  -- Statement #2                              "
  + "\n  SELECT * FROM @table;                        "
  + "\n                                               "
  + "\n  -- Statement #3                              "
  + "\n  INSERT INTO @table VALUES (1),(2),(3);       "
  + "\n                                               "
  + "\n  -- Statement #4                              "
  + "\n  SELECT * FROM @table;                        ";

try (PreparedStatement s = c.prepareStatement(sql)) {
    fetchLoop:
    for (int i = 0, updateCount = 0;; i++) {
        boolean result = (i == 0)
            ? s.execute()
            : s.getMoreResults();

        if (result)
            try (ResultSet rs = s.getResultSet()) {
                System.out.println("\nResult:");

                while (rs.next())
                    System.out.println("  " + rs.getInt(1));
            }
        else if ((updateCount = s.getUpdateCount()) != -1)
            System.out.println("\nUpdate Count: " + updateCount);
        else
            break fetchLoop;
    }
}

The output of the above program being:

Result:

Update Count: 3

Result:
  1
  2
  3

The above API usage is a somewhat “hidden”, or at least not everyday, usage of the JDBC API. Mostly, you’ll be using Statement.executeQuery() when you’re expecting a ResultSet, or Statement.executeUpdate() otherwise.

But in our case, we don’t really know what’s happening. We’re going to discover the result types on the fly, when executing the statement. Here are the main JDBC API features that we’re using, along with an explanation:

  • Statement.execute(): This method should be used if we don’t know the result type. The method returns a boolean, which is true when the first statement in the batch produced a ResultSet and false otherwise.
  • Statement.getMoreResults(): This method returns the same kind of boolean value as the previous Statement.execute() method, but it does so for the next statement in the batch (i.e. for every statement except the first).
  • If the current result is a ResultSet (the boolean was true), then we’ll obtain that ResultSet through Statement.getResultSet() (we can obviously no longer call the usual Statement.executeQuery() to obtain the ResultSet).
  • If the current result is not a ResultSet (the boolean was false), then we’ll check the update count value through Statement.getUpdateCount().
  • If the update count is -1, then we’ve reached the end of the batch.

What a nice state machine!

The nice thing about this is that a batch may be completely nondeterministic. E.g. there may be triggers, T-SQL blocks (e.g. an IF statement), stored procedures, and many other things that contribute result sets and/or update counts. In some cases, we simply don’t know what we’ll get.
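
For instance (a sketch), this T-SQL batch contributes either a result set or an update count, depending on runtime state:

DECLARE @t AS TABLE (id INT);
INSERT INTO @t VALUES (1), (2);

-- Whether this block yields a result set or an update count
-- depends on the time of day:
IF DATEPART(HOUR, GETDATE()) < 12
  SELECT * FROM @t;
ELSE
  DELETE FROM @t;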

Executing a batch with jOOQ

It’s great that the JDBC API designers have thought of this exotic API usage on a rather low level. This makes JDBC extremely powerful. But who remembers the exact algorithm all the time? After all, the above minimalistic version required around 20 lines of code for something as simple as that.

Compare this to the following jOOQ API usage:

System.out.println(
    DSL.using(c).fetchMany(sql)
);

The result being:

Result set:
+----+
|  id|
+----+
Update count: 3
Result set:
+----+
|  id|
+----+
|   1|
|   2|
|   3|
+----+

Huh! Couldn’t get much simpler than that! Let’s walk through what happens:

The DSLContext.fetchMany() method is intended for use when users know there will be many result sets and/or update counts. Unlike JDBC, which reuses the ordinary statement API for this, jOOQ has a distinct method here to clearly distinguish between behaviours. The method then eagerly fetches all the results and update counts in one go (lazy fetching is on the roadmap with issue #4503).

The resulting type is org.jooq.Results, a type that extends List<Result>, which allows for iterating over the results only, for convenience. If a mix of results and update counts needs to be consumed, the Results.resultsOrRows() method can be used.