Don’t Migrate to MariaDB just yet. MySQL is Back!

Now that I have your attention, I’d like to invite you to a critical review of where we’re at in the MySQL vs. MariaDB debate. Around one month ago, I visited Oracle Open World 2014, and I’ve met with Morgan Tocker, the MySQL community manager at Oracle to learn about where MySQL is heading.

Who “is” MySQL

An interesting learning for myself is the fact that according to Morgan, there had been quite a few former MySQL AB employees that stayed with MySQL when Sun acquired the database, and that are still there now (perhaps after a short break), now that Oracle owns Sun and thus MySQL. In fact, right now as we speak, Oracle is pushing hard to get even more people on board of the MySQL team. When this article was written, there were 21 open MySQL jobs on this blog post dating February 25, 2014.

More details about Oracle’s plans to promote and push MySQL can be seen in this presentation by Morten Andersen:

oracle-mysql

So, if you’re still contemplating a migration to MariaDB, maybe give this all a second thought. MySQL is still the database that is being used by most large companies, including Facebook and LinkedIn. MySQL won’t go away and it will get much better in the next couple of years – maybe even to a point where it has a similar amount of features as PostgreSQL? We can only hope.

Need another convincing argument? You can now use Oracle Enterprise Manager for both Oracle and MySQL databases. How awesome is that!

enterprise-manager

Stay tuned for a couple of additional blog posts on this blog, about what’s new and what’s on the MySQL roadmap in the next years. See all of Morten Andersen’s slides here:

Frightening Facts about MySQL

So you might’ve seen Destroy all Software’s talk about JavaScript:

Here’s a similar talk (less funny more scary) about MySQL:

Should I Put That Table Alias or Not?

Infrequent SQL developers often get confused about when to put parentheses and/or aliases on derived tables. There has been this recent Reddit discussion about the subject, where user Elmhurstlol was wondering why they needed to provide an alias to the derived table (the subselect with the UNION) in the following query:

SELECT AVG(price) AS AVG_PRICE
FROM (
  SELECT price from product a JOIN pc b
  ON a.model=b.model AND maker='A'
  UNION ALL
  SELECT price from product a JOIN laptop c
  ON a.model=c.model and maker='A'
) hello

The question really was about why the "hello" table alias was necessary, because often it seems not to be required.

Here’s what the SQL standard states

If in doubt, it is often useful to consider the SQL standard about the rationale behind some syntax elements. In this case, let’s consider the freely available SQL 1992 standard text (for simplicity), and see how it specifies table references:

<table reference> ::=
    <table name> [ [ AS ] <correlation name>
        [ <left paren> <derived column list> 
          <right paren> ] ]
  | <derived table> [ AS ] <correlation name>
        [ <left paren> <derived column list>
          <right paren> ]
  | <joined table>

<derived table> ::= <table subquery>

<table subquery> ::= <subquery>

<subquery> ::= <left paren> <query expression>
               <right paren>

(for more information about the awesome and completely underused derived column list feature, read this article here)

The essence of the above syntax specification is this:

  • A derived table MUST always be aliased
  • The AS keyword is optional, for improved readability
  • The parentheses MUST always be put around subqueries

Following these rules, you’ll be pretty safe in most SQL dialects. Here are some deviations to the above, though:

  • Some dialects allow for unaliased derived tables. However, this is still a bad idea, because you will not be able to unambiguously qualify a column from such a derived table

Takeaway

Always provide a meaningful alias to your derived tables. As simple as that.

Yet Another 10 Common Mistakes Java Developers Make When Writing SQL (You Won’t BELIEVE the Last One)

(Sorry for that click-bait heading. Couldn’t resist ;-) )

We’re on a mission. To teach you SQL. But mostly, we want to teach you how to appreciate SQL. You’ll love it!

Getting SQL right or wrong shouldn’t be about that You’re-Doing-It-Wrong™ attitude that can be encountered often when evangelists promote their object of evangelism. Getting SQL right should be about the fun you’ll have once you do get it right. The things you start appreciating when you notice that you can easily replace 2000 lines of slow, hard-to-maintain, and ugly imperative (or object-oriented) code with 300 lines of lean functional code (e.g. using Java 8), or even better, with 50 lines of SQL.

We’re glad to see that our blogging friends have started appreciating SQL, and most specifically, window functions after reading our posts. For instance, take

So, after our previous, very popular posts:

… we’ll bring you:

Yet Another 10 Common Mistakes Java Developer Make When Writing SQL

And of course, this doesn’t apply to Java developers alone, but it’s written from the perspective of a Java (and SQL) developer. So here we go (again):

1. Not Using Window Functions

After all that we’ve been preaching, this must be our number 1 mistake in this series. Window functions are probably the coolest SQL feature of them all. They’re so incredibly useful, they should be the number one reason for anyone to switch to a better database, e.g. PostgreSQL:

If free and/or Open Source is important to you, you have absolutely no better choice than using PostgreSQL (and you’ll even get to use the free jOOQ Open Source Edition, if you’re a Java developer).

And if you’re lucky enough to work in an environment with Oracle or SQL Server (or DB2, Sybase) licenses, you get even more out of your new favourite tool.

We won’t repeat all the window function goodness in this section, we’ve blogged about them often enough:

The Cure:

Start playing with window functions. You’ll never go back, guaranteed.

2. Not declaring NOT NULL constraints

This one was already part of a previous list where we claimed that you should add as much metadata as possible to your schema, because your database will be able to leverage that metadata for optimisations. For instance, if your database knows that a foreign key value in BOOK.AUTHOR_ID must also be contained exactly once in AUTHOR.ID, then a whole set of optimisations can be achieved in complex queries.

Now let’s have another look at NOT NULL constraints. If you’re using Oracle, NULL values will not be part of your index. This doesn’t matter if you’re expressing an IN constraint, for instance:

SELECT * FROM table
WHERE value IN (
  SELECT nullable_column FROM ...
)

But what happens with a NOT IN constraint?

SELECT * FROM table
WHERE value NOT IN (
  SELECT nullable_column FROM ...
)

Due to SQL’s slightly unintuitive way of handling NULL, there is a slight risk of the second query unexpectedly not returning any results at all, namely if there is at least one NULL value as a result from the subquery. This is true for all databases that get SQL right.

But because the index on nullable_column doesn’t contain any NULL values, Oracle has to look up the complete content in the table, resulting in a FULL TABLE SCAN. Now that is unexpected! Details about this can be seen in this article.

The Cure:

Carefully review all your nullable, yet indexed columns, and check if you really cannot add a NOT NULL constraint to those columns.

The Tool:

If you’re using Oracle, use this query to detect all nullable, yet indexed columns:

SELECT
  i.table_name,
  i.index_name,
  LISTAGG(
    LPAD(i.column_position,  2) || ': ' || 
    RPAD(i.column_name    , 30) || ' '  ||
    DECODE(t.nullable, 'Y', '(NULL)', '(NOT NULL)'), 
    ', '
  ) WITHIN GROUP (ORDER BY i.column_position) 
    AS "NULLABLE columns in indexes"
FROM user_ind_columns i
JOIN user_tab_cols t
ON (t.table_name, t.column_name) = 
  ((i.table_name, i.column_name))
WHERE EXISTS (
  SELECT 1
  FROM user_tab_cols t
  WHERE (t.table_name, t.column_name, t.nullable) = 
       ((i.table_name, i.column_name, 'Y'       ))
)
GROUP BY i.table_name, i.index_name
ORDER BY i.index_name ASC;

Example output:

TABLE_NAME | INDEX_NAME   | NULLABLE columns in indexes
-----------+--------------+----------------------------
PERSON     | I_PERSON_DOB | 1: DATE_OF_BIRTH (NULL)

And then, fix it!

(Accidental criticism of Maven is irrelevant here ;-) )

If you’re curious about more details, see also these posts:

3. Using PL/SQL Package State

Now, this is a boring one if you’re not using Oracle, but if you are (and you’re a Java developer), be very wary of PL/SQL package state. Are you really doing what you think you’re doing?

Yes, PL/SQL has package-state, e.g.

CREATE OR REPLACE PACKAGE pkg IS
  -- Package state here!
  n NUMBER := 1;

  FUNCTION next_n RETURN NUMBER;
END pkg;

CREATE OR REPLACE PACKAGE BODY pkg IS
  FUNCTION next_n RETURN NUMBER
  IS
  BEGIN
    n := n + 1;
    RETURN n;
  END next_n;
END pkg;

Wonderful, so you’ve created yourself an in-memory counter that generates a new number every time you call pkg.next_n. But who owns that counter? Yes, the session. Each session has their own initialised “package instance”.

But no, it’s probably not the session you might have thought of.

We Java developers connect to databases through connection pools. When we obtain a JDBC Connection from such a pool, we recycle that connection from a previous “session”, e.g. a previous HTTP Request (not HTTP Session!). But that’s not the same. The database session (probably) outlives the HTTP Request and will be inherited by the next request, possibly from an entirely different user. Now, imagine you had a credit card number in that package…?

Not The Cure:

Nope. Don’t just jump to using SERIALLY_REUSABLE packages

CREATE OR REPLACE PACKAGE pkg IS
  PRAGMA SERIALLY_REUSABLE;
  n NUMBER := 1;

  FUNCTION next_n RETURN NUMBER;
END pkg;

Because:

  • You cannot even use that package from SQL, now (see ORA-06534).
  • Mixing this PRAGMA with regular package state from other packages just makes things a lot more complex.

So, don’t.

Not The Cure:

I know. PL/SQL can be a beast. It often seems like such a quirky language. But face it. Many things run much much faster when written in PL/SQL, so don’t give up, just yet. Dropping PL/SQL is not the solution either.

The Cure:

At all costs, try to avoid package state in PL/SQL. Think of package state as of static variables in Java. While they might be useful for caches (and constants, of course) every now and then, you might not actually access that state that you wanted. Think about load-balancers, suddenly transferring you to another JVM. Think about class loaders, that might have loaded the same class twice, for some reason.

Instead, pass state as arguments through procedures and functions. This will avoid side-effects and make your code much cleaner and more predictable.

Or, obviuously, persist state to some table.

4. Running the same query all the time

Master data is boring. You probably wrote some utility to get the latest version of your master data (e.g. language, locale, translations, tenant, system settings), and you can query it every time, once it is available.

At all costs, don’t do that. You don’t have to cache many things in your application, as modern databases have grown to be extremely fast when it comes to caching:

  • Table / column content
  • Index content
  • Query / materialized view results
  • Procedure results (if they’re deterministic)
  • Cursors
  • Execution plans

So, for your average query, there’s virtually no need for an ORM second-level cache, at least from a performance perspective (ORM caches mainly fulfil other purposes, of course).

But when you query master data, i.e. data that never changes, then, network latency, traffic and many other factors will impair your database experience.

The Cure:

Please do take 10 minutes, download Guava, and use its excellent and easy to set up cache, that ships with various built-in invalidation strategies. Choose time-based invalidation (i.e. polling), choose Oracle AQ or Streams, or PostgreSQL’s NOTIFY for event-based invalidation, or just make your cache permanent, if it doesn’t matter. But don’t issue an identical master data query all the time.

… This obviously brings us to

5. Not knowing about the N+1 problem

You had a choice. At the beginning of your software product, you had to choose between:

So, obviously, you chose an ORM, because otherwise you wouldn’t be suffering from “N+1”. What does “N+1” mean?

The accepted answer on this Stack Overflow question explains it nicely. Essentially, you’re running:

SELECT * FROM book

-- And then, for each book:
SELECT * FROM author WHERE id = ?
SELECT * FROM author WHERE id = ?
SELECT * FROM author WHERE id = ?

Of course, you could go and tweak your hundreds of annotations to correctly prefetch or eager fetch each book’s associated author information to produce something along the lines of:

SELECT * 
FROM   book
JOIN   author 
  ON   book.author_id = author.id

But that would be an awful lot of work, and you’ll risk eager-fetching too many things that you didn’t want, resulting in another performance issue.

Maybe, you could upgrade to JPA 2.1 and use the new @NamedEntityGraph to express beautiful annotation trees like this one:

@NamedEntityGraph(
    name = "post",
    attributeNodes = {
        @NamedAttributeNode("title"),
        @NamedAttributeNode(
            value = "comments", 
            subgraph = "comments"
        )
    },
    subgraphs = {
        @NamedSubgraph(
            name = "comments",
            attributeNodes = {
                @NamedAttributeNode("content")
            }
        )
    }
)

The example was taken from this blog post by Hantsy Bai. Hantsy then goes on explaining that you can use the above beauty through the following statement:

em.createQuery("select p from Post p where p.id=:id",
               Post.class)
  .setHint("javax.persistence.fetchgraph", 
           postGraph)
  .setParameter("id", this.id)
  .getResultList()
  .get(0);

Let us all appreciate the above application of JEE standards with all due respect, and then consider…

The Cure:

You just listen to the wise words at the beginning of this article and replace thousands of lines of tedious Java / Annotatiomania™ code with a couple of lines of SQL. Because that will also likely help you prevent another issue that we haven’t even touched yet, namely selecting too many columns as you can see in these posts:

Since you’re already using an ORM, this might just mean resorting to native SQL – or maybe you manage to express your query with JPQL. Of course, we agree with Alessio Harri in believing that you should use jOOQ together with JPA:

The Takeaway:

While the above will certainly help you work around some real world issues that you may have with your favourite ORM, you could also take it one step further and think about it this way. After all these years of pain and suffering from the object-relational impedance mismatch, the JPA 2.1 expert group is now trying to tweak their way out of this annotation madness by adding more declarative, annotation-based fetch graph hints to JPQL queries, that no one can debug, let alone maintain.

The alternative is simple and straight-forward SQL. And with Java 8, we’ll add functional transformation through the Streams API. That’s hard to beat.

But obviuosly, your views and experiences on that subject may differ from ours, so let’s head on to a more objective discussion about…

6. Not using Common Table Expressions

While common table expressions obviously offer readability improvements, they may also offer performance improvements. Consider the following query that I have recently encountered in a customer’s PL/SQL package (not the actual query):

SELECT round (
  (SELECT amount FROM payments WHERE id = :p_id)
    * 
  (
    SELECT e.bid
    FROM   currencies c, exchange_rates e
    WHERE  c.id     = 
      (SELECT cur_id FROM payments WHERE id = :p_id)
    AND    e.cur_id = 
      (SELECT cur_id FROM payments WHERE id = :p_id)
    AND    e.org_id = 
      (SELECT org_id FROM payments WHERE id = :p_id)
  ) / (
    SELECT c.factor
    FROM   currencies c, exchange_rates e
    WHERE  c.id     = 
      (SELECT cur_id FROM payments WHERE id = :p_id)
    AND    e.cur_id = 
      (SELECT cur_id FROM payments WHERE id = :p_id)
    AND    e.org_id = 
      (SELECT org_id FROM payments WHERE id = :p_id)
  ), 0
) 
INTO amount 
FROM dual;

So what does this do? This essentially converts a payment’s amount from one currency into another. Let’s not delve into the business logic too much, let’s head straight to the technical problem. The above query results in the following execution plan (on Oracle):

------------------------------------------------------
| Operation                         | Name           |
------------------------------------------------------
| SELECT STATEMENT                  |                |
|  TABLE ACCESS BY INDEX ROWID      | PAYMENTS       |
|   INDEX UNIQUE SCAN               | PAYM_PK        |
|   NESTED LOOPS                    |                |
|    INDEX UNIQUE SCAN              | CURR_PK        |
|     TABLE ACCESS BY INDEX ROWID   | PAYMENTS       |
|      INDEX UNIQUE SCAN            | PAYM_PK        |
|    TABLE ACCESS BY INDEX ROWID    | EXCHANGE_RATES |
|     INDEX UNIQUE SCAN             | EXCH_PK        |
|      TABLE ACCESS BY INDEX ROWID  | PAYMENTS       |
|       INDEX UNIQUE SCAN           | PAYM_PK        |
|      TABLE ACCESS BY INDEX ROWID  | PAYMENTS       |
|       INDEX UNIQUE SCAN           | PAYM_PK        |
|     NESTED LOOPS                  |                |
|      TABLE ACCESS BY INDEX ROWID  | CURRENCIES     |
|       INDEX UNIQUE SCAN           | CURR_PK        |
|        TABLE ACCESS BY INDEX ROWID| PAYMENTS       |
|         INDEX UNIQUE SCAN         | PAYM_PK        |
|      INDEX UNIQUE SCAN            | EXCH_PK        |
|       TABLE ACCESS BY INDEX ROWID | PAYMENTS       |
|        INDEX UNIQUE SCAN          | PAYM_PK        |
|       TABLE ACCESS BY INDEX ROWID | PAYMENTS       |
|        INDEX UNIQUE SCAN          | PAYM_PK        |
|  FAST DUAL                        |                |
------------------------------------------------------

The actual execution time is negligible in this case, but as you can see, the same objects are accessed again and again within the query. This is a violation of Common Mistake #4: Running the same query all the time.

The whole thing would be so much easier to read, maintain, and for Oracle to execute, if we had used a common table expression. From the original source code, observe the following thing:

-- We're always accessing a single payment:
  FROM payments WHERE id = :p_id

-- Joining currencies and exchange_rates twice:
  FROM currencies c, exchange_rates e

So, let’s factor out the payment first:

-- "payment" contains only a single payment
-- But it contains all the columns that we'll need
-- afterwards
WITH payment AS (
    SELECT cur_id, org_id, amount
    FROM   payments
    WHERE  id = :p_id
)
SELECT round(p.amount * e.bid / c.factor, 0)

-- Then, we simply don't need to repeat the
-- currencies / exchange_rates joins twice
FROM   payment p
JOIN   currencies c     ON  p.cur_id = c.id
JOIN   exchange_rates e ON  e.cur_id = p.cur_id
                        AND e.org_id = p.org_id

Note, that we’ve also replaced table lists with ANSI JOINs as suggested in our previous list

You wouldn’t believe it’s the same query, would you? And what about the execution plan? Here it is!

---------------------------------------------------
| Operation                      | Name           |
---------------------------------------------------
| SELECT STATEMENT               |                |
|  NESTED LOOPS                  |                |
|   NESTED LOOPS                 |                |
|    NESTED LOOPS                |                |
|     FAST DUAL                  |                |
|     TABLE ACCESS BY INDEX ROWID| PAYMENTS       |
|      INDEX UNIQUE SCAN         | PAYM_PK        |
|    TABLE ACCESS BY INDEX ROWID | EXCHANGE_RATES |
|     INDEX UNIQUE SCAN          | EXCH_PK        |
|   TABLE ACCESS BY INDEX ROWID  | CURRENCIES     |
|    INDEX UNIQUE SCAN           | CURR_PK        |
---------------------------------------------------

No doubt that this is much much better.

The Cure:

If you’re lucky enough and you’re using one of those databases that supports window functions, chances are incredibly high (100%) that you also have common table expression support. (Even MySQL 8.0 will finally have them).

Common table expressions are like local variables in SQL. In every large statement, you should consider using them, as soon as you feel that you’ve written something before.

The Takeaway:

Some databases (e.g. PostgreSQL, or SQL Server) also support common table expressions for DML statements. In other words, you can write:

WITH ...
UPDATE ...

This makes DML incredibly more powerful.

7. Not using row value expressions for UPDATEs

We’ve advertised the use of row value expressions in our previous listing. They’re very readable and intuitive, and often also promote using certain indexes, e.g. in PostgreSQL.

But few people know that they can also be used in an UPDATE statement, in most databases. Check out the following query, which I again found in a customer’s PL/SQL package (simplified again, of course):

UPDATE u
SET n = (SELECT n + 1    FROM t WHERE u.n = t.n),
    s = (SELECT 'x' || s FROM t WHERE u.n = t.n),
    x = 3;

So this query takes a subquery as a data source for updating two columns, and the third column is updated “regularly”. How does it perform? Moderately:

-----------------------------
| Operation          | Name |
-----------------------------
| UPDATE STATEMENT   |      |
|  UPDATE            | U    |
|   TABLE ACCESS FULL| U    |
|   TABLE ACCESS FULL| T    |
|   TABLE ACCESS FULL| T    |
-----------------------------

Let’s ignore the full table scans, as this query is constructed. The actual query could leverage indexes. But T is accessed twice, i.e. in both subqueries. Oracle didn’t seem to be able to apply scalar subquery caching in this case.

To the rescue: row value expressions. Let’s simply rephrase our UPDATE to this:

UPDATE u
SET (n, s) = ((
      SELECT n + 1, 'x' || s FROM t WHERE u.n = t.n
    )),
    x = 3;

Let’s ignore the funny, Oracle-specific double-parentheses syntax for the right hand side of such a row value expression assignment, but let’s appreciate the fact that we can easily assign a new value to the tuple (n, s) in one go! Note, we could have also written this, instead, and assign x as well:

UPDATE u
SET (n, s, x) = ((
      SELECT n + 1, 'x' || s, 3 
      FROM t WHERE u.n = t.n
    ));

As you will have expected, the execution plan has also improved, and T is accessed only once:

-----------------------------
| Operation          | Name |
-----------------------------
| UPDATE STATEMENT   |      |
|  UPDATE            | U    |
|   TABLE ACCESS FULL| U    |
|   TABLE ACCESS FULL| T    |
-----------------------------

The Cure:

Use row value expressions. Where ever you can. They make your SQL code incredibly more expressive, and chances are, they make it faster, as well.

Note that the above is supported by jOOQ’s UPDATE statement. This is the moment we would like to make you aware of this cheap, in-article advertisement:

jOOQ - The best way to write SQL in Java

;-)

8. Using MySQL when you could use PostgreSQL

Disclaimer: I wrote this article in 2014. Since then, MySQL has made tremendous progress, so I wouldn’t bet all my money on PostgreSQL any longer, when comparing it with MySQL! So, read this tip with the 2014 context in mind!

To some, this may appear to be a bit of a hipster discussion. But let’s consider the facts:

  • MySQL claims to be the “most popular Open Source database”.
  • PostgreSQL claims to be the “most advanced Open Source database”.

Let’s consider a bit of history. MySQL has always been very easy to install, maintain, and it has had a great and active community. This has lead to MySQL still being the RDBMS of choice with virtually every web hoster on this planet. Those hosters also host PHP, which was equally easy to install, and maintain.

BUT!

We Java developers tend to have an opinion about PHP, right? It’s summarised by this image here:

The PHP Hammer

The PHP Hammer

Well, it works, but how does it work?

The same can be said about MySQL. MySQL has always worked somehow, but while commercial databases like Oracle have made tremendous progress both in terms of query optimisation and feature scope, MySQL has hardly moved in the last decade.

Many people choose MySQL primarily because of its price (USD $ 0.00). But often, the same people have found MySQL to be slow and quickly concluded that SQL is slow per se – without evaluating the options. This is also why all NoSQL stores compare themselves with MySQL, not with Oracle, the database that has been winning the Transaction Processing Performance Council’s (TPC) benchmarks almost forever. Some examples:

While the last article bluntly adds “(and other RDBMS)” it doesn’t go into any sort of detail whatsoever, what those “other RDBMS” do wrong. It really only compares MongoDB with MySQL.

The Cure:

We say: Stop complaining about SQL, when in fact, you’re really complaining about MySQL. There are at least four very popular databases out there that are incredibly good. These are:

(just kidding about the last one, of course)

Again, remember, this was said in 2014. Today, MySQL 8.0 has made tremendous progress, so please review this opinion!

The Takeaway:

Don’t fall for agressive NoSQL marketing. 10gen is an extremely well-funded company, even if MongoDB continues to disappoint, technically.

The same is true for Datastax.

Both companies are solving a problem that few people have. They’re selling us niche products as commodity, making us think that our real commodity databases (the RDBMS) no longer fulfil our needs. They are well-funded and have big marketing teams to throw around with blunt claims.

In the mean time, PostgreSQL just got even better, and you, as a reader of this blog / post, are about to bet on the winning team :-)

… just to cite Mark Madsen once more:

The Disclaimer:

This article has been quite strongly against MySQL. We don’t mean to talk badly about a database that perfectly fulfils its purpose, as this isn’t a black and white world. Heck, you can get happy with SQLite in some situations. MySQL, being the cheap and easy to use, easy to install commodity database. We just wanted to make you aware of the fact, that you’re expressly choosing the cheap, not-so-good database, rather than the cheap, awesome one.

9. Forgetting about UNDO / REDO logs

We have claimed that MERGE statements or bulk / batch updates are good. That’s correct, but nonetheless, you should be wary when updating huge data sets in transactional contexts. If your transaction “takes too long”, i.e. if you’re updating 10 million records at a time, you will run into two problems:

  • You increase the risk of race conditions, if another process is also writing to the same table. This may cause a rollback on their or on your transaction, possibly making you roll out the huge update again
  • You cause a lot of concurrency on your system, because every other transaction / session, that wants to see the data that you’re about to update, will have to temporarily roll back all of your updates first, before they reach the state on disk that was there before your huge update. That’s the price of ACID.

One way to work around this issue is to allow for other sessions to read uncommitted data.

Another way to work around this issue is to frequently commit your own work, e.g. after 1000 inserts / updates.

In any case, due to the CAP theorem, you will have to make a compromise. Frequent commits will produce the risk of an inconsistent database in the event of the multi-million update going wrong after 5 million (committed) records. A rollback would then mean to revert all database changes towards a backup.

The Cure:

There is no definitive cure to this issue. But beware that you are very very rarely in a situation where it is OK to simply update 10 million records of a live and online table outside of an actual scheduled maintenance window. The simplest acceptable workaround is indeed to commit your work after N inserts / updates.

The Takeaway:

By this time, NoSQL aficionados will claim (again due to excessive marketing by aforementioned companies) that NoSQL has solved this by dropping schemas and typesafety. “Don’t update, just add another property!” – they said.

But that’s not true!

First off, I can add columns to my database without any issue at all. An ALTER TABLE ADD statement is executed instantly on live databases. Filling the column with data doesn’t bother anyone either, because no one reads the column yet (remember, don’t SELECT * !). So adding columns in RDBMS is as cheap as adding JSON properties to a MongoDB document.

But what about altering columns? Removing them? Merging them?

It is simply not true that denormalisation takes you anywhere far. Denormalisation is always a short-term win for the developer. Hardly a long-term win for the operations teams. Having redundant data in your database for the sake of speeding up an ALTER TABLE statement is like sweeping dirt under the carpet.

Don’t believe the marketers. And while you’re at it, perform some doublethink and forget that we’re SQL tool vendors ourselves ;-) Here’s again the “correct” message:

jOOQ - The best way to write SQL in Java

10. Not using the BOOLEAN type correctly

This is not really a mistake per se. It’s just again something that hardly anyone knows. When the SQL:1999 standard introduced the new BOOLEAN data type, they really did it right. Because before, we already had something like booleans in SQL. We’ve had <search condition> in SQL-92, which are essentially predicates for use in WHERE, ON, and HAVING clauses, as well as in CASE expressions.

SQL:1999, however, simply defined the new <boolean value expression> as a regular <value expression>, and redefined the <search condition> as such:

<search condition> ::=
    <boolean value expression>

Done! Now, for most of us Java / Scala / etc. developers, this doesn’t seem like such an innovation. Heck it’s a boolean. Obviuosly it can be interchangeably used as predicate and as variable.

But in the mind-set of the keyword-heavy SQL folks who have taken inspiration from COBOL when designing the language, this was quite a step forward.

Now, what does this mean? This means that you can use any predicate also as a column! For instance:

SELECT a, b, c
FROM (
  SELECT EXISTS (SELECT ...) a,
         MY_COL IN (1, 2, 3) b,
         3 BETWEEN 4 AND 5   c
  FROM   MY_TABLE
) t
WHERE a AND b AND NOT(c)

This is a bit of a dummy query, agreed, but are you aware of how powerful this is?

Luckily, again, PostgreSQL fully supports this (unlike Oracle, which still doesn’t have any BOOLEAN data type in SQL).

The Cure:

Every now and then, using BOOLEAN types feels very right, so do it! You can transform boolean value expressions into predicates and predicates into boolean value expressions. They’re the same. This makes SQL all so powerful.

Conclusion

SQL has evolved steadily over the past years through great standards like SQL:1999, SQL:2003, SQL:2008 and now SQL:2011. It is the only surviving mainstream declarative language, now that XQuery can be considered pretty dead for the mainstream. It can be easily mixed with procedural languages, as PL/SQL and T-SQL (and other procedural dialects) have shown. It can be easily mixed with object-oriented or functional languages, as jOOQ has shown.

At Data Geekery, we believe that SQL is the best way to query data. You don’t agree with any of the above? That’s fine, you don’t have to. Sometimes, even we agree with Winston Churchill who is known to have said:

tweet thisSQL is the worst form of database querying, except for all the other forms.

But as Yakov Fain has recently put it:

You can run from SQL, but you can’t hide

So, let’s better get back to work and learn this beast! Thanks for reading.

MySQL Bad Idea #666

MySQL… We’ve blogged about MySQL before. Many times. We’ve shown bad ideas implemented in MySQL here:

But this beats everything. Check out this Stack Overflow question. It reads: “Why Oracle does not support ‘group by 1,2,3’?”. At first, I thought this user might have been confused because SQL allows for referencing columns by (1-based!) index in ORDER BY clauses:

SELECT first_name, last_name
FROM customers
ORDER BY 1, 2

The above is equivalent to ORDER BY first_name, last_name. The indexes 1, 2 refer to columns from the projection. This might be useful every now and then to avoid repeating complex column expressions, although it is probably a bit risky as you can change ordering semantics when adding a column to the SELECT clause.

But this user wanted to use the same syntax for the GROUP BY clause. And this actually works in MySQL! Check out the following query:

SELECT a, b
FROM (
  SELECT 'a' a, 'b' b, 'c' c UNION ALL
  SELECT 'a'  , 'c'  , 'c'   UNION ALL
  SELECT 'a'  , 'b'  , 'd'
) t
GROUP BY 1, 2
ORDER BY 2

See an example on SQLFiddle

The above yields…

| A | B |
|---|---|
| a | b |
| a | c |

But what would this even mean? According to our in-depth explanation of SQL clauses, the projection (SELECT clause) is logically evaluated after the GROUP BY clause. In other words, the columns defined in the SELECT clause are not yet in scope of the GROUP BY clause. Hence, the only reasonable semantics of column indexes would be the index from the table source t. But that’s not the case. Check out this alternative query:

SELECT a, b
FROM (
  SELECT 'a' a, 'b' b, 'c' c UNION ALL
  SELECT 'a'  , 'c'  , 'c'   UNION ALL
  SELECT 'a'  , 'b'  , 'd'
) t
GROUP BY 1, 2
ORDER BY 2

See an example on SQLFiddle

This now yields:

| B | C |
|---|---|
| b | c |
| c | c |
| b | d |

And it’s (probably?) the expected behaviour in MySQL as the documentation states:

Columns selected for output can be referred to in ORDER BY and GROUP BY clauses using column names, column aliases, or column positions. Column positions are integers and begin with 1

The documentation actually treats GROUP BY in a very similar fashion as ORDER BY. For instance, it is possile to specify the ordering direction using GROUP BY only:

MySQL extends the GROUP BY clause so that you can also specify ASC and DESC after columns named in the clause:

SELECT a, COUNT(b) 
FROM test_table GROUP BY a DESC;

While we cannot figure out a reasonable edge-case that breaks this feature, we still think that there is something fishy about it. The fact that the SELECT clause is logically evaluated after the table source (FROM, WHERE, GROUP BY, HAVING), allowing the GROUP BY clause to reference it seems to lead to a weird understanding of SQL.

On the other hand, SQL is a very verbose language with little support for declaring reusable objects, short of common table expressions and the WINDOW clause. It is actually a bit surprising that the SQL standards folks would support this WINDOW clause to declare reusable window frames before introducing much more usable “common column expressions”, e.g:

-- Common column/table expressions:
WITH x AS CASE t1.a 
          WHEN 1 THEN 'a'
          WHEN 2 THEN 'b'
                 ELSE 'c'
          END, 
     y AS SOME_FUNCTION(t2.a, t2.b)
SELECT x, NVL(y, x)
FROM t1 JOIN t2 ON t1.id = t2.id
GROUP BY x, y
ORDER BY x DESC, y ASC

With common column expressions, reusing column expressions is independent of the SELECT clause itself. In other words, you can reuse column expressions in JOIN clauses, WHERE clauses, GROUP BY clauses, HAVING clauses, etc. without having to actually SELECT them.

So, to be fair with MySQL, while this feature is a non-feature in its current form, it provides a workaround for SQL’s verbosity.

The Crystal Ball. Or, Oops, Michael Stonebraker did it Again

Michael Stonebraker’s opinions and claims are always refreshing to read. He’s done a lot for our industry and for how we do data processing. Some of his claims are certainly right as well. Here’s an interview with him, telling us about his 5 predictions on the future of databases.

Of course, him being a software vendor, many of his claims should be read with caution. Today, the most popular DBMS (relational or not) are still Oracle, MySQL, and SQL Server. Even his “popular” PostgreSQL is still a niche player, let alone the almost forgotten Ingres and the never really popular Vertica columnar “NewSQL” database. Obviously, we’re not saying they’re bad databases, but they’re certainly not very popular. The same goes with SAP. Their Sybase databases have been surpassed by SQL Server both in quality and in popularity 10 years ago, when Microsoft forked the Sybase code into T-SQL. We hardly believe that Oracle and Sybase will have the “final fight” for RDBMS supremacy.

But again. That’s Mike Stonebraker, the salesman as in “The Traditional RDBMS Wisdom is All Wrong”

Counting Distinct Records in SQL

The SQL language and its depths… Some of you readers might be aware of MySQL’s capability of counting distinct records through the COUNT() aggregate function. The MySQL documentation reads:

COUNT(DISTINCT expr,[expr...])

Returns a count of the number of rows with different non-NULL expr values.

In other words, you can count distinct first and last names very easily:

SELECT COUNT(DISTINCT FIRST_NAME, LAST_NAME)
FROM CUSTOMERS

That’s quite useful, but MySQL-specific (although HSQLDB also supports this particular MySQL syntax). Most other databases, however, do not offer such a feature, even if the SQL-99 standard has specified it long ago:

6.16  <set function specification>

<set function specification> ::=
    COUNT <left paren> <asterisk> <right paren>
  | <general set function>

<general set function> ::=
    <set function type>
        <left paren> [ <set quantifier> ] 
            <value expression> <right paren>

<set function type> ::=
    <computational operation>

<computational operation> ::=
    AVG | MAX | MIN | SUM
  | EVERY | ANY | SOME
  | COUNT

<set quantifier> ::= DISTINCT | ALL

The above was later on merged into 10.9 <aggregate function>, in SQL:2011, with lots of other aggregate function types and features. Now, let’s have a look at 6.23 <value expression>:

6.23 <value expression>

<value expression> ::=
<numeric value expression>
| <string value expression>
| <datetime value expression>
| <interval value expression>
| <boolean value expression>
| <user-defined type value expression>
| <row value expression>                <-- RVE!
| <reference value expression>
| <collection value expression>

Interestingly, you can put a row value expression in some of your aggregate functions. There are additional restrictions, as you cannot SUM() or AVG() records, for instance. But with COUNT() and COUNT(DISTINCT ...) this makes perfect sense. So, according to the SQL standard (and according to HSQLDB’s alternative, and PostgreSQL’s main implementation, which unfortunately isn’t really documented), the following would be the correct way to count distinct first and last names in SQL:

SELECT COUNT(DISTINCT (FIRST_NAME, LAST_NAME))
FROM CUSTOMERS

… which makes perfect sense. Watch out when reading this. This isn’t a DISTINCT() function! This is the DISTINCT keyword applied to a row value expression. See some previous blog posts about row value expressions here:

jOOQ standardises various SQL dialects and hides the above behind the DSL.countDistinct() method.

When you’re not using HSQLDB, MySQL, or PostgreSQL

In other databases, you might resort to some manual SQL transformation. Our example is easy to transform:

SELECT COUNT(*)
FROM (
  SELECT DISTINCT FIRST_NAME, LAST_NAME
  FROM CUSTOMERS
) t

But such query transformation can prove to be tricky, once you have GROUP BY clauses or other aggregate functions involved. Future versions of jOOQ might do that sort of transformation for you.