A Bit of SQL History

SQL has been around for a while. Just recently, I’ve read this refreshing article about Codd’s Relational Vision – Has NoSQL Come Full Circle? SQL is one of those awesome technologies that was invented almost two generations ago, and we’re still using it in production and for new projects. It is both legacy and state-of-the-art.

In the early 70s, Ingres (which still exists today) was built at the Berkely University of California. Parts of Ingres still go strong in today’s RDBMS market under the name of PostgreSQL, and many other databases. Michael Stonebraker has been involved with these early dinosaurs, and he’s currently promoting NewSQL storage, which isn’t listed in the above chart, unfortunately.

What’s also not listed, unfortunately, are all those various NoSQL storage systems, which will eventually merge in a greater timeline of DBMS, should they survive the upcoming second renaissance of Codd’s Vision. Love it or hate it, but SQL is a piece of software category that we will deal with for another generation of software engineers!

The Crystal Ball. Or, Oops, Michael Stonebraker did it Again

Michael Stonebraker’s opinions and claims are always refreshing to read. He’s done a lot for our industry and for how we do data processing. Some of his claims are certainly right as well. Here’s an interview with him, telling us about his 5 predictions on the future of databases.

Of course, him being a software vendor, many of his claims should be read with caution. Today, the most popular DBMS (relational or not) are still Oracle, MySQL, and SQL Server. Even his “popular” PostgreSQL is still a niche player, let alone the almost forgotten Ingres and the never really popular Vertica columnar “NewSQL” database. Obviously, we’re not saying they’re bad databases, but they’re certainly not very popular. The same goes with SAP. Their Sybase databases have been surpassed by SQL Server both in quality and in popularity 10 years ago, when Microsoft forked the Sybase code into T-SQL. We hardly believe that Oracle and Sybase will have the “final fight” for RDBMS supremacy.

But again. That’s Mike Stonebraker, the salesman as in “The Traditional RDBMS Wisdom is All Wrong”

Auto-Creation of Indexes in RDBMS

[…] generally speaking, I’m also surprised to see that in 2013 we’re creating our indexes manually.

Interesting thought! Has this thought ever occurred to you?

How this comment came about

Hackernews is very predictable. Our latest pro-SQL marketing campaign for jOOQ got quite a bit of traction as expected. It is easy to trigger love and hate for NoSQL databases with a little bit of humour, such as with Mark Madsen’s “history of databases in no-tation”.

A much more interesting and more serious blog post is Doug Turnbull’s “Codd’s Relational Vision – Has NoSQL Come Full Circle?”, which we are going to blog about soon, in a separate post. We’ve also put the latter on Hackernews and on Reddit, both of which generated tremendous traction for the subject. Comparing the current state of “NoSQL” with pre-Codd, pre-relational, pre-SQL times is clever and matches Michael Stonebraker’s and Joseph M. Hellerstein’s observations in “What Goes Around Comes Around”.

NoSQL is a movement that emerged out of necessity, when SQL databases did not evolve fast enough to keep up with what keen observers and Gartner now call “Webscale”, a new buzzword to name old things. But as history has shown, the old elephants can be taught new tricks, and SQL databases will eventually catch up.

Auto-creation of indexes in RDBMS

In the middle of the above Hackernews discussion, MehdiEG made this interesting observation about creating indexes manually being tedious. Indeed, why do we have to maintain all of our indexes manually? While platforms like Use-The-Index-Luke.com profit from teaching people how to do proper indexing, I wonder if a highly sophisticated database couldn’t gather statistics about a productive system and then generate suggestions for index additions / removals. Even more so, if the database is “absolutely sure”, it could also create/drop or at least activate/deactivate relevant indexes.

What does “absolutely sure” mean?

The Oracle database for instance, is already quite good at gathering relevant statistics and giving DBA hints about potentially effective new indexes, as it can simulate execution plans in case indexes were added. Some more information can be seen in this Stack Overflow question.

But wouldn’t it be great if Oracle (or SQL Server, DB2, any other database) had an auto-index-creation feature? On a productive system, the database could gather statistics for the longest-running queries, analyse their execution plans, simulate alternative execution plans in case potentially useful indexes were added to improve SELECT statements, or removed to improve INSERT, UPDATE, DELETE, MERGE statements. This wouldn’t be a simple task, as all available (or at least the 100 most executed) execution plans would have to be re-calculated to see how the newly added or removed index would impact the productive system.

There are a couple of things to note here:

  1. Fine-tuning indexing is easiest on a productive system. If you’re tuning your development environment, you will get most of the cases right. But only the productive system will show all those weird edge-cases that you simply cannot foresee
  2. Analysing the productive system is hard and is usually performed by the devops team or the DBA team. They’re often not the same people as the ones who developed the application / database. Since they often cannot access the DML or DDL of the application, it’s always good if they have some automatic tuning features such as the existing cost-based optimiser
  3. Blindly adding indexes without measuring is bad practice. If you know that a table is mostly-read-only, then you’re mostly-on-the-safe-side. But what happens if a table is often bulk updated? If a batch job creates large transactions with long UNDO / REDO logs? Each unnecessary index will only slow down the batch job, increasing the risk of race conditions, rollbacks or even deadlocks.

Automatic index creation or deletion could greatly improve the productive experience with commercial databases that already have many very useful tuning features. Let us hope that Oracle, IBM, Microsoft will hear us and build such a feature into their future databases!

jOOQ Newsletter September 2013

Subscribe to this newsletter here

SQL Popularity and Controversy

Together, the above articles have have reached out to more than 200’000 readers on the jOOQ Blog and through our syndication partners in only one month.

This has been topped by two more, very interesting articles about Prof. Michael Stonebraker’s recent claims. In “The Traditional RDBMS Wisdom is All Wrong”, Stonebraker, who has given us nothing less than Ingres, Postgres, Vertica, Streambase, Illustra, VoltDB, SciDB shows that he is also well-known for his stirring of controversy.

Read also the follow-up article Teaching an Old Elephant New Tricks, where Stonebraker’s claims are addressed by Oracle and SQL Server.

With Stonebraker’s (and others’) efforts to bring SQL and “NewSQL” back to the market, SQL’s popularity is bound to be on the rise again. Few other languages stir so many emotions when it comes to commenting on tutorials like 10 Easy Steps to a Complete Understanding of SQL, which triggered lots of discussion both on reddit, and on hackernews.

jOOQ Public Talks and Trainings

jOOQ responds to SQL’s popularity by embracing SQL as a first-class citizen into your stack. Curious about how jOOQ works, or how it could work in your organisation? Then join any of the upcoming events:

  • The jOOQ training sessions at the /ch/open workshop days in Zurich, Switzerland (in German)
  • The jOOQ introductory session at the JUGH in Kassel, Germany (in German)

Feel free to contact us at contact@datageekery.com if you’re interested in a Training session close to you.

Column Stores: Teaching an Old Elephant New Tricks

Prof. Michael Stonebraker is a controversial visionary, who is known for nothing less than Ingres, Postgres, Vertica, Streambase, Illustra, VoltDB, SciDB, besides being a renowned MIT professor. My recent blog post about Stonebraker’s talk at the EPFL (host university to Prof. Martin Odersky, creator of the Scala Language and Co-Founder of Typesafe) has triggered a very interesting discussion on reddit.

While Stonebraker is very sure about his obviously biased claims that “The Traditional RDBMS Wisdom is All Wrong”, the bottom line of the reddit discussion included:

Interesting insight on SQL Server’s enhancement can be seen in this blog post by Microsoft’s Nicolas Bruno, who challenges the fact that column stores cannot be implemented by “traditional RDBMS”. As Nicolas Bruno stated, an “Old Elephant” can be taught new tricks. “Traditional RDBMS” have proven to adapt to long-term trends in the database industry. Their success isn’t based around the fact that they are mainly fast, or particularly well-designed to respond to niche problem domains. Their success is mainly based on the fact that they are designed according to Codd’s 12 Rules, and thus to be extremely flexible in how they separate data interfacing (SQL) from data storage.

A lot of additional insight and ongoing links can be found in these blog posts by Daniel Lemire, where he had challenged Stonebraker’s similar claims already four years ago:

MIT Prof. Michael Stonebraker: “The Traditional RDBMS Wisdom is All Wrong”

A very interesting talk about the future of DBMS was recently given at EPFL by MIT Professor and VoltDB Co-founder and CTO Michael Stonebraker, who also gave us Ingres and Postgres. In a bit less than one hour, he explains his views with respect to the three main pillars of database management systems:

  • OLAP / Data warehouses
  • OLTP
  • Other types of data stores

As a NewSQL vendor also actively involved with H-Store, he is of course heavily yet refreshingly biased towards traditional RDBMS storage models being obsolete (an interesting fact is that Oracle Labs representative Eric Sedlar also attended the talk. One might think that the talk was a slighly FUD-dy move against a VoltDB competitor). Unlike what has come to be known as the NoSQL movement, NewSQL relies on similar relational theory / set theory as “traditional SQL”, including support for ACID and structured data.

His claims mainly include that:

  • OLAP / data warehouses will migrate to column-based data stores within 10 years. The traditional row-based data storage approach is dead, as row-based storage will never match column-based storage’s performance increase by factor 100x.
  • For OLTP, the race for the best data storage designs has not yet been decided, but there is a clear indication of classic models being “plain wrong” (according to Stonebraker), as only 4% of wall-clock time is spent on useful data processing, while the rest is occupied with buffer pools, locking, latching, recovery.
Image from Stonebraker's presentation depicting the amount of "useful" work performed by any RDBMS

Image from Stonebraker’s presentation depicting the amount of “useful” work performed by any RDBMS

I specifically recommend the OLTP part of his talk, as it shows how various new techniques could heavily increase performance of traditional RDBMS already today:

  • Most OLTP systems can afford to buy the amount of memory needed to keep data off the disk. This will remove the need for a buffer pool.
  • Single-threading would get rid of the latching overhead. H-Store and VoltDB statically divide shared memory among the cores, for instance. This is very important as latching gets worse and worse with the increasing amount of cores we have, today.
  • Dynamic locking is not really implemented in any popular RDBMS, but the market is uncertain, which workaround best implements concurrency control. In his opinion, MVCC is not going to do the trick in the long run.
  • ACIDness is something that even Jeff Dean from Google admits to miss, once it’s gone, as eventual consistency does not really keep its promise.
  • In a cluster, active-active consistency management can increase log throughput by factor 3x, compared to active-passive logging. (active-active = transaction is run on every node, active-passive = transaction is run only on the master node, the log is sent to all slave nodes)
  • And also, very importantly, anti-caching is a good technique when the in-memory format matches the disk format, as traditional RDBMS spend a substantial amount of time converting disk data formats (blocks, sectors) into memory formats (actual data).

The essence of Stonebraker’s talk is that the “elephants” who currently dominate the market are too slow to react to all the NewSQL vendors’ innovations. It is a very exciting time for a database professional (some refer to them as data geeks) to enter the market and publish new findings.

Another interesting thing to note is that SQL (call it NewSQL, OldSQL) will remain a dominant language for querying DBMS, both for column-stores as for row-stores. This is a strong statement for tools like jOOQ, which embrace SQL as a first-class citizen among programming languages.

See the complete talk by Michael Stonebraker here:

See Stonebraker's Talk here: http://slideshot.epfl.ch/play/suri_stonebraker

See Stonebraker’s Talk here: http://slideshot.epfl.ch/play/suri_stonebraker

Further reading: