The Crystal Ball. Or, Oops, Michael Stonebraker did it Again

Michael Stonebraker’s opinions and claims are always refreshing to read. He’s done a lot for our industry and for how we do data processing. Some of his claims are certainly right as well. Here’s an interview with him, telling us about his 5 predictions on the future of databases.

Of course, him being a software vendor, many of his claims should be read with caution. Today, the most popular DBMS (relational or not) are still Oracle, MySQL, and SQL Server. Even his “popular” PostgreSQL is still a niche player, let alone the almost forgotten Ingres and the never really popular Vertica columnar “NewSQL” database. Obviously, we’re not saying they’re bad databases, but they’re certainly not very popular. The same goes with SAP. Their Sybase databases have been surpassed by SQL Server both in quality and in popularity 10 years ago, when Microsoft forked the Sybase code into T-SQL. We hardly believe that Oracle and Sybase will have the “final fight” for RDBMS supremacy.

But again. That’s Mike Stonebraker, the salesman as in “The Traditional RDBMS Wisdom is All Wrong”

Auto-Creation of Indexes in RDBMS

[…] generally speaking, I’m also surprised to see that in 2013 we’re creating our indexes manually.

Interesting thought! Has this thought ever occurred to you?

How this comment came about

Hackernews is very predictable. Our latest pro-SQL marketing campaign for jOOQ got quite a bit of traction as expected. It is easy to trigger love and hate for NoSQL databases with a little bit of humour, such as with Mark Madsen’s “history of databases in no-tation”.

A much more interesting and more serious blog post is Doug Turnbull’s “Codd’s Relational Vision – Has NoSQL Come Full Circle?”, which we are going to blog about soon, in a separate post. We’ve also put the latter on Hackernews and on Reddit, both of which generated tremendous traction for the subject. Comparing the current state of “NoSQL” with pre-Codd, pre-relational, pre-SQL times is clever and matches Michael Stonebraker’s and Joseph M. Hellerstein’s observations in “What Goes Around Comes Around”.

NoSQL is a movement that emerged out of necessity, when SQL databases did not evolve fast enough to keep up with what keen observers and Gartner now call “Webscale”, a new buzzword to name old things. But as history has shown, the old elephants can be taught new tricks, and SQL databases will eventually catch up.

Auto-creation of indexes in RDBMS

In the middle of the above Hackernews discussion, MehdiEG made this interesting observation about creating indexes manually being tedious. Indeed, why do we have to maintain all of our indexes manually? While platforms like Use-The-Index-Luke.com profit from teaching people how to do proper indexing, I wonder if a highly sophisticated database couldn’t gather statistics about a productive system and then generate suggestions for index additions / removals. Even more so, if the database is “absolutely sure”, it could also create/drop or at least activate/deactivate relevant indexes.

What does “absolutely sure” mean?

The Oracle database for instance, is already quite good at gathering relevant statistics and giving DBA hints about potentially effective new indexes, as it can simulate execution plans in case indexes were added. Some more information can be seen in this Stack Overflow question.

But wouldn’t it be great if Oracle (or SQL Server, DB2, any other database) had an auto-index-creation feature? On a productive system, the database could gather statistics for the longest-running queries, analyse their execution plans, simulate alternative execution plans in case potentially useful indexes were added to improve SELECT statements, or removed to improve INSERT, UPDATE, DELETE, MERGE statements. This wouldn’t be a simple task, as all available (or at least the 100 most executed) execution plans would have to be re-calculated to see how the newly added or removed index would impact the productive system.

There are a couple of things to note here:

  1. Fine-tuning indexing is easiest on a productive system. If you’re tuning your development environment, you will get most of the cases right. But only the productive system will show all those weird edge-cases that you simply cannot foresee
  2. Analysing the productive system is hard and is usually performed by the devops team or the DBA team. They’re often not the same people as the ones who developed the application / database. Since they often cannot access the DML or DDL of the application, it’s always good if they have some automatic tuning features such as the existing cost-based optimiser
  3. Blindly adding indexes without measuring is bad practice. If you know that a table is mostly-read-only, then you’re mostly-on-the-safe-side. But what happens if a table is often bulk updated? If a batch job creates large transactions with long UNDO / REDO logs? Each unnecessary index will only slow down the batch job, increasing the risk of race conditions, rollbacks or even deadlocks.

Automatic index creation or deletion could greatly improve the productive experience with commercial databases that already have many very useful tuning features. Let us hope that Oracle, IBM, Microsoft will hear us and build such a feature into their future databases!

A History of Databases in “No-tation”

We’re heading towards very exciting times in the field of databases!

At Topconf in beautiful Tallin, Estonia, Nikita Ivanov (founder and CEO of GridGain Systems) was talking about how the ever crumbling price of DRAM gets in-memory computing and thus in-memory databases within the reach of being affordable by even small and medium enterprises. Nikita claims that 99% of all companies have less than 10TB of transactional data. While this has been completely impossible ten years ago, nowadays, you can store that much data in memory for less than 15000 USD! Compared to the Oracle license that you might buy with the server, that’s almost nothing. Imagine that you can scale up several orders of magnitude without changing your “legacy” architecture. Without switching to something like NoSQL.

A day before, Christoph Engelbert presented Hazelcast, a competitor product of GridGain Systems. Unfortunately, I couldn’t attend his talk but I was lucky enough to spend a couple of hours with Christoph on the flight back home. He’s a very interesting and fun guy to talk to and gave me quite some insight about what his company is evangelising in the context of “Big Data”. Essentially, modern data processing involves moving computation towards data, instead of moving data towards computation. While Hazelcast solves this through their own storage mechanisms, this paradigm has been equally true for “legacy” OLAP systems based on relational databases. Using PL/SQL, or T-SQL, or any other procedural language, you can execute complex algorithms right where the data is: In your database.

For those of you frequently following my blog, you will not be surprised that I am very thrilled about the above evolutions in data computing. The ever increasing consternation with ORMs and the big amount of confusion about the future of “NoSQL” have lead to a recent revival of SQL as a language.

Back to the roots.

This seems to have culminated at the recent O’Reilly Strata Conference, where Mark Madsen, a popular researcher and analyst was walking around with a geeky T-Shirt showing the History of NoSQL. I’ve had a brief chat with him on Twitter. He might be selling this T-Shirt, if it goes viral.

History of NoSQL by Mark Madsen. Picture by Ed Dumbill
History of NoSQL by Mark Madsen. Picture published by Edd Dumbill

So apparently, SQL is back, and strong as ever!

CUBRID: A Lesser-Known Korean OSS Database Gem

While RedHat and Google have been dumping MySQL for MariaDB, there’s actually a third, much lesser-known option for MySQL-oriented database folks in the RDBMS market: CUBRID.

One of CUBRID’s main goals is also to lure MySQL users away from Oracle by offering many equivalent syntax elements that are available in either the MySQL or Oracle databases. But this gem from the far east, which is backed by the Korean Naver Corp. offers a lot more. It is one of the few Open Source RDBMS that

  • is also an ORDBMS (only PostgreSQL can compete with this)
  • implements the SQL standard MERGE statement (HSQLDB and Firebird do as well, Derby is currently implementing it)
  • implements a wide variety of window functions (only PostgreSQL competes, again)
  • implements a wide variety of MySQL’s proprietary SQL extensions
  • implements Oracle’s awesome CONNECT BY syntax (no other database does that)

Popularity-wise, there’s surely much to catch up with, comparing with MySQL and MariaDB:

Reproduced with permission by DB-Engines.com
Reproduced with permission by DB-Engines.com

But the next time you’re looking at a replacement for MySQL, why not also consider CUBRID?

The 10 Most Popular DB Engines (SQL and NoSQL)

How to objectively measure the popularity of a DB engine? Good question! And there’s an Austrian company (Solid IT) who claims to have the answer. The company focuses on “Big Data und NoSQL“, but this focus does not seem to have biased the result of the measurement. Among the top 10 database engines, there is only MongoDB, which is not an RDBMS. And it’s astonishing just how popular MongoDB seems to be (although, they must be doing something right)!

Reproduced with permission of DB-Engines.com
Reproduced with permission of DB-Engines.com

Now, I’m not surprised by the top 3. I am definitely surprised by the fact that PostgreSQL and SQLite are not more popular. I am also surprised, that there aren’t more “wide-column stores” among the top 10. Maybe, Michael Stonebraker has to review his claims about the traditional RDBMS wisdom being all wrong?

And what about the other databases supported by jOOQ? Where are the Java databases? Here’s a condensed view of the ranking, consisting only of the 15 databases currently supported by jOOQ 3.1:

dbms-ranking-jooq
Reproduced with permission of DB-Engines.com

It turns out that Java databases (Derby, H2, HyperSQL) are not so popular compared to all the others. It also turns out that MariaDB still has a lot of grounds to gain, compared to MySQL.

The ranking considers a lot of data from various somewhat authoritative sources as is explained here. These include:

  • Number of mentions of the system on websites. Measured through search engine results.
  • General interest in the system. Measured through Google Trends.
  • Frequency of technical discussions about the system. Measured through Stack Overflow and similar.
  • Number of job offers, in which the system is mentioned. Measured through Indeed and similar.
  • Number of profiles in professional networks, in which the system is mentioned. Measured through LinkedIn.

This ranking is certainly something to keep an eye on!

The Premature Return to SQL

In online communities, the NoSQL topic (much like the ORM topic) is a guarantee to stir emotions. Many emotions are stirred by evangelists on either side for ideological or marketing reasons. Here’s an interesting post by Alex Popescu, a passionate NoSQL and polyglot persistence evangelist, claiming that the recent trend to return to SQL is premature:

This post triggered an equally interesting reaction by Markus Winand, author of SQL Performance Explained:

It’s really interesting, how often people think in terms of “trends” that introduce novel paradigms, obsoleting all we had before. I believe that these are not trends, but experiments. I’ve blogged before that you should be wary when NoSQL vendors promise you to put an end to DBAs. Very few “new” solutions or paradigms have ever completely replaced or substituted their predecessors. Or, in Isaac Newton’s words:

If I have seen further it is by standing on the shoulders of giants.

We’re not “returning to SQL”, nor is such a return “premature”. Yes, there are some innovative thinkers who are teaching an old elephant new tricks, and that’s good. It’s also good that such innovative thinkers get a piece of the cake and make money with their inventions.

It is also true that big database vendors are not very innovative. But they don’t have to be. Their asset is reliability, predictability, stability. Oracle SQL will still support all its age-old legacy in 15 years, which makes it a safe choice for banks and insurance companies. If a NoSQL or NewSQL feature proves to be innovative and reliable, Oracle et al. will most certainly pick it up and integrate it into SQL. Clever NoSQL vendors thus already prepare for their exits.

This happens outside the world of databases, of course:

  • Scala is innovative and contributes to Java (Generics in Java 5, Lambdas in Java 8).
  • Open Source developers (e.g. those of JAX-RS) are innovative and contribute to JEE.
  • PostgreSQL is innovative and contributes to other SQL dialects and eventually the SQL standard.
  • Instagram is innovative and contributed to Facebook (“shit happens!”).
  • jOOQ is innovative and contributes to JDBC and JPA (eventually, hopefully).

SQL is a safe bet and is here to stay.

Why do We Need RDBMS?

There was this recent Quora question about why we need RDBMS:

Why not just use text files? What can RDBMS do that a simple text file cannot? Or, why not use several different text files to represent different tables?

Heh. Let’s challenge that through a witty comparison (also given as an answer to the above question)…

Short story (not to be taken too seriously):

Some people just put their keys, wallets, make-up, letters, pencils, more make-up, change, and all the other stuff in a huge purse, spending hours to find stuff when we need to catch the train. Stuff, which they might have actually put in that other purse. Let’s call this purse the text file

I like to structure my stuff. My index says: Wallet in the back pocket, key in the front right pocket, mobile phone in the front left pocket, glasses on my nose. Let’s call this structure the RDBMS.

;-)

Long story:

This Quora question is really interesting in this context: Why did relational database technologies gain traction? What were the historical competing technologies?

Essentially, there had been a single most important driving force at the time, pushing RDBMS way ahead of all alternative storage models: Relational algebra itself, designed mostly by Edgar F. Codd, a brilliant computer scientist of his times.

Not only did popular relational database management systems take care of actually managing data, data structures, physical models, transactions, query models, a powerful query language (implemented as SQLjOOQJPQLLINQ to SQLand various other dialects / APIs), referential intergrity, constraint management etc, etc., they were also based on a very very powerful conceptual model and implementation rules (Codd’s 12 rules). The relational data model can easily model almost all business rules.

So, of course you can write your own data management system. Or you use a proven one that does millions of things for you according to very proven rules conceived by very bright people that got very rich with their systems.

Column Stores: Teaching an Old Elephant New Tricks

Prof. Michael Stonebraker is a controversial visionary, who is known for nothing less than Ingres, Postgres, Vertica, Streambase, Illustra, VoltDB, SciDB, besides being a renowned MIT professor. My recent blog post about Stonebraker’s talk at the EPFL (host university to Prof. Martin Odersky, creator of the Scala Language and Co-Founder of Typesafe) has triggered a very interesting discussion on reddit.

While Stonebraker is very sure about his obviously biased claims that “The Traditional RDBMS Wisdom is All Wrong”, the bottom line of the reddit discussion included:

Interesting insight on SQL Server’s enhancement can be seen in this blog post by Microsoft’s Nicolas Bruno, who challenges the fact that column stores cannot be implemented by “traditional RDBMS”. As Nicolas Bruno stated, an “Old Elephant” can be taught new tricks. “Traditional RDBMS” have proven to adapt to long-term trends in the database industry. Their success isn’t based around the fact that they are mainly fast, or particularly well-designed to respond to niche problem domains. Their success is mainly based on the fact that they are designed according to Codd’s 12 Rules, and thus to be extremely flexible in how they separate data interfacing (SQL) from data storage.

A lot of additional insight and ongoing links can be found in these blog posts by Daniel Lemire, where he had challenged Stonebraker’s similar claims already four years ago:

MIT Prof. Michael Stonebraker: “The Traditional RDBMS Wisdom is All Wrong”

A very interesting talk about the future of DBMS was recently given at EPFL by MIT Professor and VoltDB Co-founder and CTO Michael Stonebraker, who also gave us Ingres and Postgres. In a bit less than one hour, he explains his views with respect to the three main pillars of database management systems:

  • OLAP / Data warehouses
  • OLTP
  • Other types of data stores

As a NewSQL vendor also actively involved with H-Store, he is of course heavily yet refreshingly biased towards traditional RDBMS storage models being obsolete (an interesting fact is that Oracle Labs representative Eric Sedlar also attended the talk. One might think that the talk was a slighly FUD-dy move against a VoltDB competitor). Unlike what has come to be known as the NoSQL movement, NewSQL relies on similar relational theory / set theory as “traditional SQL”, including support for ACID and structured data.

His claims mainly include that:

  • OLAP / data warehouses will migrate to column-based data stores within 10 years. The traditional row-based data storage approach is dead, as row-based storage will never match column-based storage’s performance increase by factor 100x.
  • For OLTP, the race for the best data storage designs has not yet been decided, but there is a clear indication of classic models being “plain wrong” (according to Stonebraker), as only 4% of wall-clock time is spent on useful data processing, while the rest is occupied with buffer pools, locking, latching, recovery.
Image from Stonebraker's presentation depicting the amount of "useful" work performed by any RDBMS
Image from Stonebraker’s presentation depicting the amount of “useful” work performed by any RDBMS

I specifically recommend the OLTP part of his talk, as it shows how various new techniques could heavily increase performance of traditional RDBMS already today:

  • Most OLTP systems can afford to buy the amount of memory needed to keep data off the disk. This will remove the need for a buffer pool.
  • Single-threading would get rid of the latching overhead. H-Store and VoltDB statically divide shared memory among the cores, for instance. This is very important as latching gets worse and worse with the increasing amount of cores we have, today.
  • Dynamic locking is not really implemented in any popular RDBMS, but the market is uncertain, which workaround best implements concurrency control. In his opinion, MVCC is not going to do the trick in the long run.
  • ACIDness is something that even Jeff Dean from Google admits to miss, once it’s gone, as eventual consistency does not really keep its promise.
  • In a cluster, active-active consistency management can increase log throughput by factor 3x, compared to active-passive logging. (active-active = transaction is run on every node, active-passive = transaction is run only on the master node, the log is sent to all slave nodes)
  • And also, very importantly, anti-caching is a good technique when the in-memory format matches the disk format, as traditional RDBMS spend a substantial amount of time converting disk data formats (blocks, sectors) into memory formats (actual data).

The essence of Stonebraker’s talk is that the “elephants” who currently dominate the market are too slow to react to all the NewSQL vendors’ innovations. It is a very exciting time for a database professional (some refer to them as data geeks) to enter the market and publish new findings.

Another interesting thing to note is that SQL (call it NewSQL, OldSQL) will remain a dominant language for querying DBMS, both for column-stores as for row-stores. This is a strong statement for tools like jOOQ, which embrace SQL as a first-class citizen among programming languages.

See the complete talk by Michael Stonebraker here:

See Stonebraker's Talk here: http://slideshot.epfl.ch/play/suri_stonebraker
See Stonebraker’s Talk here: http://slideshot.epfl.ch/play/suri_stonebraker

Further reading:

A map of all those new NoSQL, NewSQL, post-SQL, structured, unstructured database options that came out over the past year

So you want to go with the flow and implement your next application on top of some NoSQL, NotJustSQL, NewSQL, AlmostSQL, SQL++, NextGenSQL, and what not, just to be sure not to miss out on some of the latest developments in the data business? Here’s a little map to guide you through the jungle of choices:

http://gigaom.com/cloud/confused-by-the-glut-of-new-databases-heres-a-map-for-you/

… or you just stick with the relational data model and some decent RDBMS like Oracle, SQL Server, or Postgres and wait until things settle a little bit :-)