Why Your Boring Data Will Outlast Your Sexy New Technology

So you’re playing around with all those sexy new technologies, enjoying yourself, getting inspiration from state-of-the-art closure / lambda / monads and other concepts-du-jour

Now that I have your attention provoking a little anger / smirk / indifference, let’s think about the following. I’ve recently revisited a great article by Ken Downs written in 2010. There’s an excellent observation in the middle.

[…] in fact there are two things that are truly real for a developer:

  • The users, who create the paycheck, and
  • The data, which those users seemed to think was supposed to be correct 100% of the time.

He says this in the context of a sophisticated rant against ORMs. It can also be seen as a rant against any recent abstraction or database technology with a mixture of nostalgia and cynicism. But the essence of his article lies in another observation that I’ve made with so many software systems:

  • Developers tend to get bored with “legacy”
  • Systems tend to outlast their developers
  • Data tends to outlast the systems operating on it

Having said so, we can quickly understand why Java has become the new COBOL to some, who are bored with its non-sexiness (specifically in the JEE corner). Yet, according to TIOBE, Java and C are still the most widely used programming languages in the market and many systems are thus implemented in Java.

On the other hand, many inspiring new technologies have emerged in the last couple of years, also in the JVM field. A nice overview is given in ZeroTurnaround’s The Adventurous Developer’s Guide to JVM Languages. And all of that is fine to a certain extent. As Bruno Borges from Oracle recently and adequately put it:

anything not mainstream has more odds to be “sexy” [than JSF]

Now, let’s map this observation back to a subsequent section of Ken’s article:

[…] the application code suddenly becomes a go-between, the necessary appliance that gets data from the db to the user […] and takes instructions back from the user and puts them in the database […]. No matter how beautiful the code was, the user would only ever see the screen […] and you only heard about it if it was wrong. Nobody cares about my code, nobody cares about yours.

Think about an E-Banking system. None of your users really cares about the code you wrote to get it running. What they care about is their data (i.e. their money) and the fact that it is correct. This is true for many many systems where the business logic and UI layers can be easily replaced with fancy new technology, whereas the data stays around.

In other words, you are free to choose whatever sexy new UI technology you like as long as it helps your users get access to their data.

So what about sexy new database technology?

Now, that’s an entirely different story, compared to sexy new UI technology.

You might be considering some NoSQL-solution-du-jour to “persist” your data, because it’s so easy and because it costs so much less. Granted, the cost factor may seem very tempting at first. But have you considered the fact that:

  • Systems tend to outlast their developers
  • Data tends to outlast the systems operating on it

Once your data goes into that NoSQL store, it may stay there much longer than you had wanted it to. Your developers and architects (who originally chose this particular NoSQL solution) may have left long ago. Parts of your system may have been replaced, too, because now you’re doing everything in HTML5. JavaScript is the new UI technology.

And all this time, you have “persisted” UI / user / domain model data in your database, from various systems, from various developers, through various programming paradigms. And then, you realise:

We’re not saying that there aren’t some use-cases where NoSQL databases really provide you with a better solution than the “legacy” alternatives. Most specifically, graph databases solve a problem that no RDBMS has really solved well, yet.

But consider one thing. You will have to migrate your data. Time and again. You will have to archive it. And maybe, migrate the archive. And maybe provide reports of the archive. And provide correct reports (which means: ACID!) And be transactional (which means: ACID!) And all the other things that people do with data.

In fact, your system will grow like any other system ever did before and it will have high demands from your database. While some NoSQL databases have started to get some decent traction, in a way that it is safe to say their vendors might still be around in 5-10 years, when the systems will have been replaced by developers who have replaced other developers.

In other words, there is one more bullet to this list:

  • Developers tend to get bored with “legacy”
  • Systems tend to outlast their developers
  • Data tends to outlast the systems operating on it
  • Data might tend to outlast the vendors providing tools for operating the data

Beware of the fact that data will probably outlast your sexy new technology. So, choose wisely when thinking about sexy new technology to operate your data.

MIT Prof. Michael Stonebraker: “The Traditional RDBMS Wisdom is All Wrong”

A very interesting talk about the future of DBMS was recently given at EPFL by MIT Professor and VoltDB Co-founder and CTO Michael Stonebraker, who also gave us Ingres and Postgres. In a bit less than one hour, he explains his views with respect to the three main pillars of database management systems:

  • OLAP / Data warehouses
  • OLTP
  • Other types of data stores

As a NewSQL vendor also actively involved with H-Store, he is of course heavily yet refreshingly biased towards traditional RDBMS storage models being obsolete (an interesting fact is that Oracle Labs representative Eric Sedlar also attended the talk. One might think that the talk was a slighly FUD-dy move against a VoltDB competitor). Unlike what has come to be known as the NoSQL movement, NewSQL relies on similar relational theory / set theory as “traditional SQL”, including support for ACID and structured data.

His claims mainly include that:

  • OLAP / data warehouses will migrate to column-based data stores within 10 years. The traditional row-based data storage approach is dead, as row-based storage will never match column-based storage’s performance increase by factor 100x.
  • For OLTP, the race for the best data storage designs has not yet been decided, but there is a clear indication of classic models being “plain wrong” (according to Stonebraker), as only 4% of wall-clock time is spent on useful data processing, while the rest is occupied with buffer pools, locking, latching, recovery.
Image from Stonebraker's presentation depicting the amount of "useful" work performed by any RDBMS

Image from Stonebraker’s presentation depicting the amount of “useful” work performed by any RDBMS

I specifically recommend the OLTP part of his talk, as it shows how various new techniques could heavily increase performance of traditional RDBMS already today:

  • Most OLTP systems can afford to buy the amount of memory needed to keep data off the disk. This will remove the need for a buffer pool.
  • Single-threading would get rid of the latching overhead. H-Store and VoltDB statically divide shared memory among the cores, for instance. This is very important as latching gets worse and worse with the increasing amount of cores we have, today.
  • Dynamic locking is not really implemented in any popular RDBMS, but the market is uncertain, which workaround best implements concurrency control. In his opinion, MVCC is not going to do the trick in the long run.
  • ACIDness is something that even Jeff Dean from Google admits to miss, once it’s gone, as eventual consistency does not really keep its promise.
  • In a cluster, active-active consistency management can increase log throughput by factor 3x, compared to active-passive logging. (active-active = transaction is run on every node, active-passive = transaction is run only on the master node, the log is sent to all slave nodes)
  • And also, very importantly, anti-caching is a good technique when the in-memory format matches the disk format, as traditional RDBMS spend a substantial amount of time converting disk data formats (blocks, sectors) into memory formats (actual data).

The essence of Stonebraker’s talk is that the “elephants” who currently dominate the market are too slow to react to all the NewSQL vendors’ innovations. It is a very exciting time for a database professional (some refer to them as data geeks) to enter the market and publish new findings.

Another interesting thing to note is that SQL (call it NewSQL, OldSQL) will remain a dominant language for querying DBMS, both for column-stores as for row-stores. This is a strong statement for tools like jOOQ, which embrace SQL as a first-class citizen among programming languages.

See the complete talk by Michael Stonebraker here:

See Stonebraker's Talk here: http://slideshot.epfl.ch/play/suri_stonebraker

See Stonebraker’s Talk here: http://slideshot.epfl.ch/play/suri_stonebraker

Further reading: