A History of Databases in “No-tation”

We’re heading towards very exciting times in the field of databases!

At Topconf in beautiful Tallinn, Estonia, Nikita Ivanov (founder and CEO of GridGain Systems) talked about how the ever-falling price of DRAM is bringing in-memory computing, and thus in-memory databases, within reach of even small and medium enterprises. Nikita claims that 99% of all companies have less than 10TB of transactional data. While this was completely impossible ten years ago, you can nowadays store that much data in memory for less than 15000 USD! Compared to the Oracle license that you might buy with the server, that’s almost nothing. Imagine that you can scale up by several orders of magnitude without changing your “legacy” architecture, and without switching to something like NoSQL.

A day before, Christoph Engelbert presented Hazelcast, a competitor to GridGain Systems’ product. Unfortunately, I couldn’t attend his talk but I was lucky enough to spend a couple of hours with Christoph on the flight back home. He’s a very interesting and fun guy to talk to and gave me quite some insight into what his company is evangelising in the context of “Big Data”. Essentially, modern data processing involves moving computation towards data, instead of moving data towards computation. While Hazelcast solves this through their own storage mechanisms, this paradigm has been equally true for “legacy” OLAP systems based on relational databases. Using PL/SQL, or T-SQL, or any other procedural language, you can execute complex algorithms right where the data is: in your database.
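To make the idea of moving computation towards data a bit more concrete, here is a minimal sketch in Python. It uses the standard-library sqlite3 module as a stand-in for a real database server, and the orders table and function names are invented for illustration. Both functions compute the same total, but the first one ships every row to the client while the second one lets the database do the work and ships a single row.

```python
import sqlite3


def make_db(n_rows=1000):
    """Create an in-memory table of hypothetical order amounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER)")
    conn.executemany(
        "INSERT INTO orders (amount) VALUES (?)",
        [(i % 50,) for i in range(n_rows)],
    )
    return conn


def total_client_side(conn):
    """Move data to the computation: fetch every row, then sum in Python."""
    rows = conn.execute("SELECT amount FROM orders").fetchall()
    return sum(amount for (amount,) in rows), len(rows)


def total_in_database(conn):
    """Move computation to the data: the database returns one aggregated row."""
    rows = conn.execute("SELECT SUM(amount) FROM orders").fetchall()
    return rows[0][0], len(rows)


conn = make_db()
# Each call returns (total, number of rows transferred to the client).
print(total_client_side(conn))
print(total_in_database(conn))
```

With an in-process database like SQLite the difference is small, but over a network the number of rows transferred is usually what hurts.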

For those of you frequently following my blog, you will not be surprised that I am very thrilled about the above evolutions in data computing. The ever increasing consternation with ORMs and the great deal of confusion about the future of “NoSQL” have led to a recent revival of SQL as a language.

Back to the roots.

This seems to have culminated at the recent O’Reilly Strata Conference, where Mark Madsen, a popular researcher and analyst, was walking around with a geeky T-Shirt showing the History of NoSQL. I’ve had a brief chat with him on Twitter. He might be selling this T-Shirt, if it goes viral.

History of NoSQL by Mark Madsen. Picture published by Edd Dumbill

So apparently, SQL is back, and strong as ever!

Why Staying in Control of Your SQL is so Important

Lots of blog posts and research papers are written about the topics of scaling up and scaling out. This interesting blog post, for instance, sheds some light on the two strategies with respect to physical maintenance costs, such as cooling and electricity consumption. Certainly non-negligible aspects for very large systems.

But before solving problems at a very big scale, consider much simpler SQL tuning mechanisms. Very often, when you have a bottleneck in your application, it is at the database layer. This fact is used by many NoSQL evangelists to promote their products, claiming that scaling out is much easier with NoSQL databases. This might be true, but ask yourself: Do you need a system that works under heavy load? Or is your bottleneck a performance bottleneck?

In other words: Do you have a 5’000’000-concurrent-users problem? Or do you have a request-takes-more-than-3-seconds problem? Because if you suffer from the latter, you probably need to scale neither out nor up. Your “traditional” architecture is probably quite fine, but your database / SQL queries aren’t. The popular use-the-index-luke.com website features an interesting article about the two top performance problems caused by ORM tools. They are:

  • The infamous N+1 selects problem
  • The hardly-known Index-Only Scan

Both problems result from the fact that popular ORMs generate SQL code in a way that can hardly be influenced by developers. People often claim that tools like Hibernate generate better SQL than the average developer. This is true to the extent that the average developer might never care to actually learn to write better SQL. Which in turn leads to the above problems.
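The N+1 selects problem mentioned above can be sketched without any ORM. The following Python example uses the standard-library sqlite3 module; the author and book tables are hypothetical and only serve the illustration. The first function mimics what a lazily loading ORM typically does (one query for the parents, then one query per parent), the second fetches the same data with a single join.

```python
import sqlite3


def make_db():
    """Create a tiny, hypothetical author/book schema in memory."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE book (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
        INSERT INTO author VALUES (1, 'A'), (2, 'B'), (3, 'C');
        INSERT INTO book VALUES (1, 1, 'A1'), (2, 1, 'A2'), (3, 2, 'B1');
    """)
    return conn


def titles_n_plus_one(conn):
    """One query for all authors, plus one query per author for the books."""
    queries = 1
    authors = conn.execute("SELECT id, name FROM author ORDER BY id").fetchall()
    result = {}
    for author_id, name in authors:
        queries += 1
        books = conn.execute(
            "SELECT title FROM book WHERE author_id = ? ORDER BY id",
            (author_id,),
        ).fetchall()
        result[name] = [title for (title,) in books]
    return result, queries


def titles_single_join(conn):
    """The same result from a single statement using an outer join."""
    rows = conn.execute("""
        SELECT a.name, b.title
        FROM author a
        LEFT JOIN book b ON b.author_id = a.id
        ORDER BY a.id, b.id
    """).fetchall()
    result = {}
    for name, title in rows:
        result.setdefault(name, [])
        if title is not None:  # authors without books yield a NULL title
            result[name].append(title)
    return result, 1
```

With three authors the first variant already issues four statements. With thousands of parent rows, each extra statement adds a round trip to the database.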

Hibernate is very good at generating 70% of your application’s boring CRUD SQL. At the same time, Hibernate never claimed to be a replacement for SQL, as you will have to resort to native SQL for the remaining 30%. Should you then use Hibernate’s native SQL API? Or an alternative like jOOQ?

The important thing is to get back in control of your SQL when performance matters. Know your indexes. Know your database metadata. And use a tool that allows you to write precisely the SQL statement you want. Learning better SQL will help you save lots of money on operations costs, as you might not need to scale out or up at all.
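As a small illustration of “know your indexes”, the sketch below asks SQLite for its query plan, using Python’s standard-library sqlite3 module. The orders table and the index name are made up for the example. SQLite calls an index-only scan a “covering index”; other databases use different wording and different EXPLAIN syntax, so treat this as a sketch of the idea rather than a recipe for your database.

```python
import sqlite3


def query_plan(conn, sql):
    """Return SQLite's plan for a statement as one string.

    The last column of each EXPLAIN QUERY PLAN row holds the readable detail.
    """
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, customer_id INTEGER, total INTEGER, note TEXT)"
)
# Hypothetical index containing both the filter column and the selected column.
conn.execute(
    "CREATE INDEX idx_orders_customer_total ON orders (customer_id, total)"
)

# All referenced columns are in the index: the table itself need not be read.
covered = query_plan(
    conn, "SELECT customer_id, total FROM orders WHERE customer_id = 42"
)

# 'note' is not in the index: the index finds the rows, the table is read too.
not_covered = query_plan(
    conn, "SELECT customer_id, note FROM orders WHERE customer_id = 42"
)

print(covered)
print(not_covered)
```

Selecting one more column than the index contains is enough to lose the index-only access path. An ORM that always fetches all columns of an entity makes that hard to avoid.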

jOOQ Newsletter October 10, 2013

Subscribe to this newsletter here.

jOOQ 3.2 Released

After a bit of time, jOOQ 3.2 has finally been released. This interesting release mainly includes two new SPIs (Service Provider Interfaces), which allow for:

  • Injecting pre and post CRUD operation behaviour, which is useful for global ID generators.
  • Injecting behaviour into the SQL rendering lifecycle allowing for arbitrary SQL transformations. This is very useful for things like shared-schema multi-tenancy.

Besides, the code generator has seen lots of improvements. More details can be seen in the release notes.

jOOQ Licensing and Support

With the new jOOQ 3.2, apart from introducing great new features, we are changing quite a few things on how we operate. At Data Geekery GmbH, we believe in Open Source. But we also believe in the creative power enabled by commercial software. This is why we have chosen to implement a dual-licensing strategy. Read more about this strategy here:

https://blog.jooq.org/2013/10/09/jooq-3-2-offering-commercial-licensing-and-support

SQL2jOOQ

A very interesting open source tool is being developed by a third-party vendor called GUDU Soft in cooperation with Data Geekery: SQL2jOOQ. This tool is based on GUDU Soft’s SQL Parser application, which can parse a variety of text-based SQL strings, transform ASTs and render new dialects. One of these output dialects is jOOQ. For jOOQ developers migrating large amounts of legacy SQL to jOOQ, this will be an invaluable tool in the tool chain.

MongoDB and NoSQL Heat

MongoDB was in the news big time last week when they announced their recent raise of $150 million in a venture-funding round. While cynics claimed that they failed to write the transaction to disk, this is still a very important milestone for 10gen. After all, MongoDB is the only non-relational DBMS figuring in a quite objective top 10 rating considering 194 systems.

To SQL or not to SQL? Our take from the jOOQ perspective is clear. Any competitor in the market dominated by Oracle and SQL Server is good, even if (or maybe because) it is a non-SQL vendor:

https://blog.jooq.org/2013/10/02/the-premature-return-to-sql/

On MongoDB’s Success. Or Do Not Let Cynicism Kill Your Spirit

Success is a strange beast. On the bright side and true to entrepreneurial spirit, people admire those who are obviously successful. In the theory of our mostly capitalist society, successful people worked hard for their success, and thus deserve it. In practice, there may be other mechanisms of success than working hard, but that’s another story which is quite off-topic on this blog.

MongoDB is successful. Extremely successful, as they have just raised $150M in a venture-funding round. They must be doing something right, at least from a marketing and sales perspective. As I admire this success, and as I am hoping to be similarly successful with jOOQ, I wanted to throw the story in at reddit for discussion. I’m curious what other business-oriented people think of this. But as always, when talking about MongoDB, cynicism arose. Heavily. See the following examples:

Cynicism about the tech quality

Cynicism about the tech quality

Cynicism about getting rich

Cynicism about getting rich

This was really disappointing to me. Is this just reddit being full of trolls? Or are people heavily envious if someone stands out and is successful? You may think about MongoDB whatever you want. But there is no doubt that this valuation and venture-funding round is proof of MongoDB’s continuous success over the past years. Apparently, they are responding to a real market demand with real market expectations. Raising this money will hopefully help them meet those expectations even better in the future.

Do not let cynicism kill your spirit

No matter if this is just reddit or the “IT community” in general, it is not worth delving into such cynical discussions. I congratulate 10gen for being the only NoSQL database vendor in the DBMS popularity ranking that I’ve previously posted:

DBMS ranking. Reproduced with permission of DB-Engines.com

And I most certainly congratulate 10gen on this incredible milestone in their company history.

The 10 Most Popular DB Engines (SQL and NoSQL)

How to objectively measure the popularity of a DB engine? Good question! And there’s an Austrian company (Solid IT) that claims to have the answer. The company focuses on “Big Data and NoSQL”, but this focus does not seem to have biased the result of the measurement. Among the top 10 database engines, MongoDB is the only one that is not an RDBMS. And it’s astonishing just how popular MongoDB seems to be (although, they must be doing something right)!

Reproduced with permission of DB-Engines.com

Now, I’m not surprised by the top 3. I am definitely surprised by the fact that PostgreSQL and SQLite are not more popular. I am also surprised that there aren’t more “wide-column stores” among the top 10. Maybe Michael Stonebraker has to review his claims about the traditional RDBMS wisdom being all wrong?

And what about the other databases supported by jOOQ? Where are the Java databases? Here’s a condensed view of the ranking, consisting only of the 15 databases currently supported by jOOQ 3.1:

Reproduced with permission of DB-Engines.com

It turns out that Java databases (Derby, H2, HyperSQL) are not so popular compared to all the others. It also turns out that MariaDB still has a lot of ground to gain, compared to MySQL.

The ranking considers a lot of data from various somewhat authoritative sources as is explained here. These include:

  • Number of mentions of the system on websites. Measured through search engine results.
  • General interest in the system. Measured through Google Trends.
  • Frequency of technical discussions about the system. Measured through Stack Overflow and similar.
  • Number of job offers, in which the system is mentioned. Measured through Indeed and similar.
  • Number of profiles in professional networks, in which the system is mentioned. Measured through LinkedIn.

This ranking is certainly something to keep an eye on!

The Premature Return to SQL

In online communities, the NoSQL topic (much like the ORM topic) is a guarantee to stir emotions. Many emotions are stirred by evangelists on either side for ideological or marketing reasons. Here’s an interesting post by Alex Popescu, a passionate NoSQL and polyglot persistence evangelist, claiming that the recent trend to return to SQL is premature:

This post triggered an equally interesting reaction by Markus Winand, author of SQL Performance Explained:

It’s really interesting, how often people think in terms of “trends” that introduce novel paradigms, obsoleting all we had before. I believe that these are not trends, but experiments. I’ve blogged before that you should be wary when NoSQL vendors promise you to put an end to DBAs. Very few “new” solutions or paradigms have ever completely replaced or substituted their predecessors. Or, in Isaac Newton’s words:

If I have seen further it is by standing on the shoulders of giants.

We’re not “returning to SQL”, nor is such a return “premature”. Yes, there are some innovative thinkers who are teaching an old elephant new tricks, and that’s good. It’s also good that such innovative thinkers get a piece of the cake and make money with their inventions.

It is also true that big database vendors are not very innovative. But they don’t have to be. Their asset is reliability, predictability, stability. Oracle SQL will still support all its age-old legacy in 15 years, which makes it a safe choice for banks and insurance companies. If a NoSQL or NewSQL feature proves to be innovative and reliable, Oracle et al. will most certainly pick it up and integrate it into SQL. Clever NoSQL vendors thus already prepare for their exits.

This happens outside the world of databases, of course:

  • Scala is innovative and contributes to Java (Generics in Java 5, Lambdas in Java 8).
  • Open Source developers (e.g. those of JAX-RS) are innovative and contribute to JEE.
  • PostgreSQL is innovative and contributes to other SQL dialects and eventually the SQL standard.
  • Instagram is innovative and contributed to Facebook (“shit happens!”).
  • jOOQ is innovative and contributes to JDBC and JPA (eventually, hopefully).

SQL is a safe bet and is here to stay.

MIT Prof. Michael Stonebraker: “The Traditional RDBMS Wisdom is All Wrong”

A very interesting talk about the future of DBMS was recently given at EPFL by MIT Professor and VoltDB Co-founder and CTO Michael Stonebraker, who also gave us Ingres and Postgres. In a bit less than one hour, he explains his views with respect to the three main pillars of database management systems:

  • OLAP / Data warehouses
  • OLTP
  • Other types of data stores

As a NewSQL vendor also actively involved with H-Store, he is of course heavily yet refreshingly biased towards traditional RDBMS storage models being obsolete. (An interesting fact is that Oracle Labs representative Eric Sedlar also attended the talk; one might think that the talk was a slightly FUD-dy move against a VoltDB competitor.) Unlike what has come to be known as the NoSQL movement, NewSQL relies on relational theory / set theory much like “traditional SQL”, including support for ACID and structured data.

His claims mainly include that:

  • OLAP / data warehouses will migrate to column-based data stores within 10 years. The traditional row-based data storage approach is dead, as row-based storage will never match the performance of column-based storage, which he claims is faster by a factor of 100.
  • For OLTP, the race for the best data storage designs has not yet been decided, but there is a clear indication of classic models being “plain wrong” (according to Stonebraker), as only 4% of wall-clock time is spent on useful data processing, while the rest is occupied with buffer pools, locking, latching, recovery.
Image from Stonebraker’s presentation depicting the amount of “useful” work performed by any RDBMS
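The OLAP claim above is easier to picture with a toy model of the two storage layouts. The following Python sketch is my own illustration, not something from the talk. The data, the column names and the “values read” counter are assumptions, and the counter is a crude proxy for I/O, not a benchmark. It only shows why an aggregation over one column touches less data in a column layout, and says nothing about the factor 100 figure.

```python
# Hypothetical table with three columns, stored row by row.
COLUMNS = ("id", "region", "amount")
ROWS = [(1, "EU", 10), (2, "US", 20), (3, "EU", 5)]


def to_column_layout(rows):
    """Pivot row tuples into one list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(COLUMNS)}


def sum_row_layout(rows, column):
    """Sum one column; each whole row is loaded to get at a single value."""
    index = COLUMNS.index(column)
    total, values_read = 0, 0
    for row in rows:
        values_read += len(row)  # all fields of the row are touched
        total += row[index]
    return total, values_read


def sum_column_layout(columns, column):
    """Sum one column; only that column's values are touched."""
    values = columns[column]
    return sum(values), len(values)


print(sum_row_layout(ROWS, "amount"))
print(sum_column_layout(to_column_layout(ROWS), "amount"))
```

The wider the table and the fewer columns a query aggregates, the bigger this gap becomes, which is the typical shape of data warehouse queries.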

I specifically recommend the OLTP part of his talk, as it shows how various new techniques could heavily increase performance of traditional RDBMS already today:

  • Most OLTP systems can afford to buy the amount of memory needed to keep data off the disk. This will remove the need for a buffer pool.
  • Single-threading would get rid of the latching overhead. H-Store and VoltDB statically divide shared memory among the cores, for instance. This is very important as latching gets worse and worse with the increasing number of cores we have today.
  • Dynamic locking is not really implemented in any popular RDBMS, but the market is uncertain which workaround best implements concurrency control. In his opinion, MVCC is not going to do the trick in the long run.
  • ACIDness is something that even Jeff Dean from Google admits to miss, once it’s gone, as eventual consistency does not really keep its promise.
  • In a cluster, active-active consistency management can increase log throughput by a factor of 3, compared to active-passive logging. (active-active = the transaction is run on every node, active-passive = the transaction is run only on the master node, and the log is sent to all slave nodes)
  • And also, very importantly, anti-caching is a good technique when the in-memory format matches the disk format, as traditional RDBMS spend a substantial amount of time converting disk data formats (blocks, sectors) into memory formats (actual data).
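To put the 4% figure into perspective, here is a back-of-the-envelope calculation in Python. It is my own arithmetic, not a number from the talk, and it assumes that the useful work itself cannot be sped up and that overhead can be removed independently of it. The function name and parameters are hypothetical.

```python
def max_speedup(useful_fraction, overhead_removed=1.0):
    """Upper bound on the speedup when a share of the overhead disappears.

    useful_fraction:  share of wall-clock time spent on useful work (0..1)
    overhead_removed: share of the remaining overhead that is eliminated (0..1)
    """
    overhead = 1.0 - useful_fraction
    remaining_time = useful_fraction + overhead * (1.0 - overhead_removed)
    return 1.0 / remaining_time


# All buffer pool, locking, latching and recovery overhead removed.
print(max_speedup(0.04))

# Only half of that overhead removed.
print(max_speedup(0.04, overhead_removed=0.5))
```

If all of the overhead were gone, the ceiling would be 1 / 0.04, that is, about 25 times the original throughput. Removing only half of it yields less than a factor of 2, which shows why the techniques above are interesting mostly in combination.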

The essence of Stonebraker’s talk is that the “elephants” who currently dominate the market are too slow to react to all the NewSQL vendors’ innovations. It is a very exciting time for a database professional (some refer to them as data geeks) to enter the market and publish new findings.

Another interesting thing to note is that SQL (call it NewSQL, OldSQL) will remain a dominant language for querying DBMS, for column stores as well as for row stores. This is a strong statement for tools like jOOQ, which embrace SQL as a first-class citizen among programming languages.

See the complete talk by Michael Stonebraker here:

See Stonebraker’s Talk here: http://slideshot.epfl.ch/play/suri_stonebraker

Further reading:

Pinterest and SQL vs. NoSQL

I’ve recently discovered a very interesting read about Pinterest’s architecture experimentation. One of the key messages is the fact that SQL and NoSQL data storage systems can coexist with each of them having their place. Here’s the full article:

http://highscalability.com/blog/2013/4/15/scaling-pinterest-from-0-to-10s-of-billions-of-page-views-a.html

This reminds me of a previous article about Instagram successfully displaying how they implemented large-scale sharding on Postgres:

https://blog.jooq.org/2011/12/10/a-success-story-of-sql-scaling-horizontally/

A map of all those new NoSQL, NewSQL, post-SQL, structured, unstructured database options that came out over the past year

So you want to go with the flow and implement your next application on top of some NoSQL, NotJustSQL, NewSQL, AlmostSQL, SQL++, NextGenSQL, and what not, just to be sure not to miss out on some of the latest developments in the data business? Here’s a little map to guide you through the jungle of choices:

http://gigaom.com/cloud/confused-by-the-glut-of-new-databases-heres-a-map-for-you/

… or you just stick with the relational data model and some decent RDBMS like Oracle, SQL Server, or Postgres and wait until things settle a little bit :-)

“NoSQL” should be called “SQL with alternative storage models”

Time and again, you’ll find blog posts like this one here telling you the same “truths” about SQL vs. NoSQL:

http://onewebsql.com/blog/no-sql-do-i-really-need-it
(OneWebSQL being a competitor of jOOQ, see a previous article for a comparison)

Usually, those blogs make the same arguments:

  • Performance (“SQL” can “never” scale as much as “NoSQL”)
  • ACID (you don’t always need it)
  • Schemalessness (just store any data)

For some funny reason, all of these ideas have led to the misleading term “NoSQL”, which is interpreted by some as being “no SQL”, by others as being “not only SQL”. But SQL really just means “Structured Query Language”, and it is extremely powerful in terms of expressing relational context. It is well-designed for ad-hoc creation of tuples, records, tables, sets and for mapping them to other projections, reducing them to custom aggregations, etc. Note the terms “map/reduce”, which are often employed by NoSQL evangelists.
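The remark about “map/reduce” can be illustrated with a short sketch. The Python example below computes the same per-region totals twice: once with an explicit map and reduce, once with a SQL GROUP BY run through the standard-library sqlite3 module. The sales data and the function names are invented for the illustration.

```python
import sqlite3
from functools import reduce

# Hypothetical input records: (region, amount, comment)
SALES = [("EU", 10, "first"), ("US", 20, "second"), ("EU", 5, "third")]


def totals_map_reduce(records):
    """Map each record to a (key, value) pair, then reduce per key."""
    def add_to_totals(totals, pair):
        region, amount = pair
        totals[region] = totals.get(region, 0) + amount
        return totals

    mapped = map(lambda record: (record[0], record[1]), records)  # projection
    return reduce(add_to_totals, mapped, {})                      # aggregation


def totals_sql(records):
    """The same projection and aggregation expressed declaratively in SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER, comment TEXT)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
    query = "SELECT region, SUM(amount) FROM sales GROUP BY region"
    return dict(conn.execute(query))


print(totals_map_reduce(SALES))
print(totals_sql(SALES))
```

The SELECT list plays the role of the map step and GROUP BY with SUM plays the role of the reduce step. How the database distributes that work is a storage and execution concern, not a language concern.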

For good reasons, the Facebook Query Language (FQL), one of the leading NoSQL query languages, closely resembles SQL although it operates on a completely different data model. Oracle, too, has jumped on the “NoSQL” train and sells its own product. It won’t be very long until the two types of data storage merge and can be queried by an ISO/IEC standardised SQL:2015 (or so). Because the true spirit of “NoSQL” does not consist in the way data is queried. It consists in the way data is stored. NoSQL is all about data storage. So, sooner or later, you will just create “traditional” tables along with “graph tables” and “hashmap tables” in the same database and join them in single SQL queries without thinking much about today’s hype.

“NoSQL” should be called “SQL with alternative storage models” and queried with pure SQL!