When to Choose SQL and When to Choose NoSQL


Some people make architecture decisions purely based on whatever the loudest consultant says.

For most others, however, decisions are not as simple as this. For instance: When should we start evaluating NoSQL storage systems as an alternative to RDBMS?

RDBMS as the hammer for all nails

This question obviously assumes that you’re starting out with an RDBMS, which is classically the database system that solves pretty much any problem decently enough not to be replaced easily. What does this mean? Simply put:

  • RDBMS have been around forever, so they have a huge advantage over the “newcomers” in the market, which don’t yet have all the excellent tooling, community, support, and maturity
  • E.F. Codd’s work may have been the single biggest influence on our whole industry. There has hardly been anything as revolutionary as the relational model ever since. It’s hard for alternative databases to be equally universal, i.e. they mostly solve niche problems

Having said that, sometimes you do have a niche problem. For instance, a graph database problem. In fact, a graph is nothing fundamentally different from what you can represent in the relational model. It is easy to model a graph with a many-to-many relationship table.
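A minimal sketch of that encoding, assuming hypothetical node and edge tables (standard SQL):

CREATE TABLE node (
  id   INT PRIMARY KEY,
  name VARCHAR(50) NOT NULL
);

CREATE TABLE edge (
  -- An edge is just a row in a many-to-many relationship table
  from_id INT NOT NULL REFERENCES node (id),
  to_id   INT NOT NULL REFERENCES node (id),
  PRIMARY KEY (from_id, to_id)
);

-- "Which nodes does 'Alice' point to?"
SELECT n2.name
FROM node n1
JOIN edge e ON e.from_id = n1.id
JOIN node n2 ON n2.id = e.to_id
WHERE n1.name = 'Alice';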

The same is true for XML/JSON in the database (don’t forget, JSON is just XML with less syntax and fewer features, so it’s less awesome than XML). Sometimes you need to actually store document structures (hierarchical data) in their hierarchical form in your database, instead of normalising them. Sure, you could normalise a document, but that’s a lot of work and not necessarily beneficial. This was actually the question Gareth asked on Twitter, which triggered the subsequent discussion.

Most modern RDBMS offer means to store (and much more importantly, query) XML/JSON data structures in simple forms. This includes PostgreSQL, Oracle, DB2, SQL Server and others.
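As a hedged sketch of what that looks like (PostgreSQL’s jsonb syntax; the table and documents here are made up for illustration):

CREATE TABLE orders (
  id       SERIAL PRIMARY KEY,
  document JSONB NOT NULL
);

INSERT INTO orders (document) VALUES
  ('{"customer": "Gareth", "items": [{"sku": "A1", "qty": 2}]}');

-- Query into the document, without normalising it first:
SELECT document ->> 'customer' AS customer
FROM orders
WHERE document -> 'items' @> '[{"sku": "A1"}]';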

So, when do we decide to switch?

As developers, we may be inclined to switch quite quickly. E.g. when working with graphs, we’d love to use Neo4j, which implements the awesome Cypher query language. When working with JSON, we’d love to use something like Couchbase, which implements the interesting N1QL query language. Both of these query languages are heavily inspired by SQL, which, in my opinion, were wise vendor choices (as opposed to MongoDB’s weird JSON-based query language), because in the end, SQL is still the most powerful and popular 4GL ever created.

But as developers, we shouldn’t make that decision lightly. Yes, at first, these specialised databases seem like a better fit. But the additional cost for operations teams to maintain, monitor, patch, tune production systems must not be underestimated. This is even true within the RDBMS ecosystem. A prominent recent example was Uber’s switch (back!) from PostgreSQL to MySQL:

Why Uber Engineering Switched from Postgres to MySQL

Do note that they had switched the other way round before, only to regret it. Truth is, there are tons of reasons why your operations teams prefer to always use the same database, even if that’s quite expensive in terms of licensing. In many cases, however, it would be even more expensive to:

  • Engage in additional license and/or support contracts with new database vendors
  • Find skilled DBAs for the new technology (which can be very hard with niche databases)
  • Maintain two data silos and possibly sustain the cost of keeping them in sync

Ultimately, there’s a threshold beyond which switching pays off.

In terms of using JSON in the database, it’s simple:

  • Occasional JSON storage: Stick with RDBMS.
  • Everything is JSON: Perhaps not RDBMS.

The same is true for graph problems. SQL can totally handle graphs and traverse them with recursive SQL. The article linked below shows a funky statement that recursively calculates the subset sum problem.
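A minimal sketch of the recursive-CTE idea (PostgreSQL syntax with a hypothetical items table; not the article’s original statement):

WITH RECURSIVE
  items (id, v) AS (
    VALUES (1, 3), (2, 7), (3, 12), (4, 5)
  ),
  -- Enumerate all subsets by only ever adding items with a higher id
  sums (ids, total, last_id) AS (
    SELECT CAST('' AS TEXT), 0, 0
    UNION ALL
    SELECT sums.ids || ',' || items.id, sums.total + items.v, items.id
    FROM sums
    JOIN items ON items.id > sums.last_id
  )
-- The subset whose sum is closest to the target value 20
-- (ids carries a leading comma, for simplicity)
SELECT ids, total
FROM sums
ORDER BY ABS(total - 20)
FETCH FIRST 1 ROW ONLY;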


Full article here: How to Find the Closest Subset Sum with SQL

If you only have a few tree/graph traversals to calculate (e.g. for a simple menu structure), don’t jump ship from your RDBMS yet. If graphs are your business and that’s all you do, then RDBMS probably aren’t the right choice.

Conclusion

Whatever problem you’re solving, do remember: yes, when all you have is a hammer, every problem starts looking like a nail. But don’t dismiss the RDBMS as a silly hammer. It is a very mighty hammer, and in 2016, it can do a lot of non-relational niche things decently well.

RDBMS are still the best default choice for all sorts of data problems. Only once you have gone above a certain threshold (or if you can foresee doing so) should you look for alternatives. Because those alternatives will give you a much rougher time when you step outside their niche (JSON, graphs, etc.) back into your “ordinary” relational business.

The 10 Most Popular DB Engines (SQL and NoSQL) in 2015


About two years ago, we published this post about the 10 most popular DB engines, in which we analysed the data published by Solid IT on their DB-Engines Ranking website.

In the meantime, the Solid IT measurement system has proven to be a credible source, such that the website has also been cited by Gartner, InfoWorld, and many other sources. For more details about how the ranking is measured, check out the relevant page on db-engines.com:
http://db-engines.com/en/ranking_definition

Comparing the top 10 list, we can see that the players have shifted, heavily:

(Ranking charts reproduced with permission of DB-Engines.com)

The top 3 elephants are still Oracle, MySQL and Microsoft SQL Server, but the runners-up have changed. While PostgreSQL is still gaining traction, it has lost ground compared to MongoDB.

Also, Cassandra and Redis have pushed out Sybase and Teradata from the top 10!

Whereas two years ago there was only a single non-RDBMS in this top 10 list, there are now three, all of which are “schema-less”, and they’re gaining additional momentum.

Clearly, vendors of RDBMS will need to move quickly to accommodate the needs of document storage and key-value storage the way their new competitors do.

For us on the jOOQ blog, a much more interesting perspective is to see where our supported databases currently are in this ranking:

(Ranking charts reproduced with permission of DB-Engines.com)

With the recent release of jOOQ 3.7, we’ve added another three databases to our list of now 21 supported RDBMS. Compared to last year’s ranking, almost all of these RDBMS are gaining traction as well, apart from SQL Server.

One thing is certain: We still have very exciting times ahead. Stay tuned for more updates, and check out the current ranking here:

http://db-engines.com/en/ranking

“NoSQL Injection” – What 40000 Unsecured MongoDB Databases Mean for our Industry


The news is all over reddit:

Major security alert as 40,000 MongoDB databases left unsecured on the internet

Security is a feature that is often neglected until it’s too late. And when it’s too late, it is often hard to bake it into a well-established architecture without major refactoring efforts.

Every system, and thus also every database, is always vulnerable. Most databases, however, do offer a significant number of features to implement a security layer – and MongoDB is no different from any other DBMS here.
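In the SQL world, for instance, locking a schema down is one statement per privilege. A minimal sketch, with hypothetical role and table names:

-- Close the open door first, then grant only what is needed:
CREATE ROLE reporting_reader;
REVOKE ALL PRIVILEGES ON customer_data FROM PUBLIC;
GRANT SELECT ON customer_data TO reporting_reader;

So, how could this massive security hole happen?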

Security is a cultural thing. Either a company has security in its DNA, or it doesn’t. The same is true for scalability, or user experience, or any other aspect of software engineering. I’ve worked for companies that are at completely opposite ends of security awareness. Some (in the e-banking field) were ultra-paranoid, implementing thorough security checks in around 7 layers of the application. Others were rather lenient, with management focusing much more on marketing than anything else. Without claiming any empirical evidence, I observed a certain correlation between a company’s security awareness and how backend-oriented that company was, e-banking being a very backend-oriented business.

Backend developers are more security aware

This is an over-generalisation and probably doesn’t do justice to many excellent frontend developers out there, but security is where the data is. Where the algorithms are. Where people reason about constraints, workflows, batch jobs, accounting, money, … algorithms. These folks focus on all the users. On the system. And they want to protect it. On the flip side, they might neglect usability.

There is little security awareness where the user experience is. Where people reason about layout, formatting, usability, style, … user interfaces. These folks focus on single users. On their experience. And they want to make things easy for the user.

(and again, the same is true for scalability)

It is no coincidence that backend technology evolves extremely slowly. Java: 20 years and we’ve just finally gotten lambdas. SQL: 30 years and we still don’t have easy ways to reuse code.

At the same time, frontend technology evolves at the “speed of reddit”. The next hype is just 100 karma away, and we’ll throw all the previous tech out of the window, just to be part of the game.

Clearly, security is something that has to be reasoned about way too thoroughly for it to survive in the fast-paced frontend world.

What does MongoDB have to do with it?

The current event isn’t actually directly related to MongoDB (you could probably find just as many unprotected MySQL instances out there). But it strongly correlates with MongoDB’s sales and marketing strategies. MongoDB has done very aggressive and successful marketing in the past, claiming that the reign of the RDBMS is over – just as much as the reign of the RDBMS had been declared over before, when the astonishing object databases surfaced on this planet. Well, we all know where object and XML databases went…


This time, the anti-RDBMS marketing resonated mostly with frontend developers, obviously, because JSON is their favourite data representation format, and MongoDB promised to be able to store data directly from the DOM into the DB. Not only did this mean “the end of the DBA” for some software vendors, but many vendors also hoped that they could omit operations, and perhaps even backend development. What obviously worked well for prototyping and simple applications doesn’t scale well to applications with sensitive data.

The Solution

The solution is obvious. Homogeneity kills your business. You should hire a variety of different types of personnel. You should have skilled frontend developers, backend developers, operations people, DBAs, and security experts on your team. You should make them all work together, hear each of their opinions, have them review each other’s code, and learn from each other. Because each one of them has a strong focus on, and interest in, an entirely different, yet equally important aspect of your application.

Do not neglect any of these aspects. Because if you do, and if it’s security, and if you lose sensitive customer data – well, you’re not going to stay in business; you’ll be sued in court.

Got hooked on the security topic?

Continue reading about …

Top 10 Very Very VERY Important Topics to Discuss


Some things are just very very very VERY very important. Such as John Cleese.

The same is true for Whitespace.

Yes. 1080 Reddit Karma points (so urgently needed!) in only 23 hours. That’s several orders of magnitude better than anything that our – what we wrongfully thought to be – very deep and interesting technical insights about Java and SQL have ever produced.

The topic of interest was a humorous treatise about whether this:

for (int i=0; i<LENGTH; i++)

… or this:

for (int i = 0; i < LENGTH; i++)

… should be preferred. Obviously both options are completely wrong. The right answer is:

for 
(   int i = 0
;   i < LENGTH
;   i++
)

Read the full treatise here.

But at some point, the whitespace discussion is getting stale. We need new very very very important topics to discuss instead of fixing them bugs. After all, the weekend is imminent, and we don’t know what else to talk about.

This is why we are now publishing…

Top 10 Very Very VERY Important Topics to Discuss

Here we go…

0. Whitespace

OK, that was a no-brainer. We’ve already had that. Want to participate? The very interesting Reddit discussion is still hot.

1. The Vietnam of Computer Science

In case you haven’t heard of this highly interesting discussion, there are some people who believe that ORMs are outdated, because ORMs don’t work as promised. And they’re totally right. And the best thing is, all the others are totally right as well.

Why is that great? Because that means we get to discuss it. Endlessly!

While everyone keeps talking about ORMs like that, no one cares what Gavin King (creator of Hibernate) had said from the beginning.

Why should we care about his opinion? We have our own, far superior opinion! Let’s have another discussion about why ORMs are evil!

2. Case-sensitivity

Unfortunately, we Java folks cannot have any of those very very very very very important discussions about casing, because Java is a case-sensitive language.

But take SQL (or PL/SQL, or T-SQL for that matter). When writing SQL, we can have awesome discussions about whether we should

-- Upper case it all
SELECT TAB.COL1, TAB.COL2 FROM TAB

-- Upper case keywords, lower case identifiers
SELECT tab.col1, tab.col2 FROM tab

-- Lower case keywords, upper case identifiers
select TAB.COL1, TAB.COL2 from TAB

-- Lower case it all
select tab.col1, tab.col2 from tab

-- Add some PascalCase (awesome SQL Server!)
SELECT Tab.Col1, Tab.Col2 FROM Tab

-- Mix case-sensitivity with case-insensitivity
-- (Protip to piss off your coworkers: Name your
-- table "FROM" or "INTO" and let them figure out
-- how to query that table)
SELECT TAB."COL1", TAB."col2" FROM "FROM"

-- PascalCase keywords (wow, MS Access)
Select TAB.COL1, TAB.COL2 From TAB

Now that is really incredibly interesting. And because this is so interesting and important, you can only imagine the number of interesting discussions we’ve had on the jOOQ User Group, for instance, about how best to generate metadata from the database. With jOOQ, we promise that you can extend these enticing discussions from the SQL area to the Java area by overriding the code generator’s default behaviour:

  • Should classes be PascalCased and literals be UPPER_CASED?
  • Should everything be PascalCased and camelCased as in Java?
  • Should everything be generated as named in the database?

Endless interesting discussions!

jOOQ: The Best Way to Write SQL in Java

We have so many options for SQL casing, which brings us to…

3. SQL formatting

Unlike C-style general-purpose languages such as C, Java, Scala, and C#, or even keyword-heavy ones like Delphi, Pascal, and Ada, SQL offers even more awesome grounds for numerous discussions. It is not only keyword-heavy, but it also has a very complex and highly irregular syntax. So we’re lucky enough to get to choose (after long discussions and settlements) between:

-- All on one line. Don't tell me about print margins,
-- Or I'll telefax you my SQL!
SELECT table1.col1, table1.col2 FROM table1 JOIN table2 ON table1.id = table2.id WHERE id IN (SELECT x FROM other_table)

-- "Main" keywords on new line
SELECT table1.col1, table1.col2 
FROM table1 JOIN table2 ON table1.id = table2.id 
WHERE id IN (SELECT x FROM other_table)

-- (almost) all keywords on new line
SELECT table1.col1, table1.col2 
FROM table1 
JOIN table2 
ON table1.id = table2.id 
WHERE id IN (SELECT x FROM other_table)

-- "Main" keywords on new line, others indented
SELECT table1.col1, table1.col2 
FROM table1 
  JOIN table2 
  ON table1.id = table2.id 
WHERE id IN (
  SELECT x 
  FROM other_table
)

-- "Main" keywords on new line, others heavily indented
SELECT table1.col1, table1.col2 
FROM table1 JOIN table2 
              ON table1.id = table2.id 
WHERE id IN (SELECT x 
             FROM other_table)

-- Doge formatting
SUCH table1.col1,
                 table1.col2
    MUCH table1
JOIN table2 WOW table1.id
            = table2.id
WHICH              id IN
   (SUCH x

WOW other_table
            )

And so on and so forth. Now any project manager should reserve at least 10 man-weeks in every project to agree on rules about SQL formatting.

4. The end of the DBA

Now THAT is a very interesting topic, not only for developers who are so knowledgeable about production systems; no, it’s also very interesting for operations teams. Because as we all know, the DBA is dead (again).

For those of you who have been missing out on this highly interesting topic, do know that all of this started (again) when the great NoSQL vs. SQL debate was initiated by brilliant minds and vendors of truly alternative systems. Which are now starting to implement SQL, because apparently, well… SQL isn’t all that bad.

Please, do engage in some more discussions about the best and only true way to tackle database problems. Because your opinion counts!

5. New lines and comments

Remember our own blog post about putting some keywords on new lines? Yes, we prefer:

// If this
if (something) {
    ...
}

// Else something else
else {
    ...
}

Exactly. Because this allows comments to be written where they belong: Next to the appropriate keyword, and always aligned at the same column. This leads us to the next very interesting question: Should we put comments in code at all? Or is clean code self-documenting?

And we say, why yes, of course we should comment. How on earth will anyone ever remember the rationale behind something like this??

// [#2744] DB2 knows the SELECT .. FROM FINAL 
// TABLE (INSERT ..) syntax
case DB2:

// Firebird and Postgres can execute the INSERT 
// .. RETURNING clause like a select clause. JDBC
// support is not implemented in the Postgres JDBC
// driver
case FIREBIRD:
case POSTGRES: {
    try {
        listener.executeStart(ctx);
        rs = ctx.statement().executeQuery();
        listener.executeEnd(ctx);
    }
    finally {
        consumeWarnings(ctx, listener);
    }

    break;
}

Taken from our “hacking JDBC” page.

6. JSON is totally better than XML

Of course it is! Because… because… errr. Because it allows me to structure data hierarchically. Waaaait a second…

Dayum.

You’re saying, JSON and XML are the SAME THING!?

But MongoDB and PostgreSQL allow me to store JSON. Oh wait. They tried to store XML in databases, back in the 90s, too!? And it failed? Well, of course it failed, because XML sucks, right? (which is essentially another way of saying that I’ve never understood XSLT or XQuery or XPath or didn’t even hear about XProc, and I’m just ranting about angle brackets and namespaces)

Let’s further discuss this matter. I feel that we’re close to the very ultimate solution on that topic.

Speaking of JSON…

7. Curly braces

OMG! This is the most interesting of all topics. Should we put the opening brace:

  1. On the same line?
  2. On a NEW line??
  3. NO BRACE AT ALL???

The right answers are 1) and 3). 1) only if we absolutely have to, as in try or switch statements. We’re not paid by the number of lines of code, so we don’t add meaningless lines with only opening braces. And if we can omit the braces entirely, fine. Here’s an awesome statement, if you ask me:

if (something)
    outer:
    for (String thing : list)
        inner:
        do
            if (somethingElse)
                break inner;
            else
                continue outer;
        while (true);

That ought to teach them juniors not to touch my code. Which brings us to:

8. Labels

Nothing wrong with them. I’ll break out of my loops any time I want. Don’t tell me labels are Java’s way of saying GOTO, they’re much more sophisticated than that. (Besides, goto is a reserved word in Java, and it is an actual bytecode instruction). So I’ll happily do my jumping forward:

label: {
  // do stuff
  if (check) break label;
  // do more stuff
}

Or my jumping backward:

label: do {
  // do stuff
  if (check) continue label;
  // do more stuff
  break label;
} while(true);

(observe how the above example used two spaces for indentation instead of four (or tabs). Another great topic for further discussion)

9. emacs vs. vim vs. vi vs. Eclipse vs. IntelliJ vs. Netbeans

Can we please, PLEASE, have another very interesting discussion about which one of these is better? Please!

10. Last but not Least: Is Haskell better than [your language]?

According to TIOBE, Haskell ranks 38.

And as we all know, the actual market share (absolutely none in the case of Haskell) of any programming language is inversely proportional to the amount of time spent on reddit discussing the importance of said language, and how said language is totally superior to the one ranking 1-2 places above it on TIOBE. Which, in this case, would be Lua.

So, I would love to invite you to join our blogging friends below to a very very interesting discussion about…

Now, of course, we could enlarge the debate and compare functional programming with OO programming in general, before delving into why Scala is NOT a functional programming language, let alone Java 8.

Oh, and you think your dialect of Haskell or Lisp is not good enough, so you should roll your own? Go ahead (right after checking this checklist!)

Such great topics. So little time.

Conclusion

The great thing about these social networks like Reddit, Hackernews, and all the others is the fact that we can finally spend all day discussing really really interesting topics instead of fixing them boring bugs our boss wants us to fix. After all, this is IMPORTANT.

Or as Randall Munroe would say: “Duty calls!”

Further reading

If you’re now all hot and ready to discuss things, please consider also reading these very interesting and insightful articles on how to best format and style code:

Or add your own. There’s still much much important writing to do!

A Bit of SQL History


SQL has been around for a while. Just recently, I’ve read this refreshing article about Codd’s Relational Vision – Has NoSQL Come Full Circle? SQL is one of those awesome technologies that was invented almost two generations ago, and we’re still using it in production and for new projects. It is both legacy and state-of-the-art.

In the early 70s, Ingres (which still exists today) was built at the University of California, Berkeley. Parts of Ingres still go strong in today’s RDBMS market under the name of PostgreSQL, and in many other databases. Michael Stonebraker has been involved with these early dinosaurs, and he’s currently promoting NewSQL storage, which isn’t listed in the above chart, unfortunately.

What’s also not listed, unfortunately, are all those various NoSQL storage systems, which will eventually merge into a greater timeline of DBMS, should they survive the upcoming second renaissance of Codd’s vision. Love it or hate it: SQL is a software category that we will deal with for another generation of software engineers!

SQL and NoSQL are Really Just Two Sides of the Same Coin


In a recent debate about NoSQL vs. SQL on Hackernews, I was made aware of a quite amusing paper by Erik Meijer and Gavin Bierman. Remember, Erik Meijer has brought LINQ to the .NET universe, a formidable unified query DSL whose main purpose was to unify typesafe querying against XML, SQL, and object-oriented data structures.

It is important to note that LINQ surfaced shortly before the NoSQL hype really caught on, so NoSQL data structures (e.g. key / value stores, document stores, graph databases) were not yet in scope for LINQ, and LINQ providers might have had to tweak the odd implementation detail to fully match the LINQ API.

What Erik Meijer and Gavin Bierman claim in their article (which was also discussed intensively on Hackernews) is that SQL and NoSQL are duals of each other, i.e. two sides of the same coin. In their quest to unify all query languages, the LINQ people would obviously love to simplify things to such a level. To us, who focus on SQL only via jOOQ, this seems more like a plea for LINQ than anything else. It would be all too nice if things were as easy as a simple duality, specifically given the fact that Erik Meijer has now also created a company called Applied Duality Inc…

What we found very interesting, though, is Erik’s and Gavin’s hint about the relational model and other models inverting the arrows between relationships: where child records point to parent records in the relational model, parent objects point to child objects in object-oriented design, or in XML.
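A tiny illustration of the relational side of that inversion (hypothetical tables, standard SQL):

CREATE TABLE parent (
  id INT PRIMARY KEY
);

CREATE TABLE child (
  id        INT PRIMARY KEY,
  -- The "arrow" points from child to parent via the foreign key;
  -- an object or XML model would point the other way, from parent
  -- to children
  parent_id INT NOT NULL REFERENCES parent (id)
);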

But history will teach us where these things go. I currently don’t see a second E.F. Codd to solve the complexity introduced with the new abundance of NoSQL data stores – yet. But maybe, we’ll eventually remember Erik Meijer as the new E.F. Codd, bringing NoSQL to full circle.

MongoDB “Lightning Fast Aggregation” Challenged with Oracle


What does “Scale” even mean in the context of databases? When talking about scaling, people have jumped to the vendor-induced conclusion that:

  • SQL doesn’t scale
  • NoSQL scales

It is very obvious that NoSQL vendors make such claims. It has also been interesting to see that many NoSQL consumers made the same claims, even if they probably confused SQL in general with MySQL in particular. They then go on comparing MongoDB with MySQL scalability, which totally makes sense, as MySQL is to SQL what MongoDB is to NoSQL.

Let’s get back down to earth…

… because in the last decades, there was no database to beat Oracle on the Transaction Processing Performance Council’s benchmarks. We trust that the CERN people have made an informed decision when they opted for Oracle Exadata and other Oracle products to manage their immensely huge data even before huge data was called Big Data.

So let’s make a quick comparison. Recently, Vlad Mihalcea has blogged about “lightning speed aggregation” (with MongoDB). He took a data set of 50 million records of the form:

created_on           | value
-------------------------------------------
2012-05-02T06:08:47Z | 0.9270193106494844
2012-09-06T22:40:25Z | 0.005334891844540834
2012-06-15T05:58:22Z | 0.05611344985663891
...                  | ...

These are random timestamps and random floats. He then aggregated this data with the following query:

var dataSet = db.randomData.aggregate([
    {
        $group: {
                "_id": {
                    "year" : {
                        $year : "$created_on"
                    },
                    "dayOfYear" : {
                        $dayOfYear : "$created_on"
                    }
                },
                "count": {
                    $sum: 1
                },
                "avg": {
                    $avg: "$value"
                },
                "min": {
                    $min: "$value"
                },
                "max": {
                    $max: "$value"
                }
            }
    },
    {
        $sort: {
            "_id.year" : 1,
            "_id.dayOfYear" : 1
        }
    }
]);

And got a “mind-numbing”, lightning speed aggregation result:

Aggregation took:129.052s

129s for a medium-sized table with 50M records is lightning speed for MongoDB? Alright, we thought. Let’s try this with Oracle. Vlad had the courtesy to provide us with the sample data, which we imported into the following trivial Oracle table:

CREATE TABLESPACE aggregation_test
  DATAFILE 'aggregation_test.dbf'
  SIZE 2000M ONLINE;

CREATE TABLE aggregation_test (
  created_on TIMESTAMP NOT NULL,
  value NUMBER(22, 20) NOT NULL
)
TABLESPACE aggregation_test;

Now, let’s load that data with sqlldr into my single-core licenced Oracle XE 11gR2 instance:

OPTIONS(skip=1)
LOAD DATA
  INFILE randomData.csv
  APPEND
INTO TABLE aggregation_test
  FIELDS TERMINATED BY ','
(
  created_on DATE
    "YYYY-MM-DD\"T\"HH24:MI:SS\"Z\"",
  value TERMINATED BY WHITESPACE
    "to_number(ltrim(rtrim(replace(:value,'.',','))))"
)

And then:

C:\oraclexe\app\oracle\product\11.2.0\server\bin\sqlldr.exe 
  userid=TEST/TEST 
  control=randomData.txt 
  log=randomData.log 
  parallel=true 
  silent=feedback
  bindsize=512000 
  direct=true

Loading the records took a while on my computer with an old disk:

Elapsed time was:     00:03:07.70
CPU time was:         00:01:09.82

I might be able to tune this one way or another, as SQL*Loader isn’t the fastest tool for the job. But let’s challenge the 129s for the aggregation. So far, we haven’t specified any index, but that isn’t necessary, as we’re aggregating the whole table with all 50M records. Let’s do this with the following query, the equivalent of Vlad’s:

SELECT
  EXTRACT(YEAR FROM created_on),
  TO_CHAR(created_on, 'DDD'),
  COUNT(*),
  AVG(value),
  MIN(value),
  MAX(value)
FROM aggregation_test
GROUP BY
  EXTRACT(YEAR FROM created_on),
  TO_CHAR(created_on, 'DDD')
ORDER BY
  EXTRACT(YEAR FROM created_on),
  TO_CHAR(created_on, 'DDD')

This took 32s on my machine with an obvious full table scan. Not impressive. Let’s try Vlad’s other query, filtering on a single hour, which executed in 209ms in his benchmark (which, we believe, is really not fast at all):

var dataSet = db.randomData.aggregate([
    {
        $match: {
            "created_on" : {
                $gte: fromDate,
                $lt : toDate
            }
        }
    },
    {
        $group: {
                "_id": {
                    "year" : {
                        $year : "$created_on"
                    },
                    "dayOfYear" : {
                        $dayOfYear : "$created_on"
                    },
                    "hour" : {
                        $hour : "$created_on"
                    }
                },
                "count": {
                    $sum: 1
                },
                "avg": {
                    $avg: "$value"
                },
                "min": {
                    $min: "$value"
                },
                "max": {
                    $max: "$value"
                }
            }
    },
    {
        $sort: {
            "_id.year" : 1,
            "_id.dayOfYear" : 1,
            "_id.hour" : 1
        }
    }
]);

Vlad generated a random date, which he logged as

Aggregating 
from Mon Jul 16 2012 00:00:00 GMT+0300
  to Mon Jul 16 2012 01:00:00 GMT+0300

So let’s use exactly the same dates. But first, we should create an index on AGGREGATION_TEST(CREATED_ON):

CREATE TABLESPACE aggregation_test_index
  DATAFILE 'aggregation_test_index1.dbf'
  SIZE 2000M ONLINE;

CREATE INDEX idx_created_on
ON aggregation_test(created_on)
TABLESPACE aggregation_test_index;

OK. Now let’s run the Oracle equivalent of Vlad’s query (note, there’s a new column in the GROUP BY, SELECT, ORDER BY clauses):

SELECT
  EXTRACT(YEAR FROM created_on),
  TO_CHAR(created_on, 'DDD'),
  EXTRACT(HOUR FROM created_on),
  COUNT(*),
  AVG(value),
  MIN(value),
  MAX(value)
FROM aggregation_test
WHERE created_on
  BETWEEN TIMESTAMP '2012-07-16 00:00:00.0'
  AND     TIMESTAMP '2012-07-16 01:00:00.0'
GROUP BY
  EXTRACT(YEAR FROM created_on),
  TO_CHAR(created_on, 'DDD'),
  EXTRACT(HOUR FROM created_on)
ORDER BY
  EXTRACT(YEAR FROM created_on),
  TO_CHAR(created_on, 'DDD'),
  EXTRACT(HOUR FROM created_on);

This took 20s in a first run. To be fair, before running this statement, I cleared some caches. I believe that Vlad’s MongoDB was already “warmed up” for the second query:

alter system flush shared_pool;
alter system flush buffer_cache;

Because when I ran the same query again, it took only 0.02 seconds, as Oracle’s buffer cache kicked in, preventing actual disk access. I might have been able to tune my Oracle instance in a way that keeps the whole table in memory from the first query onwards, in which case Oracle would have beaten MongoDB by an order of magnitude.

Another option is to tune indexing. Let’s remove our existing index and replace it with a “covering” index, as follows:

DROP INDEX idx_created_on;

CREATE INDEX idx_created_on_value
ON aggregation_test(created_on, value)
TABLESPACE aggregation_test_index;

Let’s again flush caches:

alter system flush shared_pool;
alter system flush buffer_cache;

And run the query…

  • First execution: 0.5s
  • Second execution: 0.005s

From the execution plan, I can now tell that the query didn’t need to access the table any longer. All relevant data was contained in the index.

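A hedged way to inspect such a plan with Oracle’s standard tooling (not necessarily the exact steps behind the original screenshot):

EXPLAIN PLAN FOR
SELECT
  EXTRACT(YEAR FROM created_on),
  TO_CHAR(created_on, 'DDD'),
  EXTRACT(HOUR FROM created_on),
  COUNT(*), AVG(value), MIN(value), MAX(value)
FROM aggregation_test
WHERE created_on
  BETWEEN TIMESTAMP '2012-07-16 00:00:00.0'
  AND     TIMESTAMP '2012-07-16 01:00:00.0'
GROUP BY
  EXTRACT(YEAR FROM created_on),
  TO_CHAR(created_on, 'DDD'),
  EXTRACT(HOUR FROM created_on);

-- With the covering index in place, the plan shows index access only,
-- and no TABLE ACCESS step:
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);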

Now that starts getting impressive, right?

So, what conclusion can we draw from this?

Here are a couple of findings:

Don’t jump to conclusions

First off, we should be careful about jumping to any conclusion at all. We are comparing apples with oranges. We’d even be comparing apples with oranges if we compared Oracle with SQL Server. Both benchmarks were quickly written, hacked-up benchmarks that do not represent any production situations. What we can say already is that there is no significant winner when aggregating data in a 50M-record table, but that Oracle very quickly starts to perform better, even in an SSD (MongoDB) vs. HDD (Oracle) disk benchmark!

To be fair, my computer has a better CPU than Vlad’s:

(System specification screenshots: my computer vs. Vlad’s computer)

Both SQL and NoSQL can scale

Much of the NoSQL scaling debate is FUD. People always want to believe vendors who tend to claim that the competing product doesn’t do this and that. Don’t buy such claims immediately. Run a smallish benchmark like the above and see for yourself: scaling up is not a problem for either SQL or NoSQL databases. And don’t be blinded by Oracle’s initially “slow” query execution. Oracle is a very sophisticated database that can tune itself according to actual need, thanks to its cost-based optimiser and its statistics.

And nothing keeps you (and your DBA) from using all of your tool’s features. In this example, we have only scratched the surface. For instance, we might apply an IOT (index-organised table) or, if you’re using Oracle Enterprise Edition, table partitioning.
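A hedged sketch of the IOT idea (Oracle syntax; it assumes that (created_on, value) happens to be unique in this data set, since an IOT is physically stored as a B-tree on its primary key):

CREATE TABLE aggregation_test_iot (
  created_on TIMESTAMP NOT NULL,
  value      NUMBER(22, 20) NOT NULL,
  CONSTRAINT pk_aggregation_test_iot
    PRIMARY KEY (created_on, value)
)
ORGANIZATION INDEX;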

50M is not Big Data

CERN has Big Data. Google does. Facebook does. You don’t. 50M is not “Big Data”. It’s just your average database table.

200ms is not fast

Obviously, you might not care too much how long it takes to prepare your offline report on a dedicated batch reporting server. But if you have a real user on a real UI waiting for it, 200ms is not fast, because they might be running 20 of these queries, and there might be 10’000 of these users. In a real OLTP system, having such reporting queries around means that even Oracle’s 0.005s isn’t fast!

So, in other words…

… don’t jump the SQL ship just yet!