MongoDB “Lightning Fast Aggregation” Challenged with Oracle

What does “Scale” even mean in the context of databases? When talking about scaling, people have jumped to the vendor-induced conclusion that:

  • SQL doesn’t scale
  • NoSQL scales

It is very obvious that NoSQL vendors make such claims. It has also been interesting to see that many NoSQL consumers make the same claims, even if they probably confuse SQL in general with MySQL in particular. They then go on comparing MongoDB with MySQL scalability, which totally makes sense, as MySQL is to SQL what MongoDB is to NoSQL.

Let’s get back down to earth…

… because in the last decades, there was no database to beat Oracle on the Transaction Processing Performance Council’s benchmarks. We trust that the CERN people have made an informed decision when they opted for Oracle Exadata and other Oracle products to manage their immensely huge data even before huge data was called Big Data.

So let’s make a quick comparison. Recently, Vlad Mihalcea has blogged about “lightning speed aggregation” (with MongoDB). He took a data set of 50 million records of the form:

created_on           | value
-------------------------------------------
2012-05-02T06:08:47Z | 0.9270193106494844
2012-09-06T22:40:25Z | 0.005334891844540834
2012-06-15T05:58:22Z | 0.05611344985663891
...                  | ...

These are random timestamps and random floats. He then aggregated this data with the following query:

var dataSet = db.randomData.aggregate([
    {
        $group: {
                "_id": {
                    "year" : {
                        $year : "$created_on"
                    },
                    "dayOfYear" : {
                        $dayOfYear : "$created_on"
                    }
                },
                "count": {
                    $sum: 1
                },
                "avg": {
                    $avg: "$value"
                },
                "min": {
                    $min: "$value"
                },
                "max": {
                    $max: "$value"
                }
            }
    },
    {
        $sort: {
            "_id.year" : 1,
            "_id.dayOfYear" : 1
        }
    }
]);

And got a “mind-numbing”, lightning speed aggregation result:

Aggregation took:129.052s

129s for a medium-sized table with 50M records is lightning speed for MongoDB? Alright, we thought. Let’s try this with Oracle. Vlad had the courtesy to provide us with the sample data, which we imported into the following trivial Oracle table:

CREATE TABLESPACE aggregation_test
  DATAFILE 'aggregation_test.dbf'
  SIZE 2000M ONLINE;

CREATE TABLE aggregation_test (
  created_on TIMESTAMP NOT NULL,
  value NUMBER(22, 20) NOT NULL
)
TABLESPACE aggregation_test;
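
As an aside: if you don't have Vlad's CSV at hand, a comparable data set (random 2012 timestamps, random floats between 0 and 1) could also be generated directly in the database. A rough sketch, not the data used in this benchmark:

-- A rough sketch, not the data used in this benchmark: 50M rows of random
-- 2012 timestamps and random floats in [0, 1). Two cross-joined row
-- generators (10,000 x 5,000 rows) avoid a single huge CONNECT BY list.
INSERT /*+ APPEND */ INTO aggregation_test (created_on, value)
SELECT
  TIMESTAMP '2012-01-01 00:00:00'
    + NUMTODSINTERVAL(DBMS_RANDOM.VALUE(0, 366 * 24 * 60 * 60), 'SECOND'),
  DBMS_RANDOM.VALUE
FROM (SELECT ROWNUM r FROM dual CONNECT BY LEVEL <= 10000),
     (SELECT ROWNUM r FROM dual CONNECT BY LEVEL <= 5000);

COMMIT;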

Now, let’s load that data with sqlldr into my single-core licenced Oracle XE 11gR2 instance:

OPTIONS(skip=1)
LOAD DATA
  INFILE randomData.csv
  APPEND
INTO TABLE aggregation_test
  FIELDS TERMINATED BY ','
(
  created_on DATE
    "YYYY-MM-DD\"T\"HH24:MI:SS\"Z\"",
  value TERMINATED BY WHITESPACE
    "to_number(ltrim(rtrim(replace(:value,'.',','))))"
)

And then:

C:\oraclexe\app\oracle\product\11.2.0\server\bin\sqlldr.exe 
  userid=TEST/TEST 
  control=randomData.txt 
  log=randomData.log 
  parallel=true 
  silent=feedback
  bindsize=512000 
  direct=true

Loading the records took a while on my computer with an old disk:

Elapsed time was:     00:03:07.70
CPU time was:         00:01:09.82
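
As a side note, SQL*Loader is not the only way to get the CSV into the database. One alternative (not benchmarked here) would be to expose the file as an external table and load it with a single INSERT .. SELECT. A sketch, assuming a directory object named DATA_DIR pointing to the CSV's folder exists:

-- A sketch, not benchmarked here. Assumption: a directory object DATA_DIR
-- pointing to the folder containing randomData.csv already exists.
CREATE TABLE aggregation_test_ext (
  created_on VARCHAR2(30),
  value      VARCHAR2(40)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY data_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    SKIP 1
    FIELDS TERMINATED BY ','
  )
  LOCATION ('randomData.csv')
);

INSERT /*+ APPEND */ INTO aggregation_test (created_on, value)
SELECT
  TO_TIMESTAMP(created_on, 'YYYY-MM-DD"T"HH24:MI:SS"Z"'),
  -- Assumes '.' is the session's decimal separator; otherwise pass a format
  -- mask and NLS_NUMERIC_CHARACTERS to TO_NUMBER.
  TO_NUMBER(value)
FROM aggregation_test_ext;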

I might be able to tune this one way or another, as SQL*Loader isn’t the fastest tool for the job. But let’s challenge the 129s for the aggregation. So far, we hadn’t specified any index, but that isn’t necessary as we’re aggregating the whole table with all 50M records. Let’s do this with the following equivalent query to Vlad’s:

SELECT
  EXTRACT(YEAR FROM created_on),
  TO_CHAR(created_on, 'DDD'),
  COUNT(*),
  AVG(value),
  MIN(value),
  MAX(value)
FROM aggregation_test
GROUP BY
  EXTRACT(YEAR FROM created_on),
  TO_CHAR(created_on, 'DDD')
ORDER BY
  EXTRACT(YEAR FROM created_on),
  TO_CHAR(created_on, 'DDD')
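
To see which access path Oracle chooses for this, the plan can be inspected as follows (a generic sketch; the actual plan output from this run is not reproduced here):

-- A generic sketch for checking the access path (actual output not shown here):
EXPLAIN PLAN FOR
SELECT
  EXTRACT(YEAR FROM created_on),
  TO_CHAR(created_on, 'DDD'),
  COUNT(*),
  AVG(value),
  MIN(value),
  MAX(value)
FROM aggregation_test
GROUP BY
  EXTRACT(YEAR FROM created_on),
  TO_CHAR(created_on, 'DDD');

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);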

This took 32s on my machine with an obvious full table scan. Not impressive. Let's try Vlad's other query, filtering on a single hour, which executed in 209ms in his benchmark (which, we believe, is really not fast at all):

var dataSet = db.randomData.aggregate([
    {
        $match: {
            "created_on" : {
                $gte: fromDate,
                $lt : toDate
            }
        }
    },
    {
        $group: {
                "_id": {
                    "year" : {
                        $year : "$created_on"
                    },
                    "dayOfYear" : {
                        $dayOfYear : "$created_on"
                    },
                    "hour" : {
                        $hour : "$created_on"
                    }
                },
                "count": {
                    $sum: 1
                },
                "avg": {
                    $avg: "$value"
                },
                "min": {
                    $min: "$value"
                },
                "max": {
                    $max: "$value"
                }
            }
    },
    {
        $sort: {
            "_id.year" : 1,
            "_id.dayOfYear" : 1,
            "_id.hour" : 1
        }
    }
]);

Vlad generated a random date, which he logged as

Aggregating 
from Mon Jul 16 2012 00:00:00 GMT+0300
  to Mon Jul 16 2012 01:00:00 GMT+0300

So let’s use exactly the same dates. But first, we should create an index on AGGREGATION_TEST(CREATED_ON):

CREATE TABLESPACE aggregation_test_index
  DATAFILE 'aggregation_test_index1.dbf'
  SIZE 2000M ONLINE;

CREATE INDEX idx_created_on
ON aggregation_test(created_on)
TABLESPACE aggregation_test_index;
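
Since we have just bulk-loaded 50M rows and created an index, it may also be worth refreshing optimiser statistics, a standard housekeeping step that is not part of the timings below:

-- A standard housekeeping step, not included in the timings below: refresh
-- optimiser statistics so the cost-based optimiser works with accurate figures.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname => 'TEST',              -- the schema used with sqlldr above
    tabname => 'AGGREGATION_TEST',
    cascade => TRUE                 -- also gather statistics on the new index
  );
END;
/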

OK. Now let's run the Oracle equivalent of Vlad's query (note the additional hour expression in the SELECT, GROUP BY, and ORDER BY clauses):

SELECT
  EXTRACT(YEAR FROM created_on),
  TO_CHAR(created_on, 'DDD'),
  EXTRACT(HOUR FROM created_on),
  COUNT(*),
  AVG(value),
  MIN(value),
  MAX(value)
FROM aggregation_test
WHERE created_on
  BETWEEN TIMESTAMP '2012-07-16 00:00:00.0'
  AND     TIMESTAMP '2012-07-16 01:00:00.0'
GROUP BY
  EXTRACT(YEAR FROM created_on),
  TO_CHAR(created_on, 'DDD'),
  EXTRACT(HOUR FROM created_on)
ORDER BY
  EXTRACT(YEAR FROM created_on),
  TO_CHAR(created_on, 'DDD'),
  EXTRACT(HOUR FROM created_on);

This took 20s in a first run. To be fair, before running this statement, I cleared some caches. I believe that Vlad’s MongoDB was already “warmed up” for the second query:

alter system flush shared_pool;
alter system flush buffer_cache;

Because when I ran the same query again, it only took 0.02 seconds as Oracle's buffer cache kicked in, preventing actual disk access. I might have tuned my Oracle instance in such a way as to keep the whole table in memory from the first query, in which case Oracle would have beaten MongoDB by an order of magnitude.
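
One way to nudge Oracle in that direction (a sketch only, and not what was done here) would be to assign the table to the KEEP buffer pool, assuming a pool large enough for it has been configured via DB_KEEP_CACHE_SIZE:

-- A sketch only, not what was done in this benchmark. Assumption: a KEEP
-- buffer pool large enough for the table is configured (DB_KEEP_CACHE_SIZE).
-- Blocks of this table then get their own pool instead of competing with
-- everything else in the default buffer cache.
ALTER TABLE aggregation_test STORAGE (BUFFER_POOL KEEP);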

Another option is to tune indexing. Let's remove our existing index and replace it with a "covering" index, like so:

DROP INDEX idx_created_on;

CREATE INDEX idx_created_on_value
ON aggregation_test(created_on, value)
TABLESPACE aggregation_test_index;

Let’s again flush caches:

alter system flush shared_pool;
alter system flush buffer_cache;

And run the query…

  • First execution: 0.5s
  • Second execution: 0.005s

From the execution plan, I can now tell that the query didn’t need to access the table any longer. All relevant data was contained in the index:

[Execution plan screenshot: index-only access via IDX_CREATED_ON_VALUE, no table access]

Now that starts getting impressive, right?

So, what conclusion can we draw from this?

Here are a couple of findings:

Don’t jump to conclusions

First off, we should be careful about jumping to any conclusion at all. We are comparing apples with oranges. We'd even be comparing apples with oranges when comparing Oracle with SQL Server. Both benchmarks were quickly written, hacked-up benchmarks that do not represent any production situation. What we can say already is that there is no clear winner when aggregating data in a 50M-record table, but Oracle quickly turns out to perform better, even in an SSD (MongoDB) vs. HDD (Oracle) disk benchmark!

To be fair, my computer has a better CPU than Vlad’s:

[System specification screenshots: My Computer vs. Vlad's Computer]

Both SQL and NoSQL can scale

Much of the NoSQL scaling debate is FUD. People always want to believe vendors who tend to say "their product doesn't do this and that". Don't buy it immediately. Run a smallish benchmark like the one above and see for yourself: scaling up is not a problem for either SQL or NoSQL databases. Don't be blinded by Oracle's initial "slow" query execution. Oracle is a very sophisticated database that can tune itself according to actual need, thanks to its cost-based optimiser and its statistics.

And nothing keeps you (and your DBA) from using all of your tool's features. In this example, we have only scratched the surface. For instance, we could use an index-organised table (IOT), as sketched below, or table partitioning if we were using Oracle Enterprise Edition.
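
As a rough sketch of the first idea (and assuming the (created_on, value) pairs are unique in the sample data, since an IOT needs a primary key), an index-organised copy of the table could look like this:

-- A rough sketch, not benchmarked here. Assumption: (created_on, value) is
-- unique in the sample data, as an index-organised table requires a primary
-- key. All row data then lives in the primary key's B-tree itself, so the
-- aggregation can be served from that structure alone.
CREATE TABLE aggregation_test_iot (
  created_on TIMESTAMP NOT NULL,
  value      NUMBER(22, 20) NOT NULL,
  CONSTRAINT pk_aggregation_test_iot PRIMARY KEY (created_on, value)
)
ORGANIZATION INDEX;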

50M is not Big Data

CERN has Big Data. Google does. Facebook does. You don’t. 50M is not “Big Data”. It’s just your average database table.

200ms is not fast

Obviously, you might not care too much how long it takes to prepare your offline report on a dedicated batch reporting server. But if a real user on a real UI is waiting for it, 200ms is not fast, because they might be running 20 of these queries, and there might be 10,000 of these users. In a real OLTP system, having such reporting queries around means that even Oracle's 0.005s isn't fast!

So, in other words…

… don’t jump the SQL ship just yet!

16 thoughts on “MongoDB “Lightning Fast Aggregation” Challenged with Oracle”

  1. Oracle is a great SQL engine, probably the best one, but scaling is not about performance. Scaling horizontally can save you a great deal of money. There are many small companies that need to process large data volumes (TBs) and will take a cost-driven decision.

    While this document was published by 10gen, it’s interesting to compare pricing:

    http://www.fstsummitus.com/media/whitepapers/2013/a_total_-10gen.pdf

    Even if it’s not 70% cheaper and it would just be 25%, it might still be attractive for some business owners.

    1. “but scaling is not about performance. Scaling horizontally can save you a great deal of money”

      Yes, and no. Scaling is about performance to many. But you’re right, scaling is about pricing to many others. That’s also EnterpriseDB’s main point, to get people to switch from Oracle to PostgreSQL (which is a much wiser long-term decision, if costs are a driving factor, in my opinion).

      That's an interesting white paper by 10gen, thanks for sharing. Clearly there will be one or two caveats that they don't mention. The most important one is about maintenance costs in 5-10 years. Not everyone builds software to last, but those who do might be reluctant to put all their data into a store like MongoDB.

  2. I optimized my test and managed to get 0.09s with a cold index on a freshly started MongoDB, which goes down to 0.04s when the index gets warmed up. I will soon post about it. It's hard to push it further, as my sluggish machine is becoming the bottleneck. But the “tuning” experience is what matters most.

        1. Hah, yeah. Well, as others have pointed out, I should reproduce your own benchmark on my own machine to remove all side-effects. I believe that the biggest slowdown on my side was the disk. Would be interesting to see how fast the queries would run on your SSD…

          Anyway, it was a fun comparison of apples and oranges :-)

  3. Just a note, but that second query would not normally need to be cached in Mongo for those results – only indexed.

  4. I am trying to get this “lightning fast” aggregation speed out of MongoDB, and I am constantly failing to do so.
    I use quite simple queries with one $match and grouping by a couple of fields plus one aggregation function.
    Storing the same data structure in a single MySQL table and applying a functionally equivalent SQL query is twice as fast, if not better.
    Not to mention that the disk space occupied by the database is around 30% less in the case of MySQL.
    I still have to check this with big tables (300M records, about 300 GB of data), but I'm not expecting any performance magic there either.

  5. Even 32 seconds for the full scan can be improved upon with a little parallel processing (the test below was also done on an oldish laptop with an aging SSD):

    SQL> select /*+ parallel(t,4) */
      2    extract(year from d),
      3    to_char(d, 'ddd'),
      4    count(*),
      5    avg(val),
      6    min(val),
      7    max(val)
      8  from t
      9  group by
     10    extract(year from d),
     11    to_char(d, 'ddd')
     12  order by 1,2;
    
    EXTRACT(YEARFROMD) TO_   COUNT(*)   AVG(VAL)   MIN(VAL)   MAX(VAL)
    ------------------ --- ---------- ---------- ---------- ----------
                  2012 002      53223 2.59454866 2.00495952  2.9955543
                  2012 003     132706 3.29670618 2.82846786 3.99565143
                  2012 004     188570 3.99640047 3.46416238 4.99899953
                  2012 005     229162 4.64578079 4.00020857 5.99627723
    ...
                  2012 358       1578 37.8137494 37.7890134 37.8419413
                  2012 359       1104 37.8663212 37.8419363 37.8944884
                  2012 360        729 37.9168234 37.8946115 37.9471221
                  2012 361        244 37.9635473 37.9474661 37.9913291
    
    360 rows selected.
    
    Elapsed: 00:00:17.65
    SQL>
    
