“NoSQL Injection” – What 40000 Unsecured MongoDB Databases Mean for our Industry

The news is all over reddit:

Major security alert as 40,000 MongoDB databases left unsecured on the internet

Security is a feature that is often neglected until it’s too late. And when it’s too late, it is often hard to bake it into a well-established architecture without major refactoring efforts.

Every system, and thus every database, is vulnerable. Most databases, however, offer a significant number of features for implementing a security layer – and MongoDB is no different from any other DBMS here. So, how could this massive security hole happen?

Security is a cultural thing. Either a company has security in its DNA, or it doesn’t. The same is true for scalability, user experience, or any other aspect of software engineering. I’ve worked for companies at completely opposite ends of the security-awareness spectrum. Some (in the E-Banking field) were ultra-paranoid, implementing thorough security checks in around 7 layers of the application. Others were rather lenient, with management focusing much more on marketing than anything else. Without any empirical evidence, I did notice a certain correlation between a company’s security awareness and how backend-oriented that company was – E-Banking being a very backend-oriented business.

Backend developers are more security aware

This is an over-generalisation and probably doesn’t do justice to many excellent frontend developers out there, but security is where the data is. Where the algorithms are. Where people reason about constraints, workflows, batch jobs, accounting, money, … algorithms. These folks focus on all the users. On the system. And they want to protect it. On the flip side, they might neglect usability.

There is little security awareness where the user experience is. Where people reason about layout, formatting, usability, style, … user interfaces. These folks focus on single users. On their experience. And they want to make things easy for the user.

(and again, the same is true for scalability)

It is no coincidence that backend technology evolves extremely slowly. Java: 20 years and we’ve just finally gotten lambdas. SQL: 30 years and we still don’t have easy ways to reuse code.

At the same time, frontend technology evolves at the “speed of reddit”. The next hype is just 100 karma away, and we’ll throw all the previous tech out of the window, just to be part of the game.

Clearly, security is something that has to be reasoned about way too thoroughly for it to survive in the fast-paced frontend world.

What does MongoDB have to do with it?

The current event isn’t actually directly related to MongoDB (you could probably find just as many unprotected MySQL instances out there). But it strongly correlates with MongoDB’s sales and marketing strategies. MongoDB has done very aggressive and successful marketing in the past, claiming that the reign of the RDBMS is over – just as much as the reign of the RDBMS had been declared over before, when the astonishing object databases surfaced on this planet. Well, we all know where object and XML databases went.


This time, the anti-RDBMS marketing resonated mostly with frontend developers, obviously, because JSON is their favourite data representation format, and MongoDB promised to be able to store data directly from the DOM into the DB. Not only did this mean “the end of the DBA” for some software vendors; many also hoped that they could omit operations, and perhaps even backend development. What obviously works well for prototyping and simple applications doesn’t scale well to applications with sensitive data.

The Solution

The solution is obvious. Homogeneity kills your business. You should hire a variety of different types of personnel: skilled frontend developers, backend developers, operations people, DBAs, and security experts. You should make them all work together, hear each of their opinions, have them review each other’s code, and let them learn from each other. Because each one of them has a strong focus on an entirely different, yet equally important aspect of your application.

Do not neglect any of these aspects. Because if you do, and if it’s security, and if you lose sensitive customer data – well, you’re not going to stay in business; you’ll be sued in court.

Got hooked on the security topic?

Continue reading about …

Stop Claiming that you’re Using a Schemaless Database

One of MongoDB’s arguments when evangelising their product is the fact that MongoDB is a “schemaless” database:

Why Schemaless?

MongoDB is a JSON-style data store. The documents stored in the database can have varying sets of fields, with different types for each field.

And that’s true. But it doesn’t mean that there is no schema. There are in fact various schemas:

  • The one in your head when you designed the data structures
  • The one that your database really implemented to store your data structures
  • The one you should have implemented to fulfill your requirements

Every time you realise that you made a mistake (see point three above), or when your requirements change, you will need to migrate your data. Let’s review MongoDB’s point of view here again:

With a schemaless database, 90% of the time adjustments to the database become transparent and automatic. For example, if we wish to add GPA to the student objects, we add the attribute, resave, and all is well — if we look up an existing student and reference GPA, we just get back null. Further, if we roll back our code, the new GPA fields in the existing objects are unlikely to cause problems if our code was well written.

Everything above is true as well.

“Schema-less” vs. “Schema-ful”

But let’s translate this to SQL (or use any other “schema-ful” database instead):

ALTER TABLE student ADD gpa VARCHAR(10);

And we’re done! Gee, we’ve added a column, and we’ve added it to ALL rows. It was transparent. It was automatic. We “just get back null” on existing students. And we can even “roll back our code”:

ALTER TABLE student DROP gpa;

Not only are the existing objects unlikely to cause problems, we have actually rolled back our code AND database.
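
If you want to convince yourself of the “just get back null” claim, here’s a quick sanity check against the student table from above (a sketch; Oracle syntax assumed):

-- After the ADD: every pre-existing row reports NULL for the new column
SELECT COUNT(*) AS students_without_gpa
FROM student
WHERE gpa IS NULL;

-- After the DROP: referencing the dropped column fails at parse time
-- (in Oracle with ORA-00904), so no stale gpa value can ever leak out
SELECT gpa FROM student;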

Let’s summarise:

  • We can do exactly the same in “schema-less” databases as we can in “schema-ful” ones
  • We guarantee that a migration takes place (and it’s instant, too)
  • We guarantee data integrity when we roll back the change

What about more real-world DDL?

Of course, at the beginning of a project, when it still resembles the typical cat/dog/pet-shop or book/author/library sample application, we’ll just be adding columns. But what happens if we need to change a student-teacher 1:N relationship into a student-teacher M:N relationship? Suddenly, everything changes, and not only will the relational data model prove far superior to a hierarchical one that just yields tons of data duplication, it’ll also be moderately easy to migrate, and the outcome is guaranteed to be correct and tidy!

CREATE TABLE student_to_teacher 
AS
SELECT id AS student_id, teacher_id
FROM student;

ALTER TABLE student DROP teacher_id;

… and we’re done! (of course, we’d be adding constraints and indexes)
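
For completeness, here’s a sketch of what those constraints and indexes might look like. The constraint and index names are made up, and we assume student and teacher tables each with an id primary key:

ALTER TABLE student_to_teacher
  ADD CONSTRAINT pk_student_to_teacher
  PRIMARY KEY (student_id, teacher_id);

-- Referential integrity in both directions
ALTER TABLE student_to_teacher
  ADD CONSTRAINT fk_stt_student
  FOREIGN KEY (student_id) REFERENCES student (id);

ALTER TABLE student_to_teacher
  ADD CONSTRAINT fk_stt_teacher
  FOREIGN KEY (teacher_id) REFERENCES teacher (id);

-- Support lookups in the teacher-to-student direction
CREATE INDEX idx_stt_teacher ON student_to_teacher (teacher_id);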

Think about the tedious task of transforming your old JSON into the new JSON. You don’t even have XSLT or XQuery for the task, only JavaScript!

Let’s face the truth

Schemalessness is about as misleading a term as NoSQL.

And again, MongoDB’s blog post is telling the truth (and an interesting one, too):

Generally, there is a direct analogy between this “schemaless” style and dynamically typed languages. Constructs such as those above are easy to represent in PHP, Python and Ruby. What we are trying to do here is make this mapping to the database natural.

When you say “schemaless”, you actually mean “dynamically typed schema” – as opposed to the statically typed schemas available in SQL databases. JSON is still a completely schema-free data structure standard, as opposed to XML, which allows you to specify an XSD if you need one, or to operate on document-oriented, “schema-less” (i.e. dynamically typed) schemas.

(And don’t say there’s json-schema. That’s a ridiculous attempt to mimic XSD.)

This is important to understand! You always have a schema, even if you don’t statically type it. If you’re writing JavaScript, you still have types, which you have to be fully aware of in your mental model of the code. Except that there’s no compiler (or IDE) that can help you infer the types with 100% certainty.

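To illustrate the contrast, here’s a minimal sketch in SQL (the table and values are made up): a statically typed schema pushes back immediately, instead of letting a mistyped value slip through for some later read to trip over.

CREATE TABLE student (
  id   NUMBER(18)    NOT NULL PRIMARY KEY,
  name VARCHAR2(100) NOT NULL
);

-- Fails immediately (in Oracle with ORA-01722: invalid number),
-- because 'not a number' cannot be converted to the NUMBER type
INSERT INTO student (id, name)
VALUES ('not a number', 'Alice');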

So, there’s absolutely nothing that is really easier with “schemaless” databases than with “schemaful” ones. You just defer the inevitable work of sanitising your schema to some later time – a time when you might care more than you do today, or a time when you’re lucky enough to have a new job, and someone else does the work for you. You might have believed MongoDB when they said that “objects are unlikely to cause problems”.

But let me tell you the ugly truth:

Anything that can possibly go wrong, does

Murphy

We wish you good luck with your dynamically typed languages and your dynamically typed database schemas – while we’ll stick with type safe SQL.

jOOQ: The best way to write SQL in Java

The DBMS of the Year 2013

We have recently blogged about the DB-Engines Ranking and how MongoDB was the only NoSQL store to make it into that ranking’s top 10. Today, this marketing platform offered by solid IT has announced MongoDB to be the DBMS of the year 2013, with PostgreSQL being a close runner-up, followed by Cassandra.

solid IT as a company is slightly biased towards NoSQL, so it isn’t surprising that two NoSQL databases made their top ranking, and that the only successful ORDBMS in the market is number two. As we ourselves are “slightly” biased towards SQL, we would like to announce our own DBMS of the year 2013:

SQL Server is the DBMS of the year 2013

… because its SQL dialect, Transact-SQL (which Microsoft “shares” with Sybase), is the first SQL-based programming language to make it into TIOBE’s top 10 programming languages.

Congratulations to SQL Server from the jOOQ team!

MongoDB “Lightning Fast Aggregation” Challenged with Oracle

What does “Scale” even mean in the context of databases? When talking about scaling, people have jumped to the vendor-induced conclusion that:

  • SQL doesn’t scale
  • NoSQL scales

It is very obvious that NoSQL vendors make such claims. It has also been interesting to see that many NoSQL consumers made such claims, even if they probably confused SQL in general with MySQL in particular. They then go on to compare MongoDB’s scalability with MySQL’s, which totally makes sense, as MySQL is to SQL what MongoDB is to NoSQL.

Let’s get back down to earth…

… because over the last decades, no database has beaten Oracle on the Transaction Processing Performance Council’s benchmarks. We trust that the CERN people made an informed decision when they opted for Oracle Exadata and other Oracle products to manage their immensely huge data, even before huge data was called Big Data.

So let’s make a quick comparison. Recently, Vlad Mihalcea has blogged about “lightning speed aggregation” (with MongoDB). He took a data set of 50 million records of the form:

created_on           | value
-------------------------------------------
2012-05-02T06:08:47Z | 0.9270193106494844
2012-09-06T22:40:25Z | 0.005334891844540834
2012-06-15T05:58:22Z | 0.05611344985663891
...                  | ...

These are random timestamps and random floats. He then aggregated this data with the following query:

var dataSet = db.randomData.aggregate([
    {
        $group: {
                "_id": {
                    "year" : {
                        $year : "$created_on"
                    },
                    "dayOfYear" : {
                        $dayOfYear : "$created_on"
                    }
                },
                "count": {
                    $sum: 1
                },
                "avg": {
                    $avg: "$value"
                },
                "min": {
                    $min: "$value"
                },
                "max": {
                    $max: "$value"
                }
            }
    },
    {
        $sort: {
            "_id.year" : 1,
            "_id.dayOfYear" : 1
        }
    }
]);

And got a “mind-numbing”, lightning speed aggregation result:

Aggregation took:129.052s

129s for a medium-sized table with 50M records is lightning speed for MongoDB? Alright, we thought. Let’s try this with Oracle. Vlad had the courtesy to provide us with the sample data, which we imported into the following trivial Oracle table:

CREATE TABLESPACE aggregation_test
  DATAFILE 'aggregation_test.dbf'
  SIZE 2000M ONLINE;

CREATE TABLE aggregation_test (
  created_on TIMESTAMP NOT NULL,
  value NUMBER(22, 20) NOT NULL
)
TABLESPACE aggregation_test;
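
As an aside: if you don’t have Vlad’s CSV file at hand, comparable random data can be generated directly in Oracle. The following is my own sketch, not Vlad’s original generator; it produces one million rows per execution, so run it repeatedly (or raise the level limit) to reach 50M:

INSERT /*+ APPEND */ INTO aggregation_test (created_on, value)
SELECT
  -- Random timestamps somewhere in 2012, like the sample data
  TIMESTAMP '2012-01-01 00:00:00'
    + NUMTODSINTERVAL(DBMS_RANDOM.VALUE(0, 366 * 86400), 'SECOND'),
  -- Random floats in [0, 1)
  DBMS_RANDOM.VALUE
FROM dual
CONNECT BY level <= 1000000;

COMMIT;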

Now, let’s load that data with sqlldr into my single-core licenced Oracle XE 11gR2 instance:

-- Skip the CSV header row
OPTIONS(skip=1)
LOAD DATA
  INFILE randomData.csv
  APPEND
INTO TABLE aggregation_test
  FIELDS TERMINATED BY ','
(
  -- Parse timestamps of the form 2012-05-02T06:08:47Z
  created_on DATE
    "YYYY-MM-DD\"T\"HH24:MI:SS\"Z\"",
  -- Convert the decimal point to this machine's decimal separator (comma)
  value TERMINATED BY WHITESPACE
    "to_number(ltrim(rtrim(replace(:value,'.',','))))"
)

And then:

C:\oraclexe\app\oracle\product\11.2.0\server\bin\sqlldr.exe 
  userid=TEST/TEST 
  control=randomData.txt 
  log=randomData.log 
  parallel=true 
  silent=feedback
  bindsize=512000 
  direct=true

Loading the records took a while on my computer with an old disk:

Elapsed time was:     00:03:07.70
CPU time was:         00:01:09.82

I might be able to tune this one way or another, as SQL*Loader isn’t the fastest tool for the job. But let’s challenge the 129s for the aggregation. So far, we haven’t specified any index, but that isn’t necessary, as we’re aggregating the whole table with all 50M records. Let’s do this with the following query, which is equivalent to Vlad’s:

SELECT
  EXTRACT(YEAR FROM created_on),
  TO_CHAR(created_on, 'DDD'),
  COUNT(*),
  AVG(value),
  MIN(value),
  MAX(value)
FROM aggregation_test
GROUP BY
  EXTRACT(YEAR FROM created_on),
  TO_CHAR(created_on, 'DDD')
ORDER BY
  EXTRACT(YEAR FROM created_on),
  TO_CHAR(created_on, 'DDD')
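
If you want to time such statements yourself: in SQL*Plus, switching timing on prints the elapsed time after each statement. This is a simple sketch, not necessarily how every number in this post was measured:

SET TIMING ON

-- ... now run the aggregation query above; SQL*Plus will print
-- something like "Elapsed: 00:00:32.00" after the result set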

This took 32s on my machine, with an obvious full table scan. Not impressive. Let’s try Vlad’s other query, filtering on a single hour, which executed in 209ms in his benchmark (which, we believe, is really not fast at all):

var dataSet = db.randomData.aggregate([
    {
        $match: {
            "created_on" : {
                $gte: fromDate,
                $lt : toDate
            }
        }
    },
    {
        $group: {
                "_id": {
                    "year" : {
                        $year : "$created_on"
                    },
                    "dayOfYear" : {
                        $dayOfYear : "$created_on"
                    },
                    "hour" : {
                        $hour : "$created_on"
                    }
                },
                "count": {
                    $sum: 1
                },
                "avg": {
                    $avg: "$value"
                },
                "min": {
                    $min: "$value"
                },
                "max": {
                    $max: "$value"
                }
            }
    },
    {
        $sort: {
            "_id.year" : 1,
            "_id.dayOfYear" : 1,
            "_id.hour" : 1
        }
    }
]);

Vlad generated a random date, which he logged as

Aggregating 
from Mon Jul 16 2012 00:00:00 GMT+0300
  to Mon Jul 16 2012 01:00:00 GMT+0300

So let’s use exactly the same dates. But first, we should create an index on AGGREGATION_TEST(CREATED_ON):

CREATE TABLESPACE aggregation_test_index
  DATAFILE 'aggregation_test_index1.dbf'
  SIZE 2000M ONLINE;

CREATE INDEX idx_created_on
ON aggregation_test(created_on)
TABLESPACE aggregation_test_index;

OK. Now let’s run the Oracle equivalent of Vlad’s query (note that there’s a new expression in the SELECT, GROUP BY, and ORDER BY clauses; also, unlike Vlad’s half-open $gte / $lt range, BETWEEN includes its upper bound – a negligible difference for this benchmark):

SELECT
  EXTRACT(YEAR FROM created_on),
  TO_CHAR(created_on, 'DDD'),
  EXTRACT(HOUR FROM created_on),
  COUNT(*),
  AVG(value),
  MIN(value),
  MAX(value)
FROM aggregation_test
WHERE created_on
  BETWEEN TIMESTAMP '2012-07-16 00:00:00.0'
  AND     TIMESTAMP '2012-07-16 01:00:00.0'
GROUP BY
  EXTRACT(YEAR FROM created_on),
  TO_CHAR(created_on, 'DDD'),
  EXTRACT(HOUR FROM created_on)
ORDER BY
  EXTRACT(YEAR FROM created_on),
  TO_CHAR(created_on, 'DDD'),
  EXTRACT(HOUR FROM created_on);

This took 20s on a first run. To be fair, before running this statement, I cleared some caches, whereas I believe that Vlad’s MongoDB was already “warmed up” for the second query:

alter system flush shared_pool;
alter system flush buffer_cache;

Because when I ran the same query again, it took only 0.02 seconds, as Oracle’s buffer cache kicked in, preventing actual disk access. I might have tuned my Oracle instance to keep the whole table in memory from the first query onwards, in which case Oracle would have beaten MongoDB by an order of magnitude.

Another option is to tune indexing. Let’s remove our existing index and replace it with a “covering” index, like this:

DROP INDEX idx_created_on;

CREATE INDEX idx_created_on_value
ON aggregation_test(created_on, value)
TABLESPACE aggregation_test_index;

Let’s again flush caches:

alter system flush shared_pool;
alter system flush buffer_cache;

And run the query…

  • First execution: 0.5s
  • Second execution: 0.005s

From the execution plan, I can now tell that the query didn’t need to access the table any longer. All relevant data was contained in the index:

(Execution plan screenshot: index-only access, no table access.)
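
If you want to inspect such a plan yourself, Oracle’s standard tooling renders it as text. A sketch (the comments describe what to look for, not a verbatim capture of my output):

EXPLAIN PLAN FOR
SELECT
  EXTRACT(YEAR FROM created_on),
  TO_CHAR(created_on, 'DDD'),
  EXTRACT(HOUR FROM created_on),
  COUNT(*),
  AVG(value),
  MIN(value),
  MAX(value)
FROM aggregation_test
WHERE created_on
  BETWEEN TIMESTAMP '2012-07-16 00:00:00.0'
  AND     TIMESTAMP '2012-07-16 01:00:00.0'
GROUP BY
  EXTRACT(YEAR FROM created_on),
  TO_CHAR(created_on, 'DDD'),
  EXTRACT(HOUR FROM created_on);

-- Render the plan. With the covering index in place, expect an
-- INDEX RANGE SCAN on IDX_CREATED_ON_VALUE and no TABLE ACCESS step.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);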

Now that starts getting impressive, right?

So, what conclusion can we draw from this?

Here are a couple of findings:

Don’t jump to conclusions

First off, we should be careful about jumping to any conclusion at all. We are comparing apples with oranges – we’d even be comparing apples with oranges if we compared Oracle with SQL Server. Both benchmarks were quickly written, hacked-up benchmarks that do not represent any production situation. What we can say already is that there is no clear winner when aggregating data in a 50M-record table, but Oracle very quickly seems to perform better, even though MongoDB ran on an SSD and Oracle on an HDD!

To be fair, my computer has a better CPU than Vlad’s:

(System specification screenshots: my computer vs. Vlad’s computer.)

Both SQL and NoSQL can scale

Much of the NoSQL scaling debate is FUD. People always want to believe vendors who claim that the other product “doesn’t do this and that”. Don’t buy it immediately. Run a smallish benchmark like the one above and see for yourself: scaling up is not a problem for either SQL or NoSQL databases. Don’t be blinded by Oracle’s initially “slow” query execution. Oracle is a very sophisticated database that can tune itself according to actual need, thanks to its cost-based optimiser and its statistics.

And nothing keeps you (and your DBA) from using all of your tool’s features. In this example, we have only scratched the surface. For instance, we might apply index-organised tables (IOTs) or, on Oracle Enterprise Edition, table partitioning, as sketched below.
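
To give an idea, here are minimal sketches of both. All names are made up, and range partitioning requires the Partitioning option on Oracle Enterprise Edition:

-- Index-organised table: rows are stored in the primary key index itself.
-- IOTs require a primary key; with truly random data, we assume
-- (created_on, value) to be unique here.
CREATE TABLE aggregation_test_iot (
  created_on TIMESTAMP      NOT NULL,
  value      NUMBER(22, 20) NOT NULL,
  CONSTRAINT pk_aggregation_test_iot
    PRIMARY KEY (created_on, value)
)
ORGANIZATION INDEX;

-- Range partitioning by year: date-filtered scans touch only one partition
CREATE TABLE aggregation_test_part (
  created_on TIMESTAMP      NOT NULL,
  value      NUMBER(22, 20) NOT NULL
)
PARTITION BY RANGE (created_on) (
  PARTITION p2012 VALUES LESS THAN (TIMESTAMP '2013-01-01 00:00:00'),
  PARTITION p2013 VALUES LESS THAN (TIMESTAMP '2014-01-01 00:00:00'),
  PARTITION pmax  VALUES LESS THAN (MAXVALUE)
);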

50M is not Big Data

CERN has Big Data. Google does. Facebook does. You don’t. 50M is not “Big Data”. It’s just your average database table.

200ms is not fast

Obviously, you might not care too much how long it takes to prepare your offline report on a dedicated batch reporting server. But if you have a real user on a real UI waiting for it, 200ms is not fast, because they might be running 20 of these queries (that’s 4 seconds of query time per user), and there might be 10,000 of these users. In a real OLTP system, having such reporting queries around means that even Oracle’s 0.005s isn’t fast!

So, in other words…

… don’t jump the SQL ship just yet!

jOOQ Newsletter October 10, 2013

Subscribe to this newsletter here.

jOOQ 3.2 Released

After a bit of time, jOOQ 3.2 has finally been released. This interesting release mainly includes two new SPIs (Service Provider Interfaces), which allow for:

  • Injecting pre- and post-CRUD operation behaviour, which is useful for things like global ID generators.
  • Injecting behaviour into the SQL rendering lifecycle, allowing for arbitrary SQL transformations. This is very useful for things like shared-schema multi-tenancy.

Besides that, the code generator has seen lots of improvements. More details can be found in the release notes.

jOOQ Licensing and Support

With the new jOOQ 3.2, apart from introducing great new features, we are changing quite a few things about how we operate. At Data Geekery GmbH, we believe in Open Source. But we also believe in the creative power enabled by commercial software. This is why we have chosen to implement a dual-licensing strategy. Read more about this strategy here:

https://blog.jooq.org/2013/10/09/jooq-3-2-offering-commercial-licensing-and-support

SQL2jOOQ

A very interesting open source tool is being developed by a third-party vendor called GUDU Soft in cooperation with Data Geekery: SQL2jOOQ. This tool is based on GUDU Soft’s SQL Parser application, which can parse a variety of text-based SQL strings, transform ASTs and render new dialects. One of these output dialects is jOOQ. For jOOQ developers migrating large amounts of legacy SQL to jOOQ, this will be an invaluable tool in the tool chain.

MongoDB and NoSQL Heat

MongoDB was in the news big time last week when they announced that they had raised $150 million in a venture-funding round. While cynics claimed that they failed to write the transaction to disk, this is still a very important milestone for 10gen. After all, MongoDB is the only non-relational DBMS figuring in a quite objective top-10 ranking that considers 194 systems.

To SQL or not to SQL? Our take from the jOOQ perspective is clear. Any competitor in a market dominated by Oracle and SQL Server is good, even if (or maybe because) it is a non-SQL vendor:

https://blog.jooq.org/2013/10/02/the-premature-return-to-sql/

On MongoDB’s Success. Or Do Not Let Cynicism Kill Your Spirit

Success is a strange beast. On the bright side, and true to entrepreneurial spirit, people admire those who are obviously successful. In the theory of our mostly capitalist society, successful people worked hard for their success, and thus deserve it. In practice, there may be other mechanisms of success than working hard, but that’s another story, and quite off-topic on this blog.

MongoDB is successful. Extremely successful: they have just raised $150M in a venture-funding round. They must be doing something right, at least from a marketing and sales perspective. As I admire this success, and as I am hoping to be similarly successful with jOOQ, I threw the story in at reddit for discussion, curious what other business-oriented people would think of it. But as always when talking about MongoDB, cynicism arose. Heavily. See the following examples:

(Six reddit screenshots omitted: three examples of cynicism about the tech quality, three about getting rich.)

This was really disappointing to me. Is this just reddit being full of trolls? Or are people heavily envious when someone stands out and is successful? You may think about MongoDB whatever you want, but there is no doubt that this valuation and venture-funding round is proof of MongoDB’s continuous success over the past years. Apparently, they are responding to a real market value with real market expectations. Raising this money will hopefully help them meet those expectations even better in the future.

Do not let cynicism kill your spirit

No matter whether this is just reddit or the “IT community” in general, it is not worth delving into such cynical discussions. I congratulate 10gen on being the only NoSQL database vendor in the DBMS popularity ranking that I’ve previously posted:

DBMS ranking. Reproduced with permission of DB-Engines.com

And I most certainly congratulate 10gen on this incredible milestone in their company history.

The 10 Most Popular DB Engines (SQL and NoSQL)

How do you objectively measure the popularity of a DB engine? Good question! An Austrian company, solid IT, claims to have the answer. The company focuses on “Big Data and NoSQL”, but this focus does not seem to have biased the result of the measurement. Among the top 10 database engines, MongoDB is the only one that is not an RDBMS. And it’s astonishing just how popular MongoDB seems to be (then again, they must be doing something right)!

Reproduced with permission of DB-Engines.com

Now, I’m not surprised by the top 3. I am definitely surprised that PostgreSQL and SQLite are not more popular. I am also surprised that there aren’t more “wide-column stores” among the top 10. Maybe Michael Stonebraker has to review his claims about the traditional RDBMS wisdom being all wrong?

And what about the other databases supported by jOOQ? Where are the Java databases? Here’s a condensed view of the ranking, consisting only of the 15 databases currently supported by jOOQ 3.1:


Reproduced with permission of DB-Engines.com

It turns out that the Java databases (Derby, H2, HyperSQL) are not so popular, compared to all the others. It also turns out that MariaDB still has a lot of ground to gain, compared to MySQL.

The ranking considers a lot of data from various somewhat authoritative sources as is explained here. These include:

  • Number of mentions of the system on websites. Measured through search engine results.
  • General interest in the system. Measured through Google Trends.
  • Frequency of technical discussions about the system. Measured through Stack Overflow and similar.
  • Number of job offers, in which the system is mentioned. Measured through Indeed and similar.
  • Number of profiles in professional networks, in which the system is mentioned. Measured through LinkedIn.

This ranking is certainly something to keep an eye on!