Stop Claiming that you’re Using a Schemaless Database


One of MongoDB’s arguments when evangelising MongoDB is the fact that MongoDB is a “schemaless” database:

Why Schemaless?

MongoDB is a JSON-style data store. The documents stored in the database can have varying sets of fields, with different types for each field.

And that’s true. But it doesn’t mean that there is no schema. There are in fact various schemas:

  • The one in your head when you designed the data structures
  • The one that your database really implemented to store your data structures
  • The one you should have implemented to fulfill your requirements

Every time you realise that you made a mistake (see point three above), or when your requirements change, you will need to migrate your data. Let’s review again MongoDB’s point of view here:

With a schemaless database, 90% of the time adjustments to the database become transparent and automatic. For example, if we wish to add GPA to the student objects, we add the attribute, resave, and all is well — if we look up an existing student and reference GPA, we just get back null. Further, if we roll back our code, the new GPA fields in the existing objects are unlikely to cause problems if our code was well written.

Everything above is true as well.

“Schema-less” vs. “Schema-ful”

But let’s translate this to SQL (or use any other “schema-ful” database instead):

ALTER TABLE student ADD gpa VARCHAR(10);

And we’re done! Gee, we’ve added a column, and we’ve added it to ALL rows. It was transparent. It was automatic. We “just get back null” on existing students. And we can even “roll back our code”:

ALTER TABLE student DROP gpa;

Not only are the existing objects unlikely to cause problems, we have actually rolled back our code AND database.

Let’s summarise:

  • We can do exactly the same in “schema-less” databases as we can in “schema-ful” ones
  • We guarantee that a migration takes place (and it’s instant, too)
  • We guarantee data integrity when we roll back the change

What about more real-world DDL?

Of course, at the beginning of projects, when they still resemble the typical cat/dog/pet-shop, book/author/library sample application, we’ll just be adding columns. But what happens if we need to change the student-teacher 1:N relationship into a student-teacher M:N relationship? Suddenly, everything changes, and not only will the relational data model prove far superior to a hierarchical one that just yields tons of data duplication, it’ll also be moderately easy to migrate, and the outcome is guaranteed to be correct and tidy!

CREATE TABLE student_to_teacher 
AS
SELECT id AS student_id, teacher_id
FROM student;

ALTER TABLE student DROP teacher_id;

… and we’re done! (of course, we’d be adding constraints and indexes)

Think about the tedious task that you’ll have transforming your JSON to the new JSON. You don’t even have XSLT or XQuery for the task, only JavaScript!

Let’s face the truth

Schemalessness is about a misleading term as much as NoSQL is:

And again, MongoDB’s blog post is telling the truth (and an interesting one, too):

Generally, there is a direct analogy between this “schemaless” style and dynamically typed languages. Constructs such as those above are easy to represent in PHP, Python and Ruby. What we are trying to do here is make this mapping to the database natural.

When you say “schemaless”, you actually say “dynamically typed schema” – as opposed to statically typed schemas as they are available from SQL databases. JSON is still a completely schema free data structure standard, as opposed to XML which allows you to specify XSD if you need, or operate on document-oriented, “schema-less” (i.e. dynamically typed) schemas.

(And don’t say there’s json-schema. That’s a ridiculous attempt to mimick XSD)

This is important to understand! You always have a schema, even if you don’t statically type it. If you’re writing JavaScript, you still have types, which you have to be fully aware of in your mental model of the code. Except that there’s no compiler (or IDE) that can help you infer the types with 100% certainty.

An example:

… and more:

So, there’s absolutely nothing that is really easier with “schemaless” databases than with “schemaful” ones. You just defer the inevitable work of sanitising your schema to some other later time, a time when you might care more than today, or a time when you’re lucky enough to have a new job and someone else does the work for you. You might have believed MongoDB, when they said that “objects are unlikely to cause problems”.

But let me tell you the ugly truth:

Anything that can possibly go wrong, does

Murphy

We wish you good luck with your dynamically typed languages and your dynamically typed database schemas – while we’ll stick with type safe SQL.

jOOQ: The best way to write SQL in Java

24 thoughts on “Stop Claiming that you’re Using a Schemaless Database

  1. But let’s translate this to SQL (or use any other “schema-ful” database instead): ALTER TABLE student ADD gpa VARCHAR(10);

    You’re cheating. The equivalent of this DDL statement is NoSql is “”, not what’s mentioned in the quote. The other things like adding fields to queries and objects, etc., are equal, the DDL is an additional work. That said, I agree that it’s worth it.

    You don’t even have XSLT or XQuery for the task, only JavaScript!

    Why would anyone working in a different language use JavaScript for manipulating JSON? JavaScript is a joke, but JSON consists of objects and arrays/lists, which exists in practically any language.

    And don’t say there’s json-schema. That’s a ridiculous attempt to mimick XSD.

    What’s wrong with json-schema?

    • You’re cheating. The equivalent of this DDL statement is NoSql is “”, not what’s mentioned in the quote. The other things like adding fields to queries and objects, etc., are equal, the DDL is an additional work. That said, I agree that it’s worth it.

      Technically you’re right. But in the context of the argument, which claims that there is always a schema, even if your database doesn’t enforce it, the “DDL” is really performed in the application that accesses the database (and in all the other 20 applications that also access the database). True, that MongoDB doesn’t care about the DDL, and it is thus “” from a MongoDB perspective. But that doesn’t mean that you won’t eventually add a gpa column to your “schema”

      Why would anyone working in a different language use JavaScript for manipulating JSON? JavaScript is a joke, but JSON consists of objects and arrays/lists, which exists in practically any language.

      I was trying to say that if people had been using “schemaless” XML, they could at least use powerful tools around XML to transform / migrate the data. With JSON, there is hardly any powerful data transformation language.

      What’s wrong with json-schema?

      Reading through the IETF documents, I feel that every bit of this whole JSON thing is just there to mimick parts of XML in a very clumsy way based on the premise that XML is “too complex”, and that we somehow needed something simpler. This isn’t very specific criticism from my side, nor do I have any productive experience with json-schema. But once I see json-schema in a project, I sincerely wonder why JSON was preferred over XML.

      • > I was trying to say that if people had been using “schemaless” XML, they could at least use powerful tools around XML to transform / migrate the data. With JSON, there is hardly any powerful data transformation language.

        I never needed this, but why couldn’t I use XSLT? A transformation from JSON to XML is pretty trivial (and could be written with a stack and few regexes if there weren’t so many implementation available).

        > This isn’t very specific criticism from my side, nor do I have any productive experience with json-schema.

        And neither do I, so this probably leads to nowhere.

        > But once I see json-schema in a project, I sincerely wonder why JSON was preferred over XML.

        Whenever I see an XSD, I see a bloatware. It’s not exactly XML’s fault, but it seems to attract redundancy lovers. Creating namespaces for no reason at all, adding deeply nested attributes just for fun, adding attributes with a single allowed value, which won’t be expanded in a newer version as the never version uses a different schema with a new field, etc.

        My main objections against XML are readability and redundancy (Attributes are nothing but poor man’s elements, helpful to reduce verbosity, but there’s nothing like this in objects. And when used for settings, it may need to be expanded into an element later, and that creates needless problems).

        • I never needed this, but why couldn’t I use XSLT? A transformation from JSON to XML is pretty trivial (and could be written with a stack and few regexes if there weren’t so many implementation available).

          Well, you’d certainly have to accompany this “XSLT” implementation with a nice documentation listing all the 80% of XSLT features that don’t really work:-)

          And neither do I, so this probably leads to nowhere.

          😉

          OK then, so let’s stop here.

          • Just a short comment.

            > Well, you’d certainly have to accompany this “XSLT” implementation with a nice documentation listing all the 80% of XSLT features that don’t really work

            Obviously, nothing reading or producing attributes etc. will work. But no JSON user will see it as a non-working feature, or a feature at all; it just doesn’t apply.

  2. In MongoDB one document is not necessarily tied to any other document contained in the very same collection. That’s what it makes it schemaless. In RDBMS all rows must obey the same rules, so there’s a schema dictating each individual row structure.

    Both approaches have advantages and disadvantages. I’ll tell you of an use case where we had to use both RDBMS and MongoDB. We had a business case where the user could create an multi-step Order, but we wanted to allow him to pre-save any unfinished Order. So if the user has only completed just a few steps but hasn’t yet finished the Order we wanted to save the Order draft for him to continue it later. The final Order has a strict data structure so like many other entities, it went to RDBMS. The draft was an incomplete data structure, so we couldn’t save it in the same RDBMS that enforced not-null/unique constraints. So the drafts went to Mongo, because you can easily save complex nested structures without any schema constraints.

    It’s a matter of using the right tool for the right job.

    • Eh. You could’ve as well just dumped the “draft” in a BLOB. Or you could’ve dumped the “draft” in a “similar” table that is created as such:

      CREATE TABLE drafts AS
      SELECT *
      FROM the_real_thing
      WHERE 1 = 0
      

      From an operations point of view, I do hope that you had also 1-2 other good reasons to have a separate database management system!:-)

      • The project was using polyglot persistence anyway. This was only a use case, we had more of those scattered across various sub-systems.

        We wanted to the drafts to be queriable too, so the BLOB was out of question. I also separated traffic between write/reads. All intermediate draft operations went to Mongo instead of RDBMS so the only time we wrote to SQL was when the order was finished. This has minimized change rates and facilitated cache updates, which only happened when the documents were full.

        • The project was using polyglot persistence anyway. This was only a use case, we had more of those scattered across various sub-systems.

          Fair enough, I was sure that you had other use-cases. I was just checking and documenting this fact for future readers. I still believe that the potentially tremendous operational overhead of adding more than one productive DBMS really calls out for justification…

          I still believe that this particular use-case would have been dead simple with an all-SQL solution. If traffic is an issue, it’s usually very easy to just run a new instance in the same SQL container, possibly partitioned, sharded, with the big advantage (you somehow see this as a disadvantage) of being able to have an almost-identical schema for draft copies as for permanent storage. In such a case, also querying and transaction management would’ve been dead simple.

          Again, you had polyglot persistence anyway (and I’m personally tempted to challenge each one of your use-cases, should you want to take the time to list them😉 ), so maybe in your specific project set-up, this choice made sense. But I’m going to stay really hard to convince about the fact that anything good can derive from having too many DBMS in one system. You know…

          • Makes sense. I wouldn’t advice for polyglot persistence unless you have an large enterprise project that calls for such topology.

            This project was the largest media platform in one European country, so it had both funds and skilled people to conceive a solid architecture. Normally you run under strict budgets constraints and any decision must be analysed for financial impact.

            I can’t detail all use cases, because of confidentiality rules, but I can tell you that we’ve been using MongoDB as a queriable persistent cache and it performed very well.

            But mixing technology demands more skills, more resources, it makes the Ops life even harder. There are always unforeseen scenarios you have to address, so it has a high cost. Sometimes it’s worth paying it because it would cost more otherwise.

          • Sometimes it’s worth paying it because it would cost more otherwise.

            I guess that summarises it as a counter-argument against my general hammer/nail point of view. At some point, the hammer is no longer sufficient…

      • For complex structures with lots of relations, a single table probably won’t do it I could imagine. And while the BLOB might work, you’d probably need to write error prone serializers which you get free in your database driver / layer just dumping it into mongo so you save that step. At least that’s what I can imagine are entirely good reasons to use it in the scenario outlined.

        As for the blog post, MongoDB generally does not enforce a schema, that’s why people call it schema less I guess, because it can be a stark contrast when coming from SQL. However it is indeed not entirely true. You should take care and actually design a schema and conform to it. But since MongoDB does not enforce anything (aside from an unique constraint on an index field) you can break it at will. Schema non-enforcing sounds stupid however and let’s face it: It all starts with CREATE TABLE foo SOME DEFINITION in SQL, doesn’t it? There’s clearly a difference and if you want to read the “schema” of any MongoDB’s collection you really have to read every single object in that collection to be able to say what it is.

        So in the end, it kind of has a schema, but also doesn’t really.

        As for the reasons to have a seperate database management system. Come on, every one of us use NoSQL databases extensively every single day. It’s nothing special. Best tool for the job and all that. Few people realize that because we tend to call them by a different name. Yet filesystems are Non SQL databases. Very important ones, too.

        • For complex structures with lots of relations, a single table probably won’t do it I could imagine

          Nope. But you could duplicate entire parts of your schema. I would imagine that temporary persistence and permanent persistence shouldn’t be that different in the end, except for validation.

          you’d probably need to write error prone serializers which you get free in your database driver / layer just dumping it into mongo so you save that step

          I get this point, although if you’re a SQL-oriented person like me, then serialisation is really not an issue at all. It is so dirt-simple to serialise tuples and multisets of tuples that I wouldn’t consider this an argument. And modelling domain data in the form of tuples (rather than complex OO graphs) will also help you transition from cached versions -> temporary versions -> persisted versions. Of course, if you’re paid by the number of LoC that you’re writing to implement those various data representation model transitions, then OO-model-driven design is clearly better for you😉

          There’s clearly a difference and if you want to read the “schema” of any MongoDB’s collection you really have to read every single object in that collection to be able to say what it is.

          Yes, that’s a quite interesting thought. And by introspecting all those objects, you’ll probably also get some stale elements that are no longer valid / in use.

          Yet filesystems are Non SQL databases. Very important ones, too.

          I agree, and I wish they were more SQL oriented ones!:-) Imagine to be able to query the file system with actual SQL, and have file attributes behave like tuple attributes. Unfortunately, WinFS never made it…

          In any case, I’ve worked on a couple of applications including E-Banking systems, or Adobe Experience Manager (ironically, based on JCR, a NoSQL database) where the file system was omitted entirely and everything was stored in the database, which in turn obviously stored stuff on a file system. Being able to rely on an RDBMS’s transaction model without knowing whether the storage will be performed on Windows, Linux, Solaris, or worse: AIX and related process models, file system models, etc. is really a treat!

          Heck, I’ve recently implemented a simple “Java engine” that fetches *.java files from Oracle, compiles them on the fly (upon change), and class-loads compilation output at run time!:-)

          OK, I do realise that I’m clearly a hammer/nail person, but my general time-to-market is favouring many of the above decisions.

          So, I’m also challenging the use of file systems via direct access:-)

    • sorry – but that just means you dont have someone on the team who understands how to use an rdbms. the constraints were defined poorly. I save partial orders and payments and plenty of other part-state business data in my relational db’s all the time. I am stunned at the illiteracy of the software development community these days. you have a high performing data engine and still you would prefer to write complex application layer code to do the things the database does in its sleep. There is no single use case for using a noSQL over a relational database in the world. we are suffering from the last ten years of front end developers viewing the database as somewhere to “stuff” their application defined data model. And as the original author suggests, the data model for the problem domain will long outlast any application code.
      Rails is a case in point. DHH eschews the database… get it out of my way… but also promotes the view that data validation rules should be in one (code) place only in the application. Great… but now what happens when two applications need to share the same database… hmm… lets put validation and complex business logic in two applications! yes… great idea…LOL!
      ORM’s and rails silly (and very weak) equivalents of db validations and ‘hopeful’ locking (wont even call it optimistic), are nuts. learn to drive a database and your app development will be 100% simpler.

      Nice article overall – kudos to you for sticking your neck out and saying the emperor has no clothes! :)

      • Nice article overall – kudos to you for sticking your neck out and saying the emperor has no clothes!

        Thanks. I sometimes believe that the emperor was just a troll to start with😉

  3. I agree with pretty much everything you have written, but for the sake of avoiding selling snake oil its worth pointing out that adding/removing/changing a column in an immature RDBMS like MySQL on any table of a decent size (who gives a shit about less than 1M rows) can be a real pain point. Also made me wish for Mongo (almost)

    • Immature RDBMS… Well, this blog runs on it, like millions of other blogs. And micro blogs. And Facebook. And pretty much all of Social media. Incredible, what’ll happen when they all just moved to a “mature” RDBMS!

      Anyway. So. When do you change / remove the column in Mongo? When you’ve piled up 5 years of legacy trash data that you didn’t want to change / remove right back then?:-)

      • My point wasn’t that you can’t use MySQL for anything successful nor that Mongo or ‘NoSQL’ is that answer. Simply that last time I made a column nullable in MySQL it took 6 hours and 80G of space to do it, please not I couldn’t use an online DDL statement as that failed after 6 hours. So we are talking 6+ hours of blocked transactions on that instance.

        This would not have been an issue in Mongo, instead the lack of transcations and data safety would have no doubt bitten me in the ass.

        All in all prefer an RDBMS and prefer Oracle and Postgresql over MySQL.

  4. I’m interested about your opinion about replication in RDBMS (in this case Postgres) and MongoDB. I was working in team that develop/maintain one of the biggest e-shops in my country. The problem we had was that load on application was too big and we needed to split it into 3 servers (load balancing on tomcat) + 1 server with admin panel for our customer. The problem was that replication MASTER-MASTER worked shitty. We had a lot of blocks because of conflicts on both DB. On the other side I heard that Mongo can handle replication more nicely.

    The second thing is that we had problems with adding columns, i.e. there was table with 10’000’000 rows and ¬260 columns. Adding single column (with null value) took about 20-30 min and a lot of space on disk. In Mongo there would be no problem, just save the object:). Also worth mentioning is that we had some columns that in 9’999’998 cases was null and only with 2 rows was filled with some data – in my opinion that is big waste of space. (FYI this table was containing shop orders)

    • I have no production experience with replication for PostgreSQL – although I do hear a lot that this is a hot, not yet solved topic (e.g. where Oracle or MySQL would work much better). People often resort to sharding, where data isn’t really replicated, but partitioned across different nodes.

      As far as adding columns is concerned, it is a pain that in some RDBMS, adding tables is a blocking operation. This isn’t the case in Oracle, though, where new nullable columns can be added almost immediately. Of course, you do realise that in the event of 9’999’998 empty rows vs. 2 present rows, you would inevitably resort to normalising the column into a one-to-one relationship, no…?

      • How Oracle/MySQL deals with replication?

        Of course, you do realise that in the event of 9’999’998 empty rows vs. 2 present rows, you would inevitably resort to normalising the column into a one-to-one relationship

        Client said that this column would probably be null in ~10%. He lied…😉

        • How Oracle/MySQL deals with replication?

          I cannot tell you unfortunately. I haven’t been lucky enough to work with the DBAs doing that stuff yet…

          Client said that this column would probably be null in ~10%. He lied

          They always lie:)

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s