There is no Such Thing as Object-Relational Impedance Mismatch

Much of the ORM criticism of the last decade has missed the point, and has been inaccurate. By the end of this article, we will have reached the following conclusion:
There is no significant difference between the relational (data) model and object-oriented models.
How do we come to this conclusion? Read on!

How we came to believe in this fallacy

Many popular bloggers and opinion leaders have missed no chance to bash ORMs for their “obvious” impedance mismatch with the relational world. N+1 queries, inefficient SQL, library complexity, leaky abstractions – all sorts of buzzwords have been employed to dismiss ORMs, often containing a lot of truth, albeit without providing a viable alternative.

But are these articles really criticising the right thing?

Few of the above articles recognise a central fact, which has been articulated eloquently and humorously by Erik Meijer and Gavin Bierman in their very interesting paper “A co-Relational Model of Data for Large Shared Data Banks”, subtitled:
Contrary to popular belief, SQL and noSQL are really just two sides of the same coin.
Or in other words: The “hierarchical” object world and the “relational” database world model the exact same thing. The only difference is the direction of the arrows that you draw in your diagrams. Let this sink in.
  • In the relational model, children point to their parent.
  • In the hierarchical model, parents point to their children.
That’s all there is to it.
[Diagram: hierarchical vs. relational]

What is an ORM?

ORMs bridge the gap between the two worlds. They’re the inverters of arrows, if you will. They make sure that every “relation” in your RDBMS can be materialised as an “aggregation” or “composition” in your “hierarchical” world (this works for objects, XML, JSON, and any other format). They make sure that such materialisation is properly transacted, and that changes to individual attributes or to relational (aggregational, compositional) attributes are properly tracked and flushed back into the master model, the database – where the model is persisted. Individual ORMs differ in the features they offer and in how much mapping logic they provide in addition to mapping individual entities to individual types.
  • Some ORMs may help you implement locking
  • Some may help you to patch model mismatches
  • Some may focus merely on a 1:1 mapping between these classes and tables
But all ORMs do one very simple thing. Ultimately, they take rows from your tables and materialise them as objects in your class model, and vice versa. A very nice overview of different ORMs has recently been compiled on the Vertabelo blog, by the way.
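
To illustrate the first bullet above, an ORM that helps with optimistic locking typically guards its flush statements with a version check. A minimal SQL sketch, assuming a hypothetical version column that is not part of the examples below:

UPDATE author
SET first_name = ?, last_name = ?, version = version + 1
WHERE id = ? AND version = ?;
-- if no row was updated, another session has modified this author in the meantime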

Tables and classes are the same thing

Give or take 1-2 implementation details, an RDBMS’s table and an OO language’s class are the same thing: a specification of a set of grouped attributes, each with its associated type. Consider the following example, using SQL and Java:

SQL

CREATE TABLE author (
  first_name VARCHAR(50),
  last_name VARCHAR(50)
);

Java

class Author {
  String firstName;
  String lastName;
}

There is absolutely no conceptual difference between the two – the mapping is straightforward. The mapping remains just as straightforward when you consider “relations” / “compositions” between different entities / types:

SQL (let’s leave out constraints for simplicity)

CREATE TABLE author (
  id BIGINT,
  first_name VARCHAR(50),
  last_name VARCHAR(50)
);

CREATE TABLE book (
  id BIGINT,
  author_id BIGINT,
  title VARCHAR(50)
);

Java

class Author {
  Long id;
  String firstName;
  String lastName;
  Set<Book> books;
}

class Book {
  Long id;
  Author author;
  String title;
}

The implementation details are omitted here (and probably account for half of the criticism). But omitting those details allows for a straightforward 1:1 mapping of individual rows from your database to your Java model, without any surprises. Most ORMs – in the Java ecosystem, Hibernate in particular – have managed to implement the above idea very well, hiding away all the technical details of actually doing such a model transfer between the RDBMS and Java. In other words:
There is absolutely nothing wrong with this mapping approach!

Yet: There *IS* an impedance mismatch, somewhere

The “problems” that many bloggers criticise do not arise from the (non-existing) mismatch between the two model representations (“relational” vs. “hierarchical”). The problems arise from SQL, which is a decent implementation of relational algebra. In fact, the very same mismatch that everyone criticises is also present between:
  • the relational model
  • relational algebra
Relational algebra has been defined in order to be able to query relations and to form new ad-hoc relations as an output of such queries. Depending on the operations and transformations that are applied, the resulting tuples may have absolutely nothing to do with the individual entities involved in a query. In other, ORM-y words: the product of relational algebra, and in particular of SQL, has no further use to the ORM, as it can no longer be processed by the ORM, let alone persisted back into the database. To make things “worse”, SQL today is a large superset of the features offered by relational algebra. It has become much more useful than it was when it was conceived.
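
As an illustration (a sketch using the author / book schema above), the following query produces a perfectly useful ad-hoc relation that corresponds to no entity at all – an ORM cannot materialise it as an Author or a Book, let alone persist it back:

SELECT author.last_name, COUNT(*) AS books_written
FROM author
JOIN book ON book.author_id = author.id
GROUP BY author.last_name;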

Why this mismatch still affects modern ORMs

The previous paragraphs outlined the single main reason why ORMs are really criticised, even if such criticism often doesn’t mention this exact reason:
SQL / relational algebra is not really appropriate to partially materialise relations into a client / store changes back into the database. Yet, most RDBMS offer only SQL for that job.
Back to the author / book example. When you want to load and display an author and their books to a web application’s user, you’d like to simply fetch that author and their books, call simple methods like author.add(book) as well as author.remove(book) and let some magic flush your data back into the storage system. Thinking about the amount of SQL code to be written for such a simple CRUD task makes everyone squeal.
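
To make that “magic” concrete: behind author.add(book) and author.remove(book), an ORM tracks the changes and flushes statements roughly like the following (a sketch with hypothetical values, assuming the schema above) – and it also has to work out by itself which rows are new, changed or removed:

-- author.add(book): the new book is inserted and linked to its author
INSERT INTO book (id, author_id, title)
VALUES (4, 1, 'A New Book');

-- author.remove(book): the book is unlinked or deleted
DELETE FROM book
WHERE id = 4;

-- a changed attribute on the loaded author is flushed as an UPDATE
UPDATE author
SET last_name = ?
WHERE id = 1;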
Life’s too short to spend time on CRUD
Perhaps QUEL might have been a better language for CRUD, but that ship has sailed. And unfortunately, because SQL is an inappropriate language for this job, you cannot ignore that “magic” but have to know well what happens behind the scenes, e.g. by tweaking Hibernate’s fetching strategies. Translated to SQL, this may be implemented in several ways:

1. Fetching with JOIN

Using outer joins, all the involved entities can be queried in one go:

SELECT author.*, book.*
FROM author
LEFT JOIN book ON author.id = book.author_id
WHERE author.id = ?

Advantages:
  • A single query can be issued and all the data can be transferred at once
Disadvantages:
  • The author attributes are repeated in every tuple. The client (ORM) has to de-duplicate authors first, before populating the author-book relationship. This can be particularly bad when you have many nested relations that should be fetched at once.
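
To illustrate, the joined result for a single author repeats the author columns once per book (using the sample data from the MULTISET example further down):

first_name  last_name   title
---------------------------------------------
Leonard     Cohen       Book of Mercy
Leonard     Cohen       Stranger Music
Leonard     Cohen       Book of Longing
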
2. Fetching with SELECT

A separate query is issued for each entity:

SELECT *
FROM author
WHERE id = ?

SELECT *
FROM book
WHERE author_id = ?

Advantages:
  • The amount of data to be transferred is minimal: Each row is transferred exactly once.
Disadvantages:
  • The amount of queries that are issued may explode into the well-known N+1 problem.
Hibernate in particular offers further fetch strategies, although they are essentially variants / optimisations of one of the above.
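
For instance, a batch fetch strategy – a variant of the SELECT approach – loads the books of several authors in a single round trip, along the lines of (a sketch):

SELECT *
FROM book
WHERE author_id IN (?, ?, ?)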

Why not use SQL MULTISET?

The ideal way to fetch all data in this case using advanced SQL would be by using MULTISET:

SELECT author.*, MULTISET (
  SELECT book.*
  FROM book
  WHERE book.author_id = author.id
) AS books
FROM author
WHERE id = ?

The above will essentially create a nested collection for each author:
first_name  last_name   books (nested collection)
--------------------------------------------------

Leonard     Cohen       title
                        --------------------------
                        Book of Mercy
                        Stranger Music
                        Book of Longing

Ernest      Hemingway   title
                        --------------------------
                        For Whom the Bell Tolls
                        The Old Man and the Sea
If you add another nested entity, it is easy to see how another MULTISET could allow for additionally nested data:

SELECT author.*, MULTISET (
  SELECT book.*, MULTISET (
    SELECT language.*
    FROM language
    JOIN book_language AS bl
    ON language.id = bl.language_id
    AND book.id = bl.book_id
  ) AS languages
  FROM book
  WHERE book.author_id = author.id
) AS books
FROM author
WHERE id = ?

The outcome would now be along the lines of:
first_name  last_name   books
-----------------------------------------------------

Leonard     Cohen       title            languages
                        -----------------------------
                        Book of Mercy    language
                                         ------------
                                         en

                        Stranger Music   language
                                         ------------
                                         en
                                         de

                        Book of Longing  language
                                         ------------
                                         en
                                         fr
                                         es
Advantages:
  • A single query can materialise all eager-loaded rows with minimal bandwidth usage.
Disadvantages:
  • None.

Unfortunately, MULTISET is poorly supported by RDBMS.

MULTISET (as well as arrays and other collection types) was introduced formally into the SQL standard as of SQL:2003, as part of an initiative to embed OO features into the SQL language. Oracle, for instance, has implemented much of it, as have Informix and the lesser-known CUBRID (although using vendor-specific syntax). Other databases like PostgreSQL allow for aggregating nested rows into typed arrays, which works the same way, although with a bit more syntactic effort. MULTISET and other ORDBMS SQL features are the perfect compromise, combining the best of the “relational” model with the best of the “hierarchical” model: they allow for combining CRUD operations with querying in one go, removing the need for sophisticated ORMs, as the SQL language can be used directly to map all your data from your (relational) database to your (hierarchical) client representation with no friction.
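
For illustration, here is a sketch of the PostgreSQL workaround mentioned above, aggregating each author’s books into an array of row values (json_agg would work similarly):

SELECT
  author.first_name,
  author.last_name,
  array_agg(book) AS books   -- one array element per book row
FROM author
LEFT JOIN book ON book.author_id = author.id
GROUP BY author.id, author.first_name, author.last_name;

Authors without books end up with an array containing a single NULL element, which would need to be filtered out – one example of the extra syntactic effort mentioned above.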

Conclusion and call to action!

We’re living through exciting times in our industry. The elephant (SQL) in the room is still here, learning new tricks all the time. The relational model has served us well, and has been enriched with hierarchical models in various implementations. Functional programming is gaining traction, complementing object orientation in very useful ways. Think of the glue, putting all these great technological concepts together, allowing for:
  • Storing data in the relational model
  • Materialising data in the hierarchical model
  • Processing data using functional programming
That awesome combination of techniques is hard to beat – we’ve shown how SQL and functional programming can work with jOOQ. All that’s missing – in our opinion – is better support for MULTISET and other ORDBMS features from RDBMS vendors. Thus, we urge you, PostgreSQL developers: you’re creating one of the most innovative databases out there. Oracle is ahead of you in this area – but their implementation is too strongly tied to PL/SQL, which makes it clumsy. You’re missing out on one of the most awesome SQL feature sets: the ability to construct nested collections (not just arrays), and to query them efficiently. If you lead the way, other RDBMS will follow. And we can finally stop wasting time talking about the object-relational impedance non-mismatch.

36 thoughts on “There is no Such Thing as Object-Relational Impedance Mismatch”

  1. Actually, the table and the class treat associations differently. In a database, only the child table knows about its relationship to the parent table. With a bidirectional entity association, you can have the parent side be aware of its children as well.

    There is a mismatch, because you can’t treat object associations as if you are using an in-memory storage. Fetching associations, adding and removing them must all be translated to database operations and that must be done in an effective manner, as otherwise you get N+1 issues or other redundant database operations.

    1. That’s an implementation detail, and no, foreign key constraints are known to the parent table as well, as they have to enforce the ON DELETE and ON UPDATE semantics (if supported) on the parent.

      Anyway, did you read the full article? There is no mismatch between the models and semantics. The only mismatch is between the models and what you call “operations”, because popular implementations of those “operations” currently cannot reliably materialise the model to clients.

  2. The article mistakes structs for objects. How would behaviour (which is what objects add to structs) be stored in a relational database? How is *that* not an impedance mismatch?

    1. When ORDBMS were discussed (mostly late 90s), objects as you interpret them moved into the database by allowing for methods (behaviour) to be placed on relations. Oracle has implemented this, PostgreSQL to some extent, CUBRID and Informix, too.

      The approach proved rather complex and was never really popular – just as the behaviour that you mention was never really popular on the objects that are used to model relations in clients (entities).

      Object orientation is a very heavily loaded term. A lot of expectations / models are merged into single types (such as Java classes, for instance) without necessarily being related. E.g. you could have identity without behaviour, but you cannot in Java. In Java 10 (hopefully) you can have behaviour without identity, though. In the end, the article isn’t concerned with object orientation and its legacy, but with hierarchical data models (with OO being one approximation thereof). Behaviour is really secondary in this discussion.

    2. I suspect that whoever came up with the IM term was trying to sound scientific by borrowing it from physics. In the RDM, R-tables do NOT encapsulate; only domains do, and intentionally so.

      Incidentally, data model was defined by Codd as consisting of structure, manipulation and integrity. Can somebody specify – precisely, please! – what are the formal OO equivalents to the R-table, relational algebra and the several types of integrity constraints that each R-table is subject to?

      1. Perhaps, the fuzziness of semi-convincing marketing and associated terms just always wins over the “preciseness” of “science”. In other words, content wins over form because content is ultimately more useful and practical.

        1. Or “useful and practical” is the lie they tell themselves so they can keep going round in circles, pretending they’re continually “making progress”.

          1. So, we can conclude that both sides of the argument will lie to themselves in order not to have to listen to the other side.

            Mankind at its best :-)

            Cheers

  3. The “hierarchical” object world and the “relational” database world model the exact same thing.

    Well, yeah, both model reality, and modulo differences in perception being mistaken for reality, we all assume quite safely that there is only one reality. So saying they model the same thing is just kicking in open doors, if not for the side-effect of distracting the unwary’s attention from the real issue.

    The only difference is the direction of the arrows that you draw in your diagrams.

    You clearly conflate conceptual modeling with logical modeling. The relational model is quite manifestly NOT about “diagrams”, let alone the direction in which arrows are drawn in them. Logically modeling a relational database means inventorising all the relations that are needed to record all relevant information (which involves inventorising their respective external predicates as well as the domains they are defined over) and inventorising (ALL) the constraints that govern the database. No graphical modeling notation will ever achieve the latter – for that you just need mathematics notation.

    Most of the flaws in the remainder of your article are a derivative of this misconception. There is, however, one more claim I would like to comment on : “Tables and classes are the same thing”. In “The Third Manifesto” (which is the work by Chris Date & Hugh Darwen that should lead the relational model into the 21st century and also make the marriage OO/RM a feasible one), this claim is classified as one of the “Two Great Blunders” (plus of course a lengthy argumentation of why this is so). Summarized : Tables are variables and classes are types. Variables are not the same thing as types, ergo tables cannot “be the same thing” as classes.

    1. Well, yeah, both model reality, and modulo differences in perception being mistaken for reality, we all assume quite safely that there is only one reality.

      Nice observation.

      No graphical modeling notation will ever achieve the latter – for that you just need mathematics notation.

      Why do you think that a graphical modeling notation is not as powerful as a mathematical one? What is the exact difference? Perhaps, I should write a follow-up article about them being the exact same thing ;-)

      You’re right. The article is conflating the two notions of modeling, based on the assumption that the distinction is irrelevant. People do use directed arrows in their graphical models, which helps them understand the implementation they’re going to use for each model type (hierarchical: Java, relational: SQL). The hidden implication of the article, however, is that there are no directions in either model, just as you suggest. It is also possible to think of the arrows as constraints that are independent of the boxes in the diagrams, in which case they do not contradict what you’re claiming.

      Most of the flaws in the remainder of your article are a derivative of this misconception.

      Feel free to list them individually.

      Tables are variables and classes are types.

      Classes (in Java) have a type. Whether they are the type is arguable. It’s obviously hard to discuss theory on a legacy language that even the designers often regret.

      But fine. Since you came here to correct individual usage of terms, feel free to correct the terms as you see fit. I’m sure you’ll find terms (like row type) that can be substituted to make the statement formally correct.

      1. Do you really mean to say that you don’t see a difference between graphical and mathematical notation? That’s a much worse blunder.

        As to predicates: they have an intension and an extension. The former is the specification of the criterion for membership of a row in the R-table; the latter is the body of the R-table. Informally, I call the former the type specification and the latter the class, but they are distinct from the OO class and type concepts. I consider this more useful than arguing with OO, which is a programming, not data management paradigm.

        1. Do you really mean to say that you don’t see a difference between graphical and mathematical notation? That’s a much worse blunder.

          Graphical and symbol languages are similar in many ways. Perhaps, you’d like to point out what exactly is fundamentally different in the context of this discussion here?

          I consider this more useful than arguing with OO, which is a programming, not data management paradigm.

          OO mixes both programming and data management paradigms.

          1. Aw, c’mon.

            No, OO is a programming paradigm extended only subsequently with something called ODM which was never formally defined.

      2. “Feel free to list them individually.”

        Here goes.

        (1) “They will make sure that every “relation” in your RDBMS can be materialised as an “aggregation” or “composition” in your “hierarchical” world”

        This is the common mistake of confusing E/R relationSHIPS with RM relationS. (Itself a typical and omnipresent consequence of conflating E/R modeling with relational modeling). In RM, entities as well as relationships end up being relations, but some entities might end up as types, and some relationships might end up being a constraint.

        (Another mistake in that sentence is the misconception – equally omnipresent – that relationships in an E/R diagram represent “navigational links that can be followed” in a relational database. (Accompanied by the corollary – equally flawed of course – that if there’s no arrow then there’s no link that can be followed.) People who think that create a perception of “relational” being just another kind of jacket for the same thing as what they were/are used to from, say, IMS, IDMS, or Neo. They have totally misunderstood and have totally missed the point.)

        (2) “In fact, the very same mismatch that everyone criticises is also present between: the relational model relational algebra”

        Nowhere in your article do you prove your total lack of understanding of the RM so clearly as with this howler. Those two things are about the only things in data management land between which NO MISMATCH EXISTS AT ALL. They are Symbiotic Siamese Twins. The relational model would be useless if it didn’t fit the algebra perfectly, and the algebra would be a lot less useful if the RM allowed information to be represented as anything else than a relation. (This, of course, is once again a natural consequence of not knowing what a relation in the relational model really is.)

        (3) “The product of relational algebra, and in particular of SQL has no use”

        This is having things completely backwards. It takes the usability and fitness for purpose of the ORM’s as they currently exist as an axiomatic requirement, to then observe that what existed before the ORM’s (RA) can produce things that don’t fit the ORM’s and therefore the previously existing thing cannot be any good.

        Also, your observation that “the result of a query cannot be used directly to go updating using the data in the queried form”, has in fact nothing to do with ORM’s per se. It’s the ancient problem of view updating. Which has indeed not yet been solved satisfactorily.

        ORM’s have *CHOSEN* to “restrict” their scope of operation to the class of data that “can go both ways”. That they cannot cope with data of the class that “can go only from data server to OO client” is a consequence of their own choice.

        To claim that “one way data” (data queried in a form that does not directly match the structure of an updatable component in the database), as you say, “has no use” is outright stupid. In fact, the very property of being able to go still at least one way shows what its use is.

        (4) “Thinking about the amount of SQL code to be written for such a simple CRUD task makes everyone squeal.”

        DELETE FROM BOOK WHERE book_id = … ;

        Wow. That’s really a reason for squealing.

        Many people make the mistake you made here because, entrenched as they are in their vision of things being “connected” in the database, they have a mindset that prevents them from realising that in a relational database, it is possible to consider books and manipulate them _WITHOUT_ also having to bring in every nitty gritty little detail about the authors. People usually get this entrenchment by using the products of the ORM vendors, and the vendors write their ORMs the way they do because they are locked up in those same trenches (or because they find it a comfortable place to be, nice cozy and crowded if everyone follows them, I don’t know).

      3. “Classes (in Java) have a type.”

        Well, then I think it would be useful to go through some documentation such as the formal java language spec.

        E.g. Section 4.1 “types” : There are two kinds of types in the Java programming language: primitive types (§4.2) and reference types (§4.3).

        And section 4.3.1 : “There are four kinds of reference types: class types (§8.1), interface types (§9.1), type variables (§4.4), and array types (§10.1).”

        Or go through the runtime javadoc and see how many times they use the word “type” when they clearly mean “class” (this is a habit that varies with the author of the doc at hand).

  4. First of all, thank you for this thought provoking blog :-).

    I think you have proved yourself wrong by saying:

    “But all ORMs do one very simple thing. Ultimately, they take rows from your tables and materialise them as objects in your class model and vice-versa.”

    Imo if you need any kind of “materialise” (= transform, …) then there is a mismatch. In the same way there is a mismatch between OO and JSON representation (one of the things json can’t directly represent is cycles for example).

    The other thing I can’t agree with is your notion of a “relation” – you don’t seem to distinguish between a relation in the mathematical sense, which is a set of tuples, and a relationship (as graph databases adopted this). Because of this, the graph model is much closer to OO than the relational one, but still there is a mismatch. And yes, you can model a relationship as a relation, but they are not the same thing!

    I should probably add that I don’t see being close to the OO model as a holy grail. I use a relational database when I want to do relational algebra (= usually aggregation on the data) – that’s what it was designed for, but when I need an OO model (= data with behaviour) then I will choose something much closer to OO to persist my data.

    1. First of all, thank you for this thought provoking blog

      You’re welcome :-) And thanks for being nice. The other commenters tend to agree with you, yet have chosen a slightly different tone. I suspect that the topic is very emotional :-)

      Imo if you need any kind of “materialise” (= transform, …) then there is a mismatch.

      Not in theory, as the materialisation just means that the remote data (from the database) is cached in the client(s). The article uses “materialise” as a verb. You could name things differently. The point is that an ORM makes (parts of) the data available to a client, pretending that the data is local, not remote.

      There is an interesting point of view in the reddit discussion about this article. “Earth_C-137” claims:

      It’d be more aptly called the Local-Remote Impedance Mismatch.

      That’s true, but it’s not a mismatch between the models.

      In the same way there is a mismatch between OO and JSON representation (one of the things json can’t directly represent is cycles for example)

      Neither can OO. A reference is just a pointer, not the real object, and a pointer is just an implementation of a concept. You can implement pointers in JSON (by convention). This has been done in XML before. I agree it can get a bit hairy, but all that hairiness comes from implementation flaws, not necessarily a flaw in the model itself.

      I admit, though, that JSON is an extremely poor implementation choice for graph modelling.

      And yes, you can model relationship as a relation, but they are not the same thing!

      You’re right, of course, but the distinction was not essential for the point of this article – although, I do admit that if the article had been more precise about terms, there would have been less distraction and bike-shedding both in the comments and on reddit…

      when I need an OO model (= data with behaviour) then I will choose something much closer to OO to persist my data.

      I’m curious, what do you choose? And how does that improve over relational databases?

      1. I’m curious, what do you choose? And how does that improve over relational databases?

        If my data (and the queries on them) fit a simple hierarchical aggregate, then a document database like mongodb. If the data are more linked, then a graph database like neo4j.

        It brings me simplicity. I ask the question differently – what does a relational database bring me that I need? Most of the time nothing, and the code and data model are simpler with a graph or document database.

        Where relational databases (and SQL) really shine is when you do aggregation on a large portion of your data.

    2. Correct: to the extent that there is an “ODM”, it is a directed graph model (network) which was discarded decades ago in favor of the RDM, because it was too inflexible and complex.

      A mathematical relation is a set that IS a relationship between sets–domains (a subset of their Cartesian product).

      The only way in which “mismatch” makes sense is if you assume OO ****application development**** against relational databases.

      Incidentally, the fact that 2 approaches “model the same thing” does not mean that they are equally sound, practical, easy, simple, flexible and so on.

        Incidentally, the fact that 2 approaches “model the same thing” does not mean that they are equally sound, practical, easy, simple, flexible and so on.

        Indeed. Our mind can be bent only so far to intuitively follow a transformation between models. And few people have the time and/or passion to go beyond intuition in such transformations.

  5. I think there _is_ a mismatch, not in the mathematical models for relational and OO, but in how we work with each/what type of operations each one supports.

    In the OO world, mostly everything is based on side effects (it need not be, but it is – few objects are immutable). The side effects we want to cause are changes to particular objects. For this, we simply tell the object to execute an operation on itself, and it will know how to change.

    Incidentally or not, this is the same with noSQL stores: you read an object (whether it’s a JSON or a value for a key doesn’t matter), change it as you want, then write it back.

    With a relational database, you _never_ modify a single object. You always modify a table. This would correspond to the situation in which you’d call a specific operation on all instances of a particular class. True, you can filter the records to be updated so that your operation is applied to a single record, but that’s inconsequential for the discussion at hand.

    That’s what ORMs try to bridge. The problem caused by this mismatch is that having the ORMs update individual objects is hugely inefficient. ORMs try to compensate with smart caching techniques – which may cause even more performance problems – a few examples: http://www.slideshare.net/Codemotion/performance-antipatterns-in-hibernatee-by-patrycja-wegrzynowicz.

    ORMs make working with objects a mine field. Moreover, the DAO/repository pattern we see in so many applications is in fact an anti-pattern – entity classes have no logic, which resides in services – data and operations on it are split. There’s no way you can build a truly active, complex, working application model with passive entities and stateless services, which should be also convenient to work with, specifically because of this split between data and logic, and the fact that you have to know your ORM inside out, in order to avoid common performance pitfalls.

    NoSQL, OTOH, is way friendlier in this regard. It’s less powerful, re. the type and quantity of logic you can put into the store (or at least is so in most cases). The motivation for its development was a completely different one than that for relational stores, but that’s also inconsequential for this discussion. What’s important is that (by coincidence) noSQL stores proved to be a lot friendlier in an OO environment. There’s not much care needed to avoid performance issues (or at least not in development – it’s something fully manageable by ops), the fact that there’s no logic in the store and that there’s no imposed structure by the store allow you to put function-rich stateful objects very naturally inside the store, and this in turn avoids the schizophrenia brought about by the service/entity pattern so common with relational stores and ORMs.

    True, many if not most enterprise/LOB apps will benefit from the facilities of a relational store. However, even for those many cases I think an ORM is a less than ideal fit, at least from a developer’s point of view. You’re walking on thin ice, performance-wise, most of the time (luckily ORMs and the hardware are fast enough to let even the clumsiest applications, persistence wise, survive in production, most of the time), and there’s a huge blur, usually, between what logic you put into the database (into indexes, foreign keys, stored procedures or triggers), and what logic you keep in your source code. ORMs do nothing to help you there. They offer to manage everything for you. However, the more powerful the elements you cede control to the ORM, the more awkward and complex their management becomes. In a project developed with teams of many people, there’s no way to draw a clear line, separating what should be put in the database from what should not. With logic split between source code and the database, tracing the logic becomes extremely tedious. And the risks of causing the ORM to do way more work than needed are even higher.

  6. Both the relational model and relational algebra (of which SQL is a derivative) are set-oriented. There are no child objects pointing to parent objects; there are child sets (tables) containing columns holding keys that are present in parent sets.
    The object model is oriented towards individual instances of classes. The classes merely impose a pattern on the instances but play no role in the dynamic object graphs.
    This is the principal mismatch. The object model can (quite awkwardly) be modeled by a relatively small subset of the relational model and that is exactly what ORMs do.

    The misconception that dealing with relational data is in any way similar to object persistence is the cause of the misuse of ORMs. Relational data is very useful and often it makes sense to have parts of the data temporarily represented by volatile (in memory) object instances. ORMs can help automate this and they are fine for that. Good systems design starts with a relational design, optionally followed by a derived object design.

    Unfortunately people often go about it the other way around. They come up with an object model and need a persistence mechanism. In comes the ORM and the RDBMS and then they are surprised that things don’t work as smoothly as they expected.

    1. There are no child objects pointing to parent objects

      The same is true for the object model. There are no parent objects pointing to sets of child objects. There is a number / key that helps the JVM look up an object of a collection type, which contains more numbers / keys that help the JVM look up more objects that happen to be the children.

      So, much of the criticism of this article has focused on the confusion between the actual theory (the relational model) and its application in real world scenarios when doing ERDs, where relationships are indeed a very important modelling part. Of course, the essence of the article can be refuted by discussing only theory, or only implementation.

      Unfortunately people often go about it the other way around. They come up with an object model and need a persistence mechanism. In comes the ORM and the RDBMS and then they are surprised that things don’t work as smoothly as they expected.

      That’s true. Much of the confusion stems from this, but again, I believe that this is related to the implementation of various technologies involved. In principle, if ORDBMS extensions are allowed (like MULTISET, or inheritance, which isn’t dealt with in this article, but which exists in ORDBMS), there wouldn’t be much mismatch left conceptually.

  7. To me, the whole idea of MULTISET looks stupid. In your image, first_name and last_name don’t get repeated for MULTISET, and you’re claiming that it minimizes the bandwidth usage. Sure, it does, but…

    WHY should they be repeated when using a LEFT JOIN? They are obviously repeated in a printout I get e.g. from a command line client. But there’s no reason to transfer repetitions physically, that’s just something between the DB server and the (e.g. JDBC) driver. It’s a rather trivial compression, so if someone forgot to do it, they should feel ashamed and they should fix it. Silently. Instead of introducing yet another keyword. And instead of forcing the clients to deal with differently structured data.

    What I mean is: forget MULTISET, transfer data as if it was used instead of the JOIN. Whenever possible.

    AFAIK all that’s needed on the server side is to guarantee that the rows don’t get permuted needlessly (with “ORDER BY title”, you’d break the order yourself, so it’s your problem; without any order specified, the DB would be forced not to break books by one author apart from each other).

    AFAIK apart from a tiny driver improvement, nothing is needed on the client side. There could be a method in ResultSet for querying the repetitions, which would save deduplication work in tools like Hibernate.

    1. You’re arguing against more semantics in a language? That’s new, and I’ll bite! :)

      A little counter-example: Imagine a ton of LEFT JOIN that ultimately produce a COUNT(*) which we want to order descendingly in order to rank the left joined path. You would like to discourage this sort of practice because of low level network protocol optimisations that people should be aware of?

      Another counter-example: Let’s imagine we use relational algebra to completely detach the result from any underlying table information (because we can and do with SQL, all the time). What is your formal idea of two consecutive rows being partially duplicated? How do you know they’re even related after we have removed any key information? What happens if I go wild about combining LEFT / RIGHT / FULL / and perhaps even LATERAL JOIN, not to mention the reordering of columns that I like to do in SELECT, mixing table columns with complex expressions? Is it still possible to know what “duplicates” are formal duplicates and what duplicates are “accidents”?

      1. Apologies for being that provocative. In case I’m wrong, you’re the right person to enlighten me.

        > You’re arguing against more semantics in a language? That’s new, and I’ll bite!

        I’m just arguing against additional syntactical constructs which IMHO don’t pull their weight. And which need support on the client side. Can I work with MULTISET in Java? Do Hibernate or EclipseLink make use of it? Does Spring support it? I don’t know, but Hibernate BatchFetchStyle does not seem to.

        I see SQL as an engine applications run on, so I’d leave the complexity to the applications. An engine needn’t be fancy, it needs to be simple and powerful.

        > A little counter-example: …

        Sorry, I’m not getting it. Could you write some SQL? If the rows get aggregated to a COUNT, then there are no redundant data to be transferred, are there?

        > You would like to discourage this sort of practice…

        No. They should do whatever they want, and if performance matters, dive into the details. There are tons of such details already, e.g., one popular database is too stupid to optimize dependent subqueries, so they have to use joins instead.

        > … low level network protocol optimisations that people should be aware of?

        For those knowing about multiset, it’s just: If the JOIN could be written as a MULTISET, then it can be transferred more efficiently.

        For the majority: If they don’t break the order of the “master entity”, it will be transferred more efficiently.

        For a counter-example, I’d need to see the loss when MULTISET gets replaced by a JOIN. My claims are:

        1. The bandwidth can be exactly the same.
        2. JOIN is supported by all tools.
        3. MULTISET is a more structured (good!) view of an equivalent result set, which needs to be supported (bad!).
        4. With some support from the DB, the JOIN result could be processed in the same way MULTISET can.

        1. Apologies for being that provocative. In case I’m wrong, you’re the right person to enlighten me.

          Don’t apologise :) It’s great!

          I’m just arguing against additional syntactical constructs which IMHO don’t pull their weight.

          You’re probably right, which is why this probably never made it into any database outside of Oracle and Informix.

          Can I work with MULTISET in Java?

          Yes, you can have nested ResultSet. JDBC is already fully MULTISET ready. The Hibernate developers were unaware of it when I discussed it with them at the time: https://twitter.com/Hibernate/status/549911307039035392. Not sure if Spring has any support for it. As Oracle is the only DB that really supports it, and as Oracle went a bit farther, requiring all MULTISET types to be named collections (TABLE or VARRAY types), the only “popular” client interaction is a lot trickier than it should be.

          An engine needn’t be fancy, it needs to be simple and powerful.

          Fancy is subjective. Is OUTER JOIN fancy? It can be expressed with INNER JOIN and UNION. Powerful is exactly what MULTISET is. And a part of the SQL standard, too. So it perfectly fits your idea of SQL, I think :)

          Sorry, I’m not getting it. Could you write some SQL? If the rows get aggregated to a COUNT, then there are no redundant data to be transferred, are there?

          I was trying to hint at the fact that your network protocol optimisation forces people to order by very specific columns only. My example would do something like (pseudo SQL)

          SELECT a.*, b.*, count(*)
          FROM a JOIN b JOIN c JOIN d
          GROUP BY a, b
          ORDER BY count(*) DESC
          

          Suddenly, there’s no way to optimise this. Now, imagine we add LIMIT on a.*, but we want all b.* per A. With MULTISET, again pseudo SQL:

          SELECT a.*, MULTISET(b.*), count(*)
          FROM a JOIN c JOIN d
          GROUP BY a
          ORDER BY count(*) DESC
          LIMIT 10
          

          This is rather hard to do with JOIN. And I can think of many other examples where MULTISET would offer functionality that is just not exactly the same thing as JOIN…

          1. > Is it still possible to know what “duplicates” are formal duplicates and what duplicates are “accidents”?

            Interesting question.

            – For the bandwidth, it doesn’t matter, even an accidental duplicate saves bandwidth.
            – For Hibernate it doesn’t matter either as they always fetch the ID.
            – Normally, you don’t care. If you do, just fetch the ID.

            > Fancy is subjective. Is OUTER JOIN fancy?

            For me definitely not. The important thing here is that it doesn’t alter anything important, especially the shape of the ResultSet. So it needs no special support on the client-side. Moreover, it’s something you need all the time.

            Not even FULL OUTER JOIN is fancy. It’s rarely needed, hardly ever implemented, but it’s just some join. Nothing really new, nothing what needs any support.

            > Powerful is exactly what MULTISET is. And a part of the SQL standard, too.

            The standard doesn’t help me, unless it gets implemented by what I’m using. Oracle is expensive and Informix is exotic.

            > I was trying to hint at the fact that your network protocol optimisation forces people to order by very specific columns only.

            Yes and no. If you use MULTISET(b.*), it imposes exactly the same restriction, namely it precludes the first example from your answer. Let’s say its result looks like

            a b count
            ---------
            1 2 9
            2 1 8
            1 1 7
            2 2 6
            

            Now, there’s no optimization as the rows are wildly permuted. But there’s also NO use for a MULTISET in this query. So I restate that “the bandwidth can be exactly the same”.

            > Now, imagine we add LIMIT on a.*, but we want all b.* per A.

            SELECT a.*, MULTISET(b.*), count(*)
            FROM a JOIN c JOIN d
            GROUP BY a
            ORDER BY count(*) DESC
            LIMIT 10
            

            OK, limiting one table is a problem. But… when LIMIT can’t do what we want, should we blame (and replace) JOIN or should we blame (and improve) LIMIT?

            SELECT a.*, b.*, count(*)
            FROM a LEFT JOIN b JOIN c JOIN d
            GROUP BY a
            ORDER BY count(*) DESC, a.*
            LIMIT 10 ON a
            

            OK, I added a different monsterconstruct to the game instead of MULTISET. Not really good.

            But what about a subquery returning everything but “b” and joining “b” later:

            SELECT a.*, b.*, cnt FROM
              (SELECT a.*, count(*) as cnt
               FROM a JOIN c JOIN d
               GROUP BY a
               ORDER BY count(*) DESC, a.*
               LIMIT 10
              ) AS ac
              LEFT JOIN b
            

            As it’s just a flattened MULTISET, my protocol optimization applies.

            1. For Hibernate it doesn’t matter either as they always fetch the ID. Normally, you don’t care. If you do, just fetch the ID.

              That’s where the confusion stems from. You’re thinking in terms of entities. I’m thinking in terms of arbitrary SQL usage, e.g. for reporting. Those are entirely different worlds. I think I made the “mistake” of claiming that my use-case will (if supported by more databases) also impact the entity graph traversal world. I still claim this would have happened, because one of the main features of an ORM is to do all that boring flattening / unflattening work when materialising the entity graph. That might not have been such a use-case, if the SQL language already had a native means of modelling that.

              But anyway. We’ll disagree on this point, regardless of the examples we’ll pull out of our sleeves. Indeed, your LIMIT examples are similar, but much more tedious (less expressive) to write. So, I’ll just accept you don’t like the idea.

              And, I’ll thus stop arguing here, given that this is a hypothetical blog post anyway (and given that MULTISET will never make it into other RDBMS) :)

  8. Excellent post.
    I’ve designed DBs for 30 years and whenever I’ve had the chance to work closely with the devs, I’ve never had a problem with mismatch. David C. Hay concurs with you.
    Shame about some of the pedantic comments.
    BTW the crows foot is the wrong way on your diagram.

    Peace.

    1. BTW the crows foot is the wrong way on your diagram.

      Whoops, thank you. Fixed.

      Don’t worry about the pedantic comments. This topic does attract people who like to refer to extreme formalism :)

    1. Yes indeed, XML and JSON both offer a way to generate ad-hoc nested collections, although the purely SQL ones would be much more powerful in my opinion.
