jOOQ Tuesdays: Gerald Sangudi and Keshav Murthy Reveal the Secrets of N1QL (SQL on JSON)

Welcome to the jOOQ Tuesdays series. In this series, we’ll publish an article on the third Tuesday every other month where we interview someone we find exciting in our industry from a jOOQ perspective. This includes people who work with SQL, Java, Open Source, and a variety of other related topics.

I’m very excited to feature today Gerald Sangudi and Keshav Murthy, creators of the the N1QL query language, CouchBase’s SQL-based JSON querying language.

Hi Gerald and Keshav – It’s great to have two people on the jOOQ Tuesdays interview at once! How did you guys meet?

Keshav: Gerald interviewed me for my the job at Couchbase.

Gerald: Best decision we ever made.

You’re both working at CouchBase, one of the leading “JSON databases” in the market. What drove you towards JSON?

Keshav: I worked for IBM, specifically on the Informix database. In 2013, Websphere teams needed JSON support. I was skeptical of JSON for data management except for managing metadata.  Years before, I had successfully opposed implementing XML and XQuery inside Informix. Then I went to a NoSQL conference where I saw enterprises using JSON in real-world applications.  I saw enterprises like eBay, Cisco developing their enterprise applications on NoSQL and JSON.

RDBMS customers had been asking for ALTER TABLE ONLINE when they needed to add new columns. They typically would have additional unused  columns which they’d rename them when needed.  Obviously, this wasn’t an optimal or a good solution because you could run out these columns, types or structure wasn’t flexible.

For years, customers had been ALTER TABLE while application were online.  I realized, JSON was a good way to provide it. That’s when we added JSON as a type, added sharding, and started to extend SQL to support JSON.  A good relational database can deliver lot of value JSON.  So, we developed a mechanism to exploit JSON to provide schema flexibility on relational tables and we received a patent for it: http://bit.ly/2og1uXe.

At Couchbase, data is stored as JSON and N1QL was designed as SQL for JSON.  So, we keep calm and JSON.

Gerald: I am not at Couchbase any more, but I was there for four years. Couchbase has its roots in memcached and CouchDB, so Couchbase was already storing and managing JSON data before N1QL and before we arrived. N1QL was developed to enable our users to query and manipulate their JSON data.

You’re both working on, or have worked on N1QL, CouchBase’s innovative SQL-esque query language for JSON. Has the SQL language come full circle? Why is the SQL syntax such a great fit for you?

Keshav: Don Chamberlin, co-inventor of SQL said this. During XQuery discussion, he argued against using SQL for manipulating XML. Now, with N1QL and SQL++ on JSON, he sees enormous possibilities for SQL to manipulate JSON effectively.  So, you could say, it has come a full circle.

From the Application point of view, requirement for SQL to support complex data models has been there.  SQL-99 added the structured type into the language, but didn’t recognize need for schema flexibility.

What Gerald has very nicely is to inherit SQL and  extend not just the language but the
underlying boolean logic.  

SQL has three valued logic: TRUE, FALSE, NULL.  N1QL has 4-valued logic: TRUE, FALSE, NULL and MISSING.  SQL has the select-join-project operations. N1QL has those and adds NEST and UNNEST operations for handling arrays.  Once we have the logic and operations, the type system, rest of the expressions for handling nested objects and arrays can be added.

There is another important change in N1QL compared to SQL.  SQL  is the query language to manipulate data.  N1QL can discover the document metadata (names, structure and types) dynamically and operate on it.

Gerald: SQL is the greatest and most successful query language of all time. Our job was to enable our users to query data that is sometimes different from what standard SQL expects. We hope N1QL does that.

As an aside, the “N1” in N1QL stands for non-first normal form. SQL geeks like you Lukas may know that “non-first” is one primary difference between JSON data and relational data. The other primary differences are schema and uniformity.

Will your relational competition steal features from N1QL? Or will you steal more features from them?

Keshav: I do hope relational databases steal features from N1QL. Having common approaches solve problems makes it easier for customers to choose the right database for the right problem. I do hope they choose Couchbase more often than RDBMS!

Gerald always says, we don’t differ from SQL unless there is a good reason. In that sense, we’ve taken lot of the features from relational databases already.  In addition, we learn from successful models in relational databases for things like index design, query optimization, security and monitoring.  We stand on shoulders of giants.

Gerald: We hope that both relational and non-relational vendors steal features from each other. There is a collaborative effort on something called SQL++, which is a superset of both SQL and N1QL. We hope SQL++ is the convergence point. One lesson from the success of SQL is that standards are great for both users and vendors.

CouchBase doesn’t have what you call a “static schema”. The biggest advantage of a static schema for the database optimiser is the many ways such a schema can be used to predict performance and choose optimal execution plans. How does optimisation work in a “schema-less” database?

Keshav: Actually, static schema gives you the structure, but not the data distribution, which is a major factor for calculating the execution cost. N1QL uses the information within the query and available indexes to glean the structure and decide on the plan. For example, if you have a query with a predicate:  WHERE state = “CA” and zipcode = “94040”, and there is an index on either state, zipcode or both, we’d assume these key-values exist in the document and push down the predicates to index scan.  We given further details on an article on DZone:  https://dzone.com/articles/a-deep-dive-into-couchbase-n1ql-query-optimization

Separately, we do have a mechanism to INFER the schema by sampling.  Right now, we show the inferred schema so users can understand the structure.  We also use the inferred schema to make query editing easier with hints and validations within the workbench.  We do have plans to use this, collect additional statistics to improve decisions in the optimizer.

Gerald: The N1QL optimizer is one of the joys of working on N1QL. As Keshav said, most of the SQL optimization techniques carry over to N1QL. The data may not have a static schema, but the query has an implicit schema, and the indexes have implicit schemas. That is, the query has predicates and other characteristics, and each index has keys and other characteristics. That is enough to keep an optimizer busy.

Working for a newer database vendor, you don’t have as much legacy, so you can innovate faster and more freely. What will be the next big thing in the database market?

Keshav: Our metric for innovation is the progress customer trying to make in their business. So, we innovate within the constraint of a customer job.  That helps us to innovate and measure its success from customer point of view.

We see customers deploying NoSQL databases for newer patterns like systems of engagement.

When you plan your vacation, you search a lot before you buy.  Search for places, hotels, airlines, things-to-do.  Then you compare costs, ratings, before you make the final purchase. For a travel company, these require significant database infrastructure to support high number of queries with low latencies and at a very low cost.

We see customers deploying NoSQL databases to support lot of the customer engagement and information requirement use cases.  RDBMS is still used as system of record for buy-sell-cancel-checkin-etc operations but will integrate with system of engagement databases to enhance customer experience.

Doing this effectively requires innovations in every area: data platform designs, scale out, query processing, index designs and manageability.

Gerald: I read Jeff Bezos’ latest annual letter to shareholders.  He says that the things that don’t change are more important than the things that do change. Databases should store your data reliably and give you answers and updates quickly. If database vendors continue to do that, we’ll be ok.

When to Choose SQL and When to Choose NoSQL

Some people make architecture decisions purely based on the loudest consultant:

For most others, however, decisions are not as simple as this. For instance: When should we start evaluating NoSQL storage systems as an alternative to RDBMS?

RDBMS as the hammer for all nails

This question obviously assumes that you’re starting out with an RDBMS, which is classically the database system that solves pretty much any problem decently enough not to be replaced easily. What does this mean? Simply put:

  • RDBMS have been around forever, so they have a huge advantage compared to “newcomers” in the market, who don’t have all the excellent tooling, community, support, maturity yet
  • E.F. Codd’s work may have been the single biggest influence on our whole industry. There has hardly been anything as revolutionary as the relational model ever since. It’s hard for an alternative database to be equally universal, i.e. they’re mostly solving niche problems

Having said so, sometimes you do have a niche problem. For instance a graph database problem. In fact, a graph is nothing fundamentally different from what you can represent in the relational model. It is easy to model a graph with a many-to-many relationship table.

The same is true for XML/JSON in the database (don’t forget, JSON is just XML with less syntax and less features, so it’s less awesome than XML). Sometimes you need to actually store document structures (hierarchical data) in their hierarchical form in your database, instead of normalising them. Sure, you could normalise a document, but that’s a lot of work and not necessarily beneficial. This was actually the question on Twitter by Gareth, that triggered the subsequent discussion:

Most modern RDBMS offer means to store (and much more importantly, query) XML/JSON data structures in simple forms. This includes PostgreSQL, Oracle, DB2, SQL Server and others.

So, when do we decide to switch?

As developers, we may be inclined to switch quite quickly. E.g. when working with graphs, we’d love to use Neo4j which implements the awesome Cypher query language. When working with JSON, we’d love to use something like Couchbase which implements the interesting N1QL query language. Both of these query languages are heavily inspired by SQL, which were wise vendor choices in my opinion (as opposed to MongoDB’s weird JSON-based query language), because in the end, the SQL language is still the most powerful and popular 4GL that was ever created.

But as developers, we shouldn’t make that decision lightly. Yes, at first, these specialised databases seem like a better fit. But the additional cost for operations teams to maintain, monitor, patch, tune production systems must not be underestimated. This is even true within the RDBMS ecosystem. A prominent recent example was Uber’s switch (back!) from PostgreSQL to MySQL:

Why Uber Engineering Switched from Postgres to MySQL

Do note that they switched the other way round before, only to regret things. Truth is, there are tons of reasons why your operations teams prefer to always use the same database, even if that’s quite expensive in terms of licensing. In many cases, however, it would be even more expensive to:

  • Engage in additional license and/or support contracts with new database vendors
  • Find skilled DBA for the new technology (can be very hard with niche databases)
  • Maintain two data silos and possibly sustain the cost of keeping them in sync

Ultimately, there’s a threshold:

In terms of using JSON in the database, it’s simple:

  • Occasional JSON storage: Stick with RDBMS.
  • Everything is JSON: Perhaps not RDBMS.

The same is true for graph problems. SQL can totally handle graphs and traverse them with recursive SQL. Here’s a funky statement that recursively calculates the subset sum problem:

subset-sum

Full article here: How to Find the Closest Subset Sum with SQL

If you only have a few tree/graph traversals to calculate (e.g. for a simple menu structure), don’t jump ship with RDBMS yet. If graphs are your business and that’s all you do, then probably RDBMS aren’t the right choice.

Conclusion

Whatever problem you’re solving. Do remember. Yes, all you have is a hammer, every problem starts looking like a nail. But don’t dismiss RDBMS as a silly hammer. It is a very mighty hammer and in 2016, it can do a lot of non-relational niche things decently well.

RDBMS is still the best default choice for all sorts of data problems. Only once you have gone above a certain threshold (or if you can foresee doing that), then you should look for alternatives. Because those alternatives will give you a much rougher time when you go outside of that niche (JSON, graphs, etc.) back to your “ordinary” relational business.