A Guide to SQL Naming Conventions

One of Java’s big strengths, in my opinion, is the fact that most naming conventions have been established by the creators of the language. For example:

  • Class names are in PascalCase
  • Member names are in camelCase
  • Constants are in SNAKE_CASE

If someone does not adhere to these conventions, the resulting code quickly looks non-idiomatic.

What about SQL?

SQL is different. While some people claim UPPER CASE IS FASTEST:

Others do not agree on the “correct” case:

There seems to be a tendency towards writing identifiers in lower case, with no agreement on the case of keywords. Also, in most dialects, people prefer snake_case for identifiers, although in SQL Server, people seem to prefer PascalCase or camelCase.

That’s for style. And I’d love to hear your opinion on style and naming conventions in the comments!

What about naming conventions?

In many languages, naming conventions (of identifiers) is not really relevant, because the way the language designs namespacing, there is relatively little risk for conflict. In SQL, this is a bit different. Most SQL databases support only a 3-4 layered set of namespaces:

  1. Catalog
  2. Schema
  3. Table (or procedure, type)
  4. Column (or parameter, attribute)

Some dialect dependent caveats:

  • While SQL Server supports both catalog AND schema, most dialects only support one of them
  • MySQL treats the catalog (“database”) as the schema
  • Oracle supports a package namespace for procedures, between schema and procedure

In any case, there is no such concept as package (“schema”) hierarchies as there is in languages like Java, which makes namespacing in SQL quite tricky. A problem that can easily happen when writing stored procedures:

FUNCTION get_name (id NUMBER) IS
  result NUMBER;
BEGIN
  SELECT name
  INTO result
  FROM customer
  WHERE id = id; -- Ehm...

  RETURN result;
END;

As can be seen above, both the CUSTOMER.ID column as well as the GET_NAME.ID parameter could be resolved by the unqualified ID expression. This is easy to work around, but a tedious problem to think of all the time.

Another example is when joining tables, which probably have duplicate column names:

SELECT *
FROM customer c
JOIN address a ON c.id = a.customer_id

This query might produce two ambiguous ID columns: CUSTOMER.ID and ADDRESS.ID. In the SQL language, it is mostly easy to distinguish between them by qualifying them. But in clients (e.g. Java), they are less easy to qualify properly. If we put the query in a view, it gets even trickier.

“Hungarian notation”

Hence, SQL and the procedural languages are a rare case where some type of Hungarian notation could be useful. Unlike with hungarian notation itself, where the data type is encoded in the name, in this case, we might encode some other piece of information in the name. Here’s a list of rules I’ve found very useful in the past:

1. Prefixing objects by semantic type

Tables, views, and other “tabular things” may quickly conflict with each other. Especially in Oracle, where one does not simply create a schema because of all the security hassles this produces (schemas and users are kinda the same thing, which is nuts of course. A schema should exist completely independently from a user), it may be useful to encode a schema in the object name:

  • C_TABLE: The C_ prefix denotes “customer data”, e.g. as opposed to:
  • S_TABLE: The S_ prefix denotes “system data” or “master data”
  • L_TABLE: The L_ prefix denotes “log data”
  • V_VIEW: The V_ prefix denotes a view
  • P_PARAMETER: The P_ prefix denotes a procedure or function parameter
  • L_VARIABLE: The L_ prefix denotes a local variable

Besides, when using views for security and access control, one might have additional prefixes or suffixes to denote the style of view:

  • _R: The _R suffix denotes read only views
  • _W: The _W suffix denotes writeable (updatable) views

This list is obviously incomplete. I’m undecided whether this is necessarily a good thing in general. For example, should packages, procedures, sequences, constraints be prefixed as well? Often, they do not lead to ambiguities in namespace resolution. But sometimes they do. The importance, as always, is to be consistent with a ruleset. So, once this practice is embraced, it should be applied everywhere.

2. Singular or plural table names

Who cares. Just pick one and use it consistently.

3. Establishing standard aliasing

Another technique that I’ve found very useful in the past is a standard approach to aliasing things. We need to alias tables all the time, e.g. in queries like this:

SELECT *
FROM customer c
JOIN address a ON c.id = a.customer_id

But what if we have to join ACCOUNT as well? We already used A for ADDRESS, so we cannot reuse A. But if we don’t re-use the same aliases in every query, the queries start to be a bit confusing to read.

We could just not use aliases and always fully qualify all identifiers:

SELECT *
FROM customer
JOIN address ON customer.id = address.customer_id

But that quickly turns out to be verbose, especially with longer table names, so also not very readable. The standard approach to aliasing things I’ve found very useful is to use this simple algorithm that produces 4 letter aliases for every table. Given the Sakila database, we could establish:

PREFIX TABLE NAME
ACTO ACTOR
ADDR ADDRESS
CATE CATEGORY
CITY CITY
COUN COUNTRY
CUST CUSTOMER
FILM FILM
FIAC FILM_ACTOR
FICA FILM_CATEGORY
FITE FILM_TEXT
INVE INVENTORY
LANG LANGUAGE
PAYM PAYMENT
RENT RENTAL
STAF STAFF
STOR STORE

The algorithm to shorten a table name is simple:

  • If the name does not contain an underscore, take the four first letters, e.g CUSTOMER becomes CUST
  • If the name contains 1 underscore, take the first two letters of each word, e.g. FILM_ACTOR becomes FIAC
  • If the name contains 2 underscores, take the first two letters of the first word, and the first letter of the other words, e.g. FILM_CATEGORY_DETAILS becomes FICD
  • If the name contains 3 or more underscores, take the first letter of each word
  • If a new abbreviation causes a conflict with the existing ones, make a pragmatic choice

This technique worked well for large-ish schemas with 500+ tables. You’d think that abbreviations like FICD are meaningless, and indeed, they are, at first. But once you start writing a ton of SQL against this schema, you start “learning” the abbreviations, and they become meaningful.

What’s more, you can use these abbreviations everywhere, not just when writing joins:

SELECT 
  cust.first_name,
  cust.last_name,
  addr.city
FROM customer cust
JOIN address addr ON cust.id = addr.customer_id

But also when aliasing columns in views or derived tables:

SELECT 
  cust.first_name AS cust_first_name,
  cust.last_name AS cust_last_name,
  addr.city AS addr_city
FROM customer cust
JOIN address addr ON cust.id = addr.customer_id

This becomes invaluable when your queries become more complex (say, 20-30 joins) and you start projecting tons of columns in a library of views that select from other views that select from other views. It’s easy to keep consistent, and you can also easily recognise things like:

  • What table a given column originates from
  • If that column has an index you can use (on a query against the view!)
  • If two columns that look the same (e.g. FIRST_NAME) really are the same

I think that if you work with views extensively (I’ve worked with schemas of 1000+ views in the past), then such a naming convention is almost mandatory.

Conclusion

There isn’t really a “correct” way to name things in any language, including SQL. But given the limitations of some SQL dialects, or the fact that after joining, two names may easily conflict, I’ve found the above two tools very useful in the past: 1) Prefixing identifiers with a hint about their object types, 2) Establishing a standard for aliasing tables, and always alias column names accordingly.

When you’re using a code generator like jOOQ’s, the generated column names on views will already include the table name as a prefix, so you can easily “see” what you’re querying.

I’m curious about your own naming conventions, looking forward to your comments in the comment section!

Don’t Overdo the “Principle of Least Astonishment” Cargo Cult

As we all agree, GOTO is evil, right? Relevant XKCD

Or even funnier: New Intern Knows Best

Of course, GOTO isn’t evil

Of course, somewhere deep down in our professional selves, we know that GOTO isn’t evil, it’s just a very basic processor instruction that was available since the early days of assembly code. GOTO is everywhere. Take the following Java code, for instance:

public class Goto {
    public static void main(String[] args) {
        if (args.length == 0)
            System.out.println("No args");
        else
            System.out.println("Args");
    }
}

The bytecode that is generated from the above logic is this:

  // Method descriptor #15 ([Ljava/lang/String;)V
  // Stack: 2, Locals: 1
  public static void main(java.lang.String[] args);
     0  aload_0 [args]
     1  arraylength
     2  ifne 16
     5  getstatic java.lang.System.out : java.io.PrintStream [16]
     8  ldc  [22]
    10  invokevirtual java.io.PrintStream.println(java.lang.String) : void [24]
    13  goto 24
    16  getstatic java.lang.System.out : java.io.PrintStream [16]
    19  ldc  [30]
    21  invokevirtual java.io.PrintStream.println(java.lang.String) : void [24]
    24  return

There are two jumps. Byte code (like assembly code) doesn’t know anything about local scope nesting. There’s just a stream of instructions and we’re jumping around them. If goto were possible in Java (the keyword is reserved, so who knows about a future Java…?), we could write this:

public class Goto {
    public static void main(String[] args) {
        if (args.length != 0)
            goto args;

        System.out.println("No args");
        goto end;

        args:
        System.out.println("Args");

        end:
        return;
    }
}

Of course, the if-else construct is more readable and less error prone. Some of the biggest problems we’ve had with true GOTO instructions in the past is that they tend to lead to spaghetti code. In particular, notice how we’re not jumping from the “args statement” to the “end statement”. We’re just “falling through” like in a “bad” switch statement:

switch (i) {
    case 1:
    case 2:
        System.out.println("1 or 2");
        // no break
    case 3:
        System.out.println("3 (or 1 or 2!)");
}

If you will, the switch statement is very similar to a set of GOTO instructions.

The argument today is about readability, not executability

The case against “classic” GOTO is a case in favour of correctness.

These days, however, we don’t have to squeeze our code into 16KB of RAM anymore, these problems are problems of the past. So we’ve started “engineering” our code (whatever that means ;) ) according to “best practices”, most of which involve more humans than machines. We don’t really need “GOTO” anymore, since we have higher forms of control flow abstractions.

Java has “GOTO” in the form of labelled statements and break and continue statements. For instance:

Jumping forward

label: {
  // do stuff
  if (check) break label;
  // do more stuff
}

In bytecode:

2  iload_1 [check]
3  ifeq 6          // Jumping forward
6  ..

Jumping backward

label: do {
  // do stuff
  if (check) continue label;
  // do more stuff
  break label;
} while(true);

In bytecode:

 2  iload_1 [check]
 3  ifeq 9
 6  goto 2          // Jumping backward
 9  ..

Just recently, I’ve blogged about labelled statements making a case against extracting everything into a method. Why? Because a labelled statement is a (bit unusual) way of creating a local “method”. Squint really hard for a moment and let your imagination run wild with this:

Labelled statement Method
Label name Method name
Local variables prior to label Method arguments
Break statement Return statement
Continue statement Recursion (remember: squint :) )

So, if we admit the above, then labelled statements in Java are almost like one-shot local methods (oh how I wish Java had actual local methods). There’s no risk of doing things wrong like with GOTO. There’s no risk of any accidents in jumping to the wrong location, because we still have locally nested scope and everything is well defined. So, there’s no correctness argument against using labelled statements.

Principle of least astonishment

Why aren’t they used more often? Probably, because we rarely write such complicated imperative code where labelled statements become really useful. I do from time to time, especially inside jOOQ’s parser logic. Labelled continue is especially useful to break out of an inner loop, continuing the outer loop.

Labelled breaks (as mentioned in that previous article) are less commonly useful. Usually, it is easy to refactor an if-else branch into a more readable version (e.g. by inverting if and else) than by breaking out of it. But sometimes, breaking is the more readable version, e.g. because it leads to a more consistent indentation where each branch is at the same level of indentation.

Remember the analogy with methods? Break is the same as return. Every time you want to return early from a method, you could break early out of a statement. Why not? Does it have to be a new method every time? Certainly not. Sometimes, jumping to a method is more distractive than jumping locally (again, as long as Java doesn’t have local methods).

The important thing here is that simplicity and readability, just like beauty, are in the eye of the beholder. We’re in an area where we cannot clearly say that something is correct or wrong. Better or worse. Because again, unlike GOTO, breaking out of labelled statements yields no correctness risk.

So, people start arguing in favour of the “principle of least astonishment” (especially in the comments of the DZone version of my post). Sure. Labelled breaks are a bit harder to read, because we hardly ever see them in our code. That doesn’t mean they’re bad. We hardly ever use synchronized these days (with JDK library support having become much better), but that doesn’t mean synchronized is bad and shouldn’t be used. It’s rare, and thus we might get astonished. We don’t usually use exceptions as control flow signals, but why shouldn’t we? It is important to remind ourselves that the principle of least astonishment is a general guideline that allows for exceptions.

After all, astonishment is what makes us stop and think a bit more thoroughly about code. We might even learn something new, like the fact that we can break out of labelled statements.

I like Ryan James Spencer’s way of putting it:

Learning vim… It has become a joke that programmers don’t know how to exit vim:

(of course, it’s simply “:q!”, how hard can it be ;) )

The point is, yes this is surprising, astonishing, unexpected. If you’re not using vim frequently. If you’re using it frequently, it’s obvious, just like any other funky IDE keyboard shortcut. Or like using the middle mouse button to close tabs in almost any program (that was a revelation).

In order to be productive, we shouldn’t be astonished constantly. We should mostly be in our technical / infrastructure comfort zones, because most of us write business logic, and we want to spend our precious time and brain cells on that, not on the infrastructure logic. Everyone likes SQL. No one likes working around funky JDBC edge cases.

Another interesting reply by Chris Martin:

Absolutely. That was the point of my previous article. Breaking out of labelled statements is cool every now and then. It might be astonishing if you don’t write complex algorithms (like parsers) every day. It is not at all astonishing if you do.

So, please. Be open. Stop cargo culting. Use labelled statements and break out of them. Every once in a while, when you think that makes your code clearer.