How to use SQL PIVOT to Compare Two Tables in Your Database

This can happen ever so easily. You adapt a table by adding a new column:

ALTER TABLE payments ADD code NUMBER(3);

You go on, implementing your business logic – absolutely no problem. But then, later on (perhaps in production), some batch job fails because it makes some strong assumptions about data types. Namely, it assumes that the two tables payments and payments_archive are of the same row type:

CREATE TABLE payments
  (
    id         NUMBER(18) NOT NULL,
    account_id NUMBER(18) NOT NULL,
    value_date DATE,
    amount     NUMBER(25, 2) NOT NULL
  );

CREATE TABLE payments_archive
  (
    id         NUMBER(18) NOT NULL,
    account_id NUMBER(18) NOT NULL,
    value_date DATE,
    amount     NUMBER(25, 2) NOT NULL
  );

Being of the same row type, you can simply move a row from one table to the other, e.g. using a query like this one:

INSERT INTO payments_archive
SELECT * FROM payments
WHERE value_date < SYSDATE - 30;

(not that using the above syntax is a good idea in general, it’s actually a bad idea. but you get the point)

What you’re getting now is this:

ORA-00913: too many values

The fix is obvious, but probably, the poor soul who has to fix this is not you, but someone else who has to figure out among possibly hundreds of columns, which ones don’t match. Here’s how (in Oracle):

Use PIVOT to compare two tables!

You could of course not use PIVOT and simply select all columns from either table from the dictionary views:

SELECT 
  table_name,
  column_name
FROM all_tab_cols
WHERE table_name LIKE 'PAYMENTS%'

This will produce the following result:

TABLE_NAME         COLUMN_NAME                  
------------------ ---------------
PAYMENTS           ID                             
PAYMENTS           ACCOUNT_ID                     
PAYMENTS           VALUE_DATE                     
PAYMENTS           AMOUNT                         
PAYMENTS           CODE                           
PAYMENTS_ARCHIVE   ID                             
PAYMENTS_ARCHIVE   ACCOUNT_ID                     
PAYMENTS_ARCHIVE   VALUE_DATE                     
PAYMENTS_ARCHIVE   AMOUNT    

Not very readable. You could of course use set operations and apply INTERSECT and MINUS (EXCEPT) to filter out matching values. But much better:

SELECT *
FROM (
  SELECT 
    table_name,
    column_name
  FROM all_tab_cols
  WHERE table_name LIKE 'PAYMENTS%'
) 
PIVOT ( 
  COUNT(*) AS cnt
  FOR (table_name) 
  IN (
    'PAYMENTS' AS payments, 
    'PAYMENTS_ARCHIVE' AS payments_archive 
  ) 
) t;

And the above now produces:

COLUMN_NAME  PAYMENTS_CNT PAYMENTS_ARCHIVE_CNT
------------ ------------ --------------------
CODE                    1                    0 
ACCOUNT_ID              1                    1 
ID                      1                    1 
VALUE_DATE              1                    1 
AMOUNT                  1                    1 

It is now very easy to identify the column that is missing from the PAYMENTS_ARCHIVE table. As you can see, the result from the original query produced one row per column AND per table. We took that result and pivoted it “FOR” the table name, such that we will now only get one row per column

How to read PIVOT?

It’s easy. Comments are inline:

SELECT *

-- This is the table that we're pivoting. Note that
-- we select only the minimum to prevent side-effects
FROM (
  SELECT 
    table_name,
    column_name
  FROM all_tab_cols
  WHERE table_name LIKE 'PAYMENTS%'
) 

-- PIVOT is a keyword that is applied to the above
-- table. It generates a new table, similar to JOIN
PIVOT (

  -- This is the aggregated value that we want to
  -- produce for each pivoted value
  COUNT(*) AS available 

  -- This is the source of the values that we want to
  -- pivot
  FOR (table_name) 

  -- These are the values that we accept as pivot
  -- columns. The columns names are produced from
  -- these values concatenated with the corresponding
  -- aggregate function name
  IN (
    'PAYMENTS' AS payments, 
    'PAYMENTS_ARCHIVE' AS payments_archive 
  ) 
) t;

That’s it. Not so hard, was it?

The nice thing about this syntax is that we can generate as many additional columns as we want, very easily:

SELECT *
FROM (
  SELECT 
    table_name,
    column_name,
    CAST(data_type AS varchar(6)) data_type
  FROM all_tab_cols
  WHERE table_name LIKE 'PAYMENTS%'
) 
PIVOT ( 
  COUNT(*) AS cnt,
  MAX(data_type) AS type -- new function here
  FOR (table_name) 
  IN (
    'PAYMENTS' AS p, 
    'PAYMENTS_ARCHIVE' AS a
  ) 
) t;

… producing (after additional erroneous DDL) …

COLUMN_NAME      P_CNT P_TYPE      A_CNT A_TYPE
----------- ---------- ------ ---------- ------
CODE                 1 NUMBER          0
ACCOUNT_ID           1 NUMBER          1 NUMBER 
ID                   1 NUMBER          1 NUMBER 
VALUE_DATE           1 DATE            1 TIMESTAMP
AMOUNT               1 NUMBER          1 NUMBER

This way, we can discover even more flaws between the different row types of the tables. In the above example, we’ve used MAX(), because we have to provide an aggregation function, even if each pivoted column corresponds to exactly one row in our example – but that doesn’t have to be.

What if I’m not using Oracle?

SQL Server also supports PIVOT, but other databases don’t. You can always emulate PIVOT using GROUP BY and CASE. The following statement is equivalent to the previous one:

SELECT
  t.column_name,
  COUNT(CASE table_name 
        WHEN 'PAYMENTS' THEN 1 END) p_cnt,
  MAX  (CASE table_name 
        WHEN 'PAYMENTS' THEN data_type END) p_type,
  COUNT(CASE table_name 
        WHEN 'PAYMENTS_ARCHIVE' THEN 1 END) a_cnt,
  MAX  (CASE table_name 
        WHEN 'PAYMENTS_ARCHIVE' THEN data_type END) a_type
FROM (
  SELECT 
    table_name,
    column_name,
    data_type
  FROM all_tab_cols
  WHERE table_name LIKE 'PAYMENTS%'
) t
GROUP BY
  t.column_name;

This query will now produce the same result on all the other databases as well.

Isn’t that… ?

Yes, it is! The above usage of aggregate functions in combination with CASE can be “shortened” (as in less keywords) even more, using the SQL standard FILTER clause, which we’ve blogged about recently.

So, in PostgreSQL, you could write the following query:

SELECT
  t.column_name,
  COUNT(table_name) 
    FILTER (WHERE table_name = 'PAYMENTS') p_cnt,
  MAX  (data_type) 
    FILTER (WHERE table_name = 'PAYMENTS') p_type,
  COUNT(table_name) 
    FILTER (WHERE table_name = 'PAYMENTS_ARCHIVE') a_cnt,
  MAX  (data_type) 
    FILTER (WHERE table_name = 'PAYMENTS_ARCHIVE') a_type
FROM (
  SELECT 
    table_name,
    column_name,
    data_type
  FROM information_schema.columns
  WHERE table_name LIKE 'payments%'
) t
GROUP BY
  t.column_name;

Further reading

Excited? Yes. There are more awesome SQL features in various databases. Read on about:

The Awesome PostgreSQL 9.4 / SQL:2003 FILTER Clause for Aggregate Functions

Sometimes when aggregating data with SQL, we’d love to add some additional filters. For instance, consider the following world bank data:

GDP per capita (current US$)

          2009    2010    2011    2012
CA      40,764  47,465  51,791  52,409	
DE      40,270  40,408  44,355  42,598	
FR      40,488  39,448  42,578  39,759	
GB      35,455  36,573  38,927  38,649	
IT      35,724  34,673  36,988  33,814	
JP      39,473  43,118  46,204  46,548	
RU       8,616  10,710  13,324  14,091	
US      46,999  48,358  49,855  51,755	

And the table structure:

CREATE TABLE countries (
  code CHAR(2) NOT NULL,
  year INT NOT NULL,
  gdp_per_capita DECIMAL(10, 2) NOT NULL
);

Now, let’s assume we’d like to find the number of countries with a GDP higher than 40,000 for each year.

With standard SQL:2003, and now also with the newly released PostgreSQL 9.4, we can now take advantage of the new FILTER clause, which allows us to write the following query:

SELECT
  year,
  count(*) FILTER (WHERE gdp_per_capita >= 40000)
FROM
  countries
GROUP BY
  year

The above query will now yield:

year   count
------------
2012   4
2011   5
2010   4
2009   4

And that’s not it! As always, you can use any aggregate function also as a window function simply by adding an OVER() clause to the end:

SELECT
  year,
  code,
  gdp_per_capita,
  count(*) 
    FILTER (WHERE gdp_per_capita >= 40000) 
    OVER   (PARTITION BY year)
FROM
  countries

The result would then look something like this:

year   code   gdp_per_capita   count
------------------------------------
2009   CA           40764.00       4
2009   DE           40270.00       4
2009   FR           40488.00       4
2009   GB           35455.00       4

jOOQ 3.6 will also support the new FILTER clause for aggregate functions

Good news for jOOQ users. You can write the same query with jOOQ intuitively as such:

DSL.using(configuration)
   .select(
        COUNTRIES.YEAR,
        count().filterWhere(
            COUNTRIES.GDP_PER_CAPITA.ge(40000)
        ))
   .from(COUNTRIES)
   .groupBy(COUNTRIES.YEAR)
   .fetch();

… and

DSL.using(configuration)
   .select(
        COUNTRIES.YEAR,
        COUNTRIES.CODE,
        COUNTRIES.GDP_PER_CAPITA
        count().filterWhere(
                   COUNTRIES.GDP_PER_CAPITA.ge(40000))
               .over(partitionBy(COUNTRIES.YEAR)))
   .from(COUNTRIES)
   .fetch();

And the best thing is that jOOQ (as usual) emulates the above clause for you if you’re not using PostgreSQL. The equivalent query would be:

SELECT
  year,
  count(CASE WHEN gdp_per_capita >= 40000 THEN 1 END)
FROM
  countries
GROUP BY
  year

jOOQ: The best way to write SQL in Java

Read more about what’s new in PostgreSQL 9.4 here