The following two indexes are redundant in most SQL databases:
CREATE INDEX i_actor_1 ON actor (last_name);
CREATE INDEX i_actor_2 ON actor (last_name, first_name);
It is usually safe to drop the first index, because all queries that query the LAST_NAME column only can still profit from the second index, I_ACTOR_2. The reason is that LAST_NAME is the first column of the composite index I_ACTOR_2 (it would be a different story if it weren’t the first column).
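For illustration, consider these two queries (the filter values are made up):

```sql
-- Can use I_ACTOR_2 efficiently: LAST_NAME is its leading column
SELECT * FROM actor WHERE last_name = 'MILLER';

-- Cannot use I_ACTOR_2 as an ordinary access predicate: FIRST_NAME is
-- not the leading column (at best, a skip scan or a full index scan)
SELECT * FROM actor WHERE first_name = 'EUGENE';
```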
Note: It is usually safe to drop the first index, because the benefits probably outweigh the cost:

Benefits of dropping:
- Less disk space is used
- Insertions, updates, and deletions are faster, because only one index needs maintaining

Costs of dropping:
- Querying via the remaining composite index can be slightly slower, as can be seen in the benchmark below
Let’s look at the costs of dropping the index in this particular case for Oracle, PostgreSQL, and SQL Server (beware, as always, when interpreting benchmarks: they depend heavily on a lot of context, especially data size!).
Oracle
Preparation:
CREATE TABLE t (
a NUMBER(10) NOT NULL,
b NUMBER(10) NOT NULL
);
INSERT INTO t (a, b)
SELECT level, level
FROM dual
CONNECT BY level <= 100000;
CREATE INDEX i1 ON t(a);
CREATE INDEX i2 ON t(a, b);
EXEC dbms_stats.gather_table_stats('TEST', 'T');
Benchmark:
SET SERVEROUTPUT ON
CREATE TABLE results (
run NUMBER(2),
stmt NUMBER(2),
elapsed NUMBER
);
DECLARE
v_ts TIMESTAMP WITH TIME ZONE;
v_repeat CONSTANT NUMBER := 2000;
BEGIN
-- Repeat benchmark several times to avoid warmup penalty
FOR r IN 1..5 LOOP
v_ts := SYSTIMESTAMP;
FOR i IN 1..v_repeat LOOP
FOR rec IN (
SELECT /*+INDEX(t i1)*/ * FROM t WHERE a = 1
) LOOP
NULL;
END LOOP;
END LOOP;
INSERT INTO results VALUES (r, 1,
SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
v_ts := SYSTIMESTAMP;
FOR i IN 1..v_repeat LOOP
FOR rec IN (
SELECT /*+INDEX(t i2)*/ * FROM t WHERE a = 1
) LOOP
NULL;
END LOOP;
END LOOP;
INSERT INTO results VALUES (r, 2,
SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
END LOOP;
FOR rec IN (
SELECT
run, stmt,
CAST(elapsed / MIN(elapsed) OVER() AS NUMBER(10, 5)) ratio
FROM results
)
LOOP
dbms_output.put_line('Run ' || rec.run ||
', Statement ' || rec.stmt ||
' : ' || rec.ratio);
END LOOP;
END;
/
DROP TABLE results;
The result being:
Run 1, Statement 1 : 1.4797
Run 1, Statement 2 : 1.45545
Run 2, Statement 1 : 1.1997
Run 2, Statement 2 : 1.01121
Run 3, Statement 1 : 1.13606
Run 3, Statement 2 : 1
Run 4, Statement 1 : 1.13455
Run 4, Statement 2 : 1.00242
Run 5, Statement 1 : 1.13303
Run 5, Statement 2 : 1.00606
Some notes on benchmarks here.
The fastest query execution in the above result yields 1; the other executions are expressed as multiples of it. As you can see, there is roughly a 10% difference between the two statements in this case. The benefits of dropping the index (faster insertions) should outweigh this cost (potentially slower queries), so don’t apply this advice in a read-heavy / write-rarely database.
PostgreSQL
A similar difference can be seen in a PostgreSQL benchmark. No hints can be used to choose indexes, so we’re simply creating two tables:
CREATE TABLE t1 (
a INT NOT NULL,
b INT NOT NULL
);
CREATE TABLE t2 (
a INT NOT NULL,
b INT NOT NULL
);
INSERT INTO t1 (a, b)
SELECT s, s
FROM generate_series(1, 100000) AS s(s);
INSERT INTO t2 (a, b)
SELECT s, s
FROM generate_series(1, 100000) AS s(s);
CREATE INDEX i1 ON t1(a);
CREATE INDEX i2 ON t2(a, b);
ANALYZE t1;
ANALYZE t2;
Benchmark:
DO $$
DECLARE
v_ts TIMESTAMP;
v_repeat CONSTANT INT := 10000;
rec RECORD;
BEGIN
-- Repeat benchmark several times to avoid warmup penalty
FOR r IN 1..5 LOOP
v_ts := clock_timestamp();
FOR i IN 1..v_repeat LOOP
FOR rec IN (
SELECT * FROM t1 WHERE a = 1
) LOOP
NULL;
END LOOP;
END LOOP;
RAISE INFO 'Run %, Statement 1: %', r,
(clock_timestamp() - v_ts);
v_ts := clock_timestamp();
FOR i IN 1..v_repeat LOOP
FOR rec IN (
SELECT * FROM t2 WHERE a = 1
) LOOP
NULL;
END LOOP;
END LOOP;
RAISE INFO 'Run %, Statement 2: %', r,
(clock_timestamp() - v_ts);
RAISE INFO '';
END LOOP;
END$$;
Result:
INFO: Run 1, Statement 1: 00:00:00.071891
INFO: Run 1, Statement 2: 00:00:00.080833
INFO: Run 2, Statement 1: 00:00:00.076329
INFO: Run 2, Statement 2: 00:00:00.079772
INFO: Run 3, Statement 1: 00:00:00.073137
INFO: Run 3, Statement 2: 00:00:00.079483
INFO: Run 4, Statement 1: 00:00:00.073456
INFO: Run 4, Statement 2: 00:00:00.081508
INFO: Run 5, Statement 1: 00:00:00.077148
INFO: Run 5, Statement 2: 00:00:00.083535
SQL Server
Preparation:
CREATE TABLE t (
a INT NOT NULL,
b INT NOT NULL
);
WITH s(s) AS (
SELECT 1
UNION ALL
SELECT s + 1 FROM s WHERE s < 100
)
INSERT INTO t
SELECT TOP 100000
row_number() over(ORDER BY (SELECT 1)),
row_number() over(ORDER BY (select 1))
FROM s AS s1, s AS s2, s AS s3;
CREATE INDEX i1 ON t(a);
CREATE INDEX i2 ON t(a, b);
UPDATE STATISTICS t;
Benchmark:
DECLARE @ts DATETIME;
DECLARE @repeat INT = 2000;
DECLARE @r INT;
DECLARE @i INT;
DECLARE @dummy VARCHAR;
DECLARE @s1 CURSOR;
DECLARE @s2 CURSOR;
DECLARE @results TABLE (
run INT,
stmt INT,
elapsed DECIMAL
);
SET @r = 0;
WHILE @r < 5
BEGIN
SET @r = @r + 1
SET @s1 = CURSOR FOR
SELECT b FROM t WITH (INDEX (i1)) WHERE a = 1;
SET @s2 = CURSOR FOR
SELECT b FROM t WITH (INDEX (i2)) WHERE a = 1;
SET @ts = current_timestamp;
SET @i = 0;
WHILE @i < @repeat
BEGIN
SET @i = @i + 1
OPEN @s1;
FETCH NEXT FROM @s1 INTO @dummy;
WHILE @@FETCH_STATUS = 0
BEGIN
FETCH NEXT FROM @s1 INTO @dummy;
END;
CLOSE @s1;
END;
DEALLOCATE @s1;
INSERT INTO @results
VALUES (@r, 1, DATEDIFF(ms, @ts, current_timestamp));
SET @ts = current_timestamp;
SET @i = 0;
WHILE @i < @repeat
BEGIN
SET @i = @i + 1
OPEN @s2;
FETCH NEXT FROM @s2 INTO @dummy;
WHILE @@FETCH_STATUS = 0
BEGIN
FETCH NEXT FROM @s2 INTO @dummy;
END;
CLOSE @s2;
END;
DEALLOCATE @s2;
INSERT INTO @results
VALUES (@r, 2, DATEDIFF(ms, @ts, current_timestamp));
END;
SELECT 'Run ' + CAST(run AS VARCHAR) +
', Statement ' + CAST(stmt AS VARCHAR) + ': ' +
CAST(CAST(elapsed / MIN(elapsed) OVER() AS DECIMAL(10, 5)) AS VARCHAR)
FROM @results;
Result:
Run 1, Statement 1: 1.22368
Run 1, Statement 2: 1.09211
Run 2, Statement 1: 1.05263
Run 2, Statement 2: 1.09211
Run 3, Statement 1: 1.00000
Run 3, Statement 2: 1.05263
Run 4, Statement 1: 1.05263
Run 4, Statement 2: 1.00000
Run 5, Statement 1: 1.09211
Run 5, Statement 2: 1.05263
As can be seen, the results are mixed in this particular benchmark: in PostgreSQL, the smaller single-column index is slightly faster, whereas in Oracle, the composite index wins, because it acts as a covering index for this query and thus avoids the table access entirely. In any case, the differences are small.
Both indexes can be used for the query in a reasonable way, so if disk space / insertion speed is an issue, the redundant single-column index can be dropped.
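Dropping the redundant index is then a one-liner; note that SQL Server additionally wants the table name:

```sql
-- Oracle, PostgreSQL
DROP INDEX i1;

-- SQL Server
DROP INDEX i1 ON t;
```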
How to find such indexes
The following query will help you detect such indexes in Oracle, PostgreSQL, and SQL Server:
Oracle
WITH indexes AS (
SELECT
i.owner,
i.index_name,
i.table_name,
listagg(c.column_name, ', ')
WITHIN GROUP (ORDER BY c.column_position)
AS columns
FROM all_indexes i
JOIN all_ind_columns c
ON i.owner = c.index_owner
AND i.index_name = c.index_name
GROUP BY i.owner, i.table_name, i.index_name, i.leaf_blocks
)
SELECT
i.owner,
i.table_name,
i.index_name AS "Deletion candidate index",
i.columns AS "Deletion candidate columns",
j.index_name AS "Existing index",
j.columns AS "Existing columns"
FROM indexes i
JOIN indexes j
ON i.owner = j.owner
AND i.table_name = j.table_name
AND j.columns LIKE i.columns || ',%';
Result:
TABLE_NAME delete index columns existing index columns
-------------------------------------------------------------------------
T I1 A I2 A, B
In short, it lists all the indexes whose columns are a prefix of another index’s columns.
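The mechanism behind this is plain string prefix matching on the aggregated column lists. A self-contained sketch of the comparison, using made-up column lists:

```sql
-- 'A' is a prefix of 'A, B': the LIKE pattern 'A,%' matches 'A, B'
SELECT CASE
  WHEN 'A, B' LIKE 'A' || ',%'
  THEN 'redundant prefix'
  ELSE 'not a prefix'
END AS result
FROM dual;
```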
PostgreSQL
Get ready for a really nifty query. Here’s how to discover redundant indexes in PostgreSQL, which unfortunately doesn’t seem to have an easy, out-of-the-box dictionary view to discover index columns:
WITH indexes AS (
SELECT
tnsp.nspname AS schema_name,
trel.relname AS table_name,
irel.relname AS index_name,
string_agg(a.attname, ', ' ORDER BY c.ordinality) AS columns
FROM pg_index AS i
JOIN pg_class AS trel ON trel.oid = i.indrelid
JOIN pg_namespace AS tnsp ON trel.relnamespace = tnsp.oid
JOIN pg_class AS irel ON irel.oid = i.indexrelid
JOIN pg_attribute AS a ON trel.oid = a.attrelid
JOIN LATERAL unnest(i.indkey)
WITH ORDINALITY AS c(colnum, ordinality)
ON a.attnum = c.colnum
GROUP BY i, tnsp.nspname, trel.relname, irel.relname
)
SELECT
i.table_name,
i.index_name AS "Deletion candidate index",
i.columns AS "Deletion candidate columns",
j.index_name AS "Existing index",
j.columns AS "Existing columns"
FROM indexes i
JOIN indexes j
ON i.schema_name = j.schema_name
AND i.table_name = j.table_name
AND j.columns LIKE i.columns || ',%';
This is a really nice case of lateral unnesting with ordinality, which you should definitely add to your PostgreSQL tool chain.
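As a minimal, standalone illustration of the construct (requires PostgreSQL 9.4 or later):

```sql
-- Unnests an array and numbers its elements
SELECT t.elem, t.ord
FROM unnest(ARRAY['a', 'b', 'c'])
WITH ORDINALITY AS t(elem, ord);

-- elem | ord
-- a    | 1
-- b    | 2
-- c    | 3
```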
SQL Server
Now, SQL Server doesn’t have a nice STRING_AGG function (yet), but we can work around this using STUFF and FOR XML PATH to get the same query.
Of course, there are other solutions using recursive SQL, but I’m too lazy to translate the simple string pattern-matching approach to something recursive.
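In isolation, the STUFF / FOR XML PATH aggregation trick looks like this (with made-up values):

```sql
SELECT STUFF((
  SELECT ',' + v.name
  FROM (VALUES ('a'), ('b'), ('c')) AS v(name)
  FOR XML PATH('')  -- concatenates to ',a,b,c'
), 1, 1, '');       -- removes the leading comma, yielding 'a,b,c'
```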
WITH
i AS (
SELECT
s.name AS schema_name,
t.name AS table_name,
i.name AS index_name,
c.name AS column_name,
ic.key_ordinal AS key_ordinal
FROM sys.indexes i
JOIN sys.index_columns ic
ON i.object_id = ic.object_id
AND i.index_id = ic.index_id
JOIN sys.columns c
ON ic.object_id = c.object_id
AND ic.column_id = c.column_id
JOIN sys.tables t
ON i.object_id = t.object_id
JOIN sys.schemas s
ON t.schema_id = s.schema_id
),
indexes AS (
SELECT
schema_name,
table_name,
index_name,
STUFF((
SELECT ',' + j.column_name
FROM i j
WHERE i.table_name = j.table_name
AND i.index_name = j.index_name
ORDER BY j.key_ordinal
FOR XML PATH('') -- Yay, XML in SQL!
), 1, 1, '') columns
FROM i
GROUP BY schema_name, table_name, index_name
)
SELECT
i.schema_name,
i.table_name,
i.index_name AS "Deletion candidate index",
i.columns AS "Deletion candidate columns",
j.index_name AS "Existing index",
j.columns AS "Existing columns"
FROM indexes i
JOIN indexes j
ON i.schema_name = j.schema_name
AND i.table_name = j.table_name
AND j.columns LIKE i.columns + '%'
AND i.index_name != j.index_name;
A note on partial indexes
SQL Server and PostgreSQL support “partial indexes”, i.e. indexes that contain only parts of your data (and Oracle can emulate them in various ways). Such indexes might appear in the resulting list, so be careful to check whether they’re really redundant. Chances are, they’re there for a very good reason.
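For example, in PostgreSQL a partial index could look like this (the ACTIVE column is hypothetical):

```sql
-- Indexes only the rows satisfying the predicate
CREATE INDEX i_actor_active ON actor (last_name)
WHERE active = true;
```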
Conclusion
Now go run the above query on your production database and… very carefully and reasonably think about whether you really want to drop those indexes ;)
You might want to include OWNER in that Oracle query to avoid any cross-schema naming issues.
Cheers,
Connor
Yeah, you’re right. Fixed.
And if you want to be REALLY careful with oracle there’s the nice feature to set an index INVISIBLE, which doesn’t drop the index, but prevents it from being used in selects, and at the same time it keeps the index updated.
After the index has been invisible for some time, when you’re reasonably sure that your production server won’t collapse, you can definitively drop it.
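In Oracle, that workflow consists of these statements:

```sql
-- The optimizer stops using the index, but it is still maintained
ALTER INDEX i_actor_1 INVISIBLE;

-- ... observe production for a while ...

-- Either drop it for good:
DROP INDEX i_actor_1;

-- ... or bring it back if it turns out to be needed after all:
ALTER INDEX i_actor_1 VISIBLE;
```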
Good point, thanks for pointing this out
+1 for invisible indexes here
I have a very similar example on slides 157-159 here:
https://www.slideshare.net/morgo/mysql-80-optimizer-guide/157
The example query I use is:
“Querying a composite index can be slightly slower as can be seen in the below benchmark”
This is only true for PostgreSQL here. For Oracle, the composite index is always faster! For SQL Server it’s less clear: when I ran it myself, the composite index was faster in 3 out of 5 tries. Btw, your result output for SQL Server is quite confusing ;)
Hmm, good point. The benchmark is clearly flawed, as the composite index is a covering index for the query, so no heap table access is required. That hardly ever reflects a real-world situation, though.
Why is the SQL Server output confusing?
There is a small bug in your code: in the SQL Server output it looks like you only measure Statement 1. Of course, it’s easy to fix by just changing 1 to 2 on line 65.
Oh yes, of course. Thanks!
The postgres query in “How to find such indexes” gives me a “syntax error at or near ‘WITH ORDINALITY’ LINE 13: WITH ORDINALITY AS c(colnum, ordinality)” which I cannot resolve.
Any ideas?
What PostgreSQL version are you using? WITH ORDINALITY was added in 9.4
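You can check the server version with either of:

```sql
SELECT version();
-- or
SHOW server_version;
```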
Yes, that was the issue! Works now!
Partial indexes should probably be wholly excluded from these comparisons. In PostgreSQL that would be implemented by limiting to indices where “indpred is null”.
Thank you very much for that feedback. You’re right, they should be treated with care. I’m not sure if they should be wholly excluded as a partial index might make an ordinary index redundant, although the analysis of such redundancy is certainly more complicated.
I’ve added a disclaimer to the article, at the bottom.
The SQL Server query has a couple of problems. One is that it doesn’t consider column order. index_column_id is selected from index_columns for no reason. Instead, select key_ordinal; then, ORDER BY key_ordinal in your XML PATH query for more reliable results.
The query also ignores included columns, so the user should take care that they don’t assume an index is equivalent when it might not be.
Finally, the WHERE clause is too restrictive. Indexes which are exactly the same won’t be matched by this query because “Col1,Col2” is not LIKE “Col1,Col2,%”.
Thanks a lot for your feedback, greatly appreciated. Interesting distinction between index_column_id and key_ordinal. The first works most of the times, but there seems to be a difference when key_ordinal = 0. I’ve fixed the ORDER BY part.
Yes, quite a few things are ignored for simplicity. The query cannot be used for automated replacements, I hope the article was clear enough about this. The queries are just tools to help find such indexes. Partial indexes are also not considered.
That’s an interesting take. Many RDBMS don’t allow “exactly the same” indexes for obvious reasons. I’ve fixed this part, too.
Thanks again for the feedback!
PostgreSQL also permits covering indexes. Tweaked the WHERE clause to be: