Clustered Indexes and efficiency

The primary key of every table has an index with it. In the InnoDB storage engine that index is the table: that’s what clustered index means. That is, the on-disk structure of the index contains all the other column values. There is no separate on-disk structure for the table’s data, rather it is all in the primary key’s index — the clustered index. (There are separate on-disk structures for TEXT and BLOB columns though.)

So, if you use this shape of query all the time to get some data …

SELECT * FROM tbl WHERE cust_id = ? AND order_date >= ?

your table should, if possible, have this structure.

CREATE TBL tbl (
  order_id BIGINT NOT NULL AUTO_INCREMENT,
  cust_id BIGINT NOT NULL,
  order_date DATE,
  this VARCHAR(50),
  that VARCHAR(50),
  PRIMARY KEY (cust_id, order_date, order_id),
  UNIQUE INDEX (order_id)
)

Notice how the first two columns of the primary key are the two colums referred to in the common SELECT statement. The third column is the auto-increment id of the table, and it is there to ensure that the composite primary key values are unique. There’s an additional unique index on the auto-increment id column, to make sure it stays unique.

The WHERE cust_id = ? AND order_date >= ? filter clause in our query above exploits the multi-column BTREE index that happens to be the primary key. It random-access the index for the first eligible row, then sequentially scans the index to the last eligible row. This is as good as this shape of query gets performance-wise.

But it gets even better. We’re scanning the clustered index that implements the primary key. So, as we scan each index entry we have direct access to the values of all the columns, so there’s no need to hit some other on-disk structure.

In contrast, the following conventional way of defining this table makes the SELECT statement just a bit slower because it has to hit the — ordinary non-clustered — index, and then bounce to the clustered index on the order-id.

CREATE TBL tbl (    /* A bit slower than optimal. */
  order_id BIGINT NOT NULL AUTO_INCREMENT,
  cust_id BIGINT NOT NULL,
  order_date DATE,
  this VARCHAR(50),
  that VARCHAR(50),
  PRIMARY KEY (order_id),
  INDEX cust_id_order_date (cust_id, order_date)
)

You will recognize this as the conventional way of defining a table, with an auto-incrementing surrogate primary key and a separate index supporting the common query pattern.

The legacy MyISAM storage engine in MySQL and MariaDB does not use these clustered indexes. Neither does PostgreSQL. Microsoft SQL server does use them.