Filling in missing data with sequences of cardinal integers

When you’re extracting summaries of information from a database, sometimes rows are missing.  For example, if you are keeping track of number of customers by day, you might use a query like this:

You might get a result set like this:

2014-03-07 122
2014-03-08 355
2014-03-10 234
2014-03-11 119

Notice that the row for 2014-03-09 is missing from this sample result set. That’s a Sunday, and there were no visits because the shop is closed. That means the SQL summary omits that row from the summary.

What do you do about this if you want the result set, with the row included, like so?  You need to figure out a way to include it.

2014-03-07 122
2014-03-08 355
2014-03-09   0
2014-03-10 234
2014-03-11 119

You need to figure out a way to get a virtual table containing all the dates in the appropriate range, and then join them to the summary. Here’s a query that will get the dates in the range.

What’s going on here? Two things. First, we have a subquery which determines the first and last day (min and max time_of_visit) we care about reporting.

Second, we have a table called seq_0_to_999999. It contains a sequence of cardinal numbers: the integers starting at zero. More about this in a moment.

Joining these two tables together, then using the expression

mintime + INTERVAL seq.seq DAY AS shopdate

gives us a nice sequence of days. Finally, we can join that whole subquery to our original summary query, like this.

That looks like a complex query. But it is in fact a nesting of some simpler queries.

The sequence of cardinal integers

Finally, what about this sequence table? Where do we get those integers starting with zero?  The answer is this: we have to arrange to do that; those numbers aren’t built in to MySQL. They are built into the fork called MariaDB; read on for that.

The simple way is to create a table with a whole lot of integers in it.  That will take up storage, though.

Another way is to create a short table with the integers from 0-9 in it, like so:

Then we can create a view that join that table with itself to generate 1000 combinations like this:

Finally, we can join that table of 1000 numbers with itself to create a view that will generate a million combinations like this:

There is our sequence table. We have to be careful using this, because it only generates a million integers. That’s likely no problem if you’re summarizing data day-by-day. But if your’re working with seconds, it may not be enough: there are 31 536 000 seconds in a year. If you’re in that situation, you should make yourself another view that contains more integers. Be careful here: with modern computing equipment, a million-integer table performs tolerably well, but a billion-integer table doesn’t; it takes a couple of minutes to enumerate.

What if you can’t create tables or views?

Maybe you won’t be able to create tables and views in your RDMS for these cardinal integers. You may be developing an add-on query to a system in which you have no control over the schema, for example. In this case you need to figure out how to include your cardinal-number sequence as a nested subquery in the rest of your query. That’s also fairly simple, even if it makes for a longer query.

This subquery generates 5**11 sequence numbers (just under 49 million) and takes about a minute and a half. That’s enough numbers for a year’s worth of seconds.

If you don’t need that many, you can use this query to generate 5**9 numbers (a little over 1.9 million) in under four seconds.

The cardinal numbers in MariaDB 10.0.3 and better

In Version 10.0.3 or higher of the MySQL fork called MariaDB, there’s a storage engine known as the Sequence Storage Engine. Its purpose is to generate sequences of cardinal numbers efficiently. Using it is easy.

SELECT seq FROM seq_0_to_999999

retrieves the integers 0-999999. The trick here is the table name, which specifies the starting and ending point of the sequence of cardinal numbers. You do not have to create the sequence table: you can just use it. (And you cannot name other tables that way). Notice that you can also give a query like this, so you can save the trouble of using a table name that specifies precisely the end of your sequence.

SELECT seq FROM seq_0_to_999999 WHERE seq < 10000

That will also limit the number of items in your sequence, just as the examples above show.

MariaDB user Federico Razzoli described the Sequence Storage Engine here.

4 thoughts on “Filling in missing data with sequences of cardinal integers”

Leave a Comment