- Explore pandemic data with SQL
The New York Times newspaper publishes, every day, a file showing COVID-19 cases and fatalities for each county in the USA. They put their file on github here for anyone to use. So, let’s use it. Let’s download it and learn a few things about using SQL to explore data as we dig into it.
Their file is called
us-counties.csv. It’s a comma-separated-values text file. If we download it we can open and view it with Microsoft Excel, Libre Office Calc, or any other spreadsheet program. On doing that we see that the file contains one line — in database jargon, one row — for each date and county in the US. Here are some example lines from April 2020 in Essex County, Massachusetts, where I live.
As you can see, on those days the coronavirus was raging where I live. On April 19th for example, the public health department reported 108 new cases and 50 new deaths, bringing the total cases from 4245 to 5153, and the deaths from 160 to 210. By using a spreadsheet program and finding a small sample of rows, we can eyeball the data like I did in this example.
But the file contains, as of late June 2020, well over a quarter-million rows like the four shown here. We’d soon tire of eyeballing the data, especially if we were looking for trends or hotspots. That’s where SQL comes in.
What’s in this data? Six columns per row. Let’s take a look.
- Date: the date of the observation. Dates are formatted 2020-04-20, or year-month-day. That’s good: it’s a database-friendly format.
- County: the name of the county.
- State: the name of the state.
- FIPS: This five-digit number is a Federal Information Processing Standard number for each county. Essex is the 9th county and Massachusetts is the 25th state in that numbering system so we have the FIPS code 25009 here. This number may come in handy in the future if we want to download and use census data. So we’ll just keep it in place.
- Cases: the cumulative number of COVID-19 cases reported by public health authorities. This number keeps going up: it doesn’t count the number of people presently ill, just the number who are or have been ill. And, obviously it doesn’t include the number of people who got ill but didn’t seek medical attention or a test.
- Deaths: the cumulative number of deaths due to COVID-19 reported by public health authorities.
When exploring data like this, it’s smart to be a little skeptical. Let’s understand where it came from and how it is collected. The data-journalists at the Times publish a statement, here, telling us where they got it. Is the data somehow biased? Yes, of course it is. It doesn’t tell us how many people recovered, for one thing. For another, it depends on the way states’ departments of public health count cases. Massachusetts and New Hampshire may use subtly different criteria for counting cases. So the wise data explorer doesn’t draw stark conclusions from data like this, but instead uses it to suggest trends worth investigating.
Let’s get busy exploring this data with SQL. First, if you haven’t yet done so, you should download a database program and a database user-interface program to your computer. Most laptops and desktops made since 2010 or so can handle this amount of data easily, so there’s very likely no need to use a server computer or buy anything new. What you need is a database server, a program that can upload the
us-counties.csv file to it, and a program that lets you enter SQL queries and see the results.
I like to use the MariaDB fork of MySQL. It comes with a nice user-interface program called HeidiSQL to upload files and run SQL. You can download it from here. You can also use MySQL itself, from here. It comes with MySQL Workbench, another nice user-interface program. Install one of those, and follow its “getting started” guide to get things working.