- Explore pandemic data with SQL
- Getting started with the data
- Database programming
- Find peak cases
The Times’s data is here.
https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv
If you’re using a browser, open the context menu (right-click) on that link and use the Save Link As ...
operation to put it into a file on your system. We’ll use it later to load the database.
Use your database interface program to log in to your database server, and figure out which database you’re connected to on your server. Give the command SELECT DATABASE(); /* mysql */
or SELECT current_catalog; /* postgresql */
. If you’re happy with the current database selection, you can proceed. We’ll create some tables and views here.
Create your us_counties
table. (Notice the underscore rather than the hyphen in the name.) You can use this DDL for that.
CREATE TABLE us_counties ( date DATE NOT NULL, county VARCHAR(50) NOT NULL, state VARCHAR(50) NOT NULL, fips INT NULL DEFAULT NULL, cases INT NULL DEFAULT NULL, deaths INT NULL DEFAULT NULL, PRIMARY KEY (date, state, county) );
Load the data into the table. Your database interface program may have an Import from CSV feature, or you can use a database command like this one.
TRUNCATE us_counties; LOAD DATA LOCAL INFILE 'F:\\Downloads\\us-counties.txt' INTO TABLE us_counties CHARACTER SET utf8 FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n' IGNORE 1 LINES (date, county, state, @fips, @cases, @deaths) SET fips = IF(LENGTH(@fips)>0,@fips,NULL), cases = IF(LENGTH(@cases)>0,@cases,NULL), deaths = IF(LENGTH(@deaths)>0,@deaths,NULL);
Now, we can run queries on this data. For example, running this query
SELECT date, county, state, fips, cases, deaths FROM us_counties WHERE fips = 25009 AND date >= '2020-04-18' AND date < '2020-04-22' ORDER BY date;
should get us a result set like this one.
Date | County | State | FIPS | Cases | Deaths |
2020-04-18 | Essex | Massachusetts | 25009 | 4245 | 160 |
2020-04-19 | Essex | Massachusetts | 25009 | 5153 | 210 |
2020-04-20 | Essex | Massachusetts | 25009 | 5296 | 225 |
2020-04-21 | Essex | Massachusetts | 25009 | 5521 | 245 |
We can also search for other things. For example, this query finds the place with the largest number of deaths for each case — the largest death rate. Let’s exclude counties with less than 1000 cases when doing this. Also, we should exclude Unknown counties; they are just holding places for cases that later get assigned to a real county.
SELECT date, state, county, fips, cases, deaths, deaths/cases rate FROM us_counties WHERE county <> 'Unknown' AND date >= '2020-11-01' AND cases > 1000 ORDER BY deaths/cases DESC LIMIT 1;
The New York Times case and death numbers are cumulative. That is, they record the totals. So we must compare consecutive days to find out the new cases and deaths reported on any given day. That can be done like this with a self-join. This query shows the pandemic’s (horrific) toll in Los Angeles since 1-Nov-2020.
SELECT today.date, DAYNAME(today.date) weekday, today.state, today.county, today.fips, today.cases - yesterday.cases newcases, today.deaths - yesterday.deaths newdeaths FROM us_counties today JOIN us_counties yesterday ON today.date = yesterday.date + INTERVAL 1 DAY AND today.state = yesterday.state AND today.county = yesterday.county WHERE today.date >= '2020-11-01' AND today.county = 'Los Angeles' AND today.state = 'California' ORDER BY today.date;
This result set highlights a quirk in the data. The weekend toll is extra low each week, and the Monday and Tuesday toll is extra high. That’s probably because some cases and deaths over the weekends. If we’re going to make sense of trends, though, we need to smooth out the data. This query reports a one-week moving average for cases and deaths. It uses SQL window functions to do the averaging, with a window frame covering seven days.
SELECT date, WEEKDAY(date) weekday, county, state, fips, cases, AVG(cases) OVER ( PARTITION BY state, county ORDER BY date ROWS BETWEEN 6 PRECEDING AND 0 FOLLOWING) as cases7, deaths, AVG(deaths) OVER ( PARTITION BY state, county ORDER BY date ROWS BETWEEN 6 PRECEDING AND 0 FOLLOWING) as deaths7 FROM us_counties WHERE date >= '2020-11-01' AND county = 'Los Angeles' AND state = 'California' ORDER BY date
Try playing around with other queries to get a feel for the data. Exercise: get a result set with daily new cases for New York City for 15-Apr-2020 to 15-May-2020.
In the next part of this tutorial we’ll start doing some database programming. We’ll create and use some views of the data.