- Explore pandemic data with SQL
- Getting started with the data
- Database programming
- Find peak cases
The Times’s data is here.
https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv
If you’re using a browser, open the context menu (right-click) on that link and use the Save Link As ...
operation to put it into a file on your system. We’ll use it later to load the database.
Use your database interface program to log in to your database server, and figure out which database you’re connected to on your server. Give the command SELECT DATABASE(); /* mysql */
or SELECT current_catalog; /* postgresql */
. If you’re happy with the current database selection, you can proceed. We’ll create some tables and views here.
Create your us_counties
table. (Notice the underscore rather than the hyphen in the name.) You can use this DDL for that.
CREATE TABLE us_counties (
date DATE NOT NULL,
county VARCHAR(50) NOT NULL,
state VARCHAR(50) NOT NULL,
fips INT NULL DEFAULT NULL,
cases INT NULL DEFAULT NULL,
deaths INT NULL DEFAULT NULL,
PRIMARY KEY (date, state, county)
);
Load the data into the table. Your database interface program may have an Import from CSV feature, or you can use a database command like this one.
TRUNCATE us_counties;
LOAD DATA LOCAL INFILE 'F:\\Downloads\\us-counties.txt'
INTO TABLE us_counties
CHARACTER SET utf8
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(date, county, state, @fips, @cases, @deaths)
SET fips = IF(LENGTH(@fips)>0,@fips,NULL),
cases = IF(LENGTH(@cases)>0,@cases,NULL),
deaths = IF(LENGTH(@deaths)>0,@deaths,NULL);
Now, we can run queries on this data. For example, running this query
SELECT date, county, state, fips, cases, deaths
FROM us_counties
WHERE fips = 25009
AND date >= '2020-04-18'
AND date < '2020-04-22'
ORDER BY date;
should get us a result set like this one.
Date | County | State | FIPS | Cases | Deaths |
2020-04-18 | Essex | Massachusetts | 25009 | 4245 | 160 |
2020-04-19 | Essex | Massachusetts | 25009 | 5153 | 210 |
2020-04-20 | Essex | Massachusetts | 25009 | 5296 | 225 |
2020-04-21 | Essex | Massachusetts | 25009 | 5521 | 245 |
We can also search for other things. For example, this query finds the place with the largest number of deaths for each case — the largest death rate. Let’s exclude counties with less than 1000 cases when doing this. Also, we should exclude Unknown counties; they are just holding places for cases that later get assigned to a real county.
SELECT date, state, county, fips, cases, deaths, deaths/cases rate
FROM us_counties
WHERE county <> 'Unknown'
AND date >= '2020-11-01'
AND cases > 1000
ORDER BY deaths/cases DESC
LIMIT 1;
The New York Times case and death numbers are cumulative. That is, they record the totals. So we must compare consecutive days to find out the new cases and deaths reported on any given day. That can be done like this with a self-join. This query shows the pandemic’s (horrific) toll in Los Angeles since 1-Nov-2020.
SELECT today.date, DAYNAME(today.date) weekday,
today.state, today.county, today.fips,
today.cases - yesterday.cases newcases,
today.deaths - yesterday.deaths newdeaths
FROM us_counties today
JOIN us_counties yesterday
ON today.date = yesterday.date + INTERVAL 1 DAY
AND today.state = yesterday.state
AND today.county = yesterday.county
WHERE today.date >= '2020-11-01'
AND today.county = 'Los Angeles'
AND today.state = 'California'
ORDER BY today.date;
This result set highlights a quirk in the data. The weekend toll is extra low each week, and the Monday and Tuesday toll is extra high. That’s probably because some cases and deaths over the weekends. If we’re going to make sense of trends, though, we need to smooth out the data. This query reports a one-week moving average for cases and deaths. It uses SQL window functions to do the averaging, with a window frame covering seven days.
SELECT date, WEEKDAY(date) weekday,
county, state, fips,
cases,
AVG(cases) OVER (
PARTITION BY state, county
ORDER BY date
ROWS BETWEEN 6 PRECEDING
AND 0 FOLLOWING) as cases7,
deaths,
AVG(deaths) OVER (
PARTITION BY state, county
ORDER BY date
ROWS BETWEEN 6 PRECEDING
AND 0 FOLLOWING) as deaths7
FROM us_counties
WHERE date >= '2020-11-01'
AND county = 'Los Angeles'
AND state = 'California'
ORDER BY date
Try playing around with other queries to get a feel for the data. Exercise: get a result set with daily new cases for New York City for 15-Apr-2020 to 15-May-2020.
In the next part of this tutorial we’ll start doing some database programming. We’ll create and use some views of the data.