Getting started with the data - Plum Island Media

This entry is part 2 of 4 in the series Exploring data with SQL

The Times’s data is here.

https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv

If you’re using a browser, open the context menu (right-click) on that link and use the Save Link As ... operation to put it into a file on your system. We’ll use it later to load the database.

Use your database interface program to log in to your database server, and figure out which database you’re connected to on your server. Give the command SELECT DATABASE(); /* mysql */ or SELECT current_catalog; /* postgresql */. If you’re happy with the current database selection, you can proceed. We’ll create some tables and views here.

Create your us_counties table. (Notice the underscore rather than the hyphen in the name.) You can use this DDL for that.

CREATE TABLE us_counties (
  date DATE NOT NULL,
  county VARCHAR(50) NOT NULL,
  state VARCHAR(50) NOT NULL,
  fips INT NULL DEFAULT NULL,
  cases INT NULL DEFAULT NULL,
  deaths INT NULL DEFAULT NULL,
  PRIMARY KEY (date, state, county)
);

Load the data into the table. Your database interface program may have an Import from CSV feature, or you can use a database command like this one.

TRUNCATE us_counties;
LOAD DATA LOCAL INFILE 'F:\\Downloads\\us-counties.txt' 
     INTO TABLE us_counties 
     CHARACTER SET utf8
     FIELDS TERMINATED BY ','
     OPTIONALLY ENCLOSED BY '"'
     LINES TERMINATED BY '\n'
     IGNORE 1 LINES
     (date, county, state, @fips, @cases, @deaths)
     SET fips = IF(LENGTH(@fips)>0,@fips,NULL),
         cases = IF(LENGTH(@cases)>0,@cases,NULL),
         deaths = IF(LENGTH(@deaths)>0,@deaths,NULL);

Now, we can run queries on this data. For example, running this query

SELECT date, county, state, fips, cases, deaths
  FROM us_counties
 WHERE fips = 25009
   AND date >= '2020-04-18'
   AND date <  '2020-04-22'  
 ORDER BY date;

This query produces a result set like this:

Date	County	State	FIPS	Cases	Deaths
2020-04-18	Essex	Massachusetts	25009	4245	160
2020-04-19	Essex	Massachusetts	25009	5153	210
2020-04-20	Essex	Massachusetts	25009	5296	225
2020-04-21	Essex	Massachusetts	25009	5521	245

We can also search for other things. For example, this query finds the place with the largest number of deaths for each case — the largest death rate. Let’s exclude counties with less than 1000 cases when doing this. Also, we should exclude Unknown counties; they are just holding places for cases where the county was not yet known.

SELECT date, state, county, fips, cases, deaths, deaths/cases rate
  FROM us_counties
 WHERE county <> 'Unknown'
   AND date >= '2020-11-01'
   AND cases > 1000
 ORDER BY deaths/cases DESC
 LIMIT 1;

The New York Times case and death numbers are cumulative. That is, they record the totals. So we must compare consecutive days to find out the new cases and deaths reported on any given day. A self-join works to do this.. This query shows the pandemic’s (horrific) toll in Los Angeles since 1-Nov-2020.

SELECT today.date, DAYNAME(today.date) weekday,
       today.state, today.county, today.fips, 
       today.cases - yesterday.cases newcases,
       today.deaths - yesterday.deaths newdeaths
  FROM us_counties today
  JOIN us_counties yesterday
         ON today.date = yesterday.date + INTERVAL 1 DAY
        AND today.state = yesterday.state
        AND today.county = yesterday.county
 WHERE today.date >= '2020-11-01'
   AND today.county = 'Los Angeles'
   AND today.state = 'California'
 ORDER BY today.date;

This result set highlights a quirk in the data. The weekend toll is extra low each week, and the Monday and Tuesday toll is extra high. That’s probably because some cases and deaths over the weekends. If we’re going to make sense of trends, though, we need to smooth out the data. This query reports a one-week moving average for cases and deaths. It uses SQL window functions to do the averaging, with a window frame covering seven days.

SELECT date, WEEKDAY(date) weekday,
       county, state, fips,
       cases, 
       AVG(cases) OVER (
               PARTITION BY state, county 
                   ORDER BY date
               ROWS BETWEEN 6 PRECEDING
                        AND 0 FOLLOWING) as cases7,
       deaths,
       AVG(deaths) OVER (
               PARTITION BY state, county 
                   ORDER BY date
               ROWS BETWEEN 6 PRECEDING
                       AND 0 FOLLOWING) as deaths7
  FROM us_counties
  WHERE date >= '2020-11-01'
    AND county = 'Los Angeles'
    AND state = 'California'
  ORDER BY date

Try playing around with other queries to get a feel for the data. Exercise: get a result set with daily new cases for New York City for 15-Apr-2020 to 15-May-2020.

In the next part of this tutorial we’ll start doing some database programming. We’ll create and use some views of the data.

Exploring data with SQL

Leave a Comment Cancel reply