Handling GitHub files in R

An example using COVID-19 data from Johns Hopkins

Favia

Johns Hopkins COVID-19 Data

Given recent updates to the way Johns Hopkins formats its COVID-19 data (read more here), reading the updated data has become easier. This also makes something like coronavirus: The 2019 Novel Coronavirus COVID-19 (2019-nCoV) Dataset more attractive for accessing up-to-date data, since it yields much of the same results, albeit without the time series Johns Hopkins is still making available.

When only the Daily Reports were available, it took some file management to maintain up-to-date access to the data. Figuring this out seemed like something that may come in handy for future endeavors, so I detail the steps here.

This convenience update did come at a cost, however, as recovered cases are no longer tracked in the new format. As of this post they are still tracked in the Daily Reports, so you could use the methods below to access that data, and the maintainers seem committed to keeping the old formats archived.

The process to handle files from a GitHub repo

The steps are fairly simple and utilize three packages:

library(dplyr)   # Grammar of data manipulation
library(readr)   # Imports rectangular text data (the daily CSV files) into R
library(stringr) # String handling for the file names pulled from GitHub

The first few steps are your basic file management. We want to define our working directory, the URL the repo will be downloaded from, and the directory where the data will be unpacked.

# Assigning my directory
setwd("~/GitHub/brandonpipher.com/content/post")
# Repo URL, found by copying link from "Clone or Download"
githubloc <-
  "https://github.com/CSSEGISandData/COVID-19/archive/master.zip"
# Where the repo will be downloaded and unpacked within the working directory
datadir <- "COVID-19"
destfile <- paste(datadir,"master.zip",sep = "/")

Adjust the above according to your own workflow.
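
One small note: download.file() cannot write into a folder that does not exist yet, so if the COVID-19 directory is not already present you may want to create it first. A minimal guard, assuming the datadir defined above:

# Create the data directory if it does not already exist
if (!dir.exists(datadir)) {
  dir.create(datadir, recursive = TRUE)
}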

Downloading the file and unpacking it is also straightforward.

# Downloading repo to destination
download.file(url = githubloc, destfile = destfile)
# Unzipping the downloaded zip into the data directory
unzip(zipfile = destfile, exdir = datadir)

Something worth considering is a check to ensure you aren’t re-downloading more than necessary.

In a reactive environment a function such as the following might be handy for calculating the time elapsed since the previous download.

# Minutes elapsed since a file was created (i.e. since the previous download)
minutesSince.DL <- function(fileName) {
  (as.numeric(as.POSIXlt(Sys.time())) -
     as.numeric(file.info(fileName)$ctime)) / 60
}
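
For example, a check like the following (a sketch, with the 60-minute threshold being an arbitrary choice) would only re-download and unzip the archive when the local copy is missing or more than an hour old:

# Re-download only if the zip is missing or older than 60 minutes
if (!file.exists(destfile) || minutesSince.DL(destfile) > 60) {
  download.file(url = githubloc, destfile = destfile)
  unzip(zipfile = destfile, exdir = datadir)
}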

From here most of the work regarding the initial data download has been completed. The next step is reading the data into R.

# Manually adding the folder names from the repo to determine the data location
covid.dailies.loc <-
  paste(
    datadir,
    "COVID-19-master",
    "csse_covid_19_data/csse_covid_19_daily_reports",
    sep = "/"
  )
covid.dailies.loc
## [1] "COVID-19/COVID-19-master/csse_covid_19_data/csse_covid_19_daily_reports"
# Creating a list of available files, but ignoring the README
covid.dailies <-
  list.files(covid.dailies.loc)[!grepl("README", list.files(covid.dailies.loc))]
covid.dailies
##  [1] "01-22-2020.csv" "01-23-2020.csv" "01-24-2020.csv" "01-25-2020.csv"
##  [5] "01-26-2020.csv" "01-27-2020.csv" "01-28-2020.csv" "01-29-2020.csv"
##  [9] "01-30-2020.csv" "01-31-2020.csv" "02-01-2020.csv" "02-02-2020.csv"
## [13] "02-03-2020.csv" "02-04-2020.csv" "02-05-2020.csv" "02-06-2020.csv"
## [17] "02-07-2020.csv" "02-08-2020.csv" "02-09-2020.csv" "02-10-2020.csv"
## [21] "02-11-2020.csv" "02-12-2020.csv" "02-13-2020.csv" "02-14-2020.csv"
## [25] "02-15-2020.csv" "02-16-2020.csv" "02-17-2020.csv" "02-18-2020.csv"
## [29] "02-19-2020.csv" "02-20-2020.csv" "02-21-2020.csv" "02-22-2020.csv"
## [33] "02-23-2020.csv" "02-24-2020.csv" "02-25-2020.csv" "02-26-2020.csv"
## [37] "02-27-2020.csv" "02-28-2020.csv" "02-29-2020.csv" "03-01-2020.csv"
## [41] "03-02-2020.csv" "03-03-2020.csv" "03-04-2020.csv" "03-05-2020.csv"
## [45] "03-06-2020.csv" "03-07-2020.csv" "03-08-2020.csv" "03-09-2020.csv"
## [49] "03-10-2020.csv" "03-11-2020.csv" "03-12-2020.csv" "03-13-2020.csv"
## [53] "03-14-2020.csv" "03-15-2020.csv" "03-16-2020.csv" "03-17-2020.csv"
## [57] "03-18-2020.csv" "03-19-2020.csv" "03-20-2020.csv" "03-21-2020.csv"
## [61] "03-22-2020.csv" "03-23-2020.csv" "03-24-2020.csv"

And from here we’ve gained access to all of the files in a way that stays up-to-date as new reports are added. If we want to access these files by tracking the range of available dates, we can use stringr to extract the vector of dates:

covid.dailies.dates <- covid.dailies %>% str_remove("\\.csv$")
covid.dailies.dates
##  [1] "01-22-2020" "01-23-2020" "01-24-2020" "01-25-2020" "01-26-2020"
##  [6] "01-27-2020" "01-28-2020" "01-29-2020" "01-30-2020" "01-31-2020"
## [11] "02-01-2020" "02-02-2020" "02-03-2020" "02-04-2020" "02-05-2020"
## [16] "02-06-2020" "02-07-2020" "02-08-2020" "02-09-2020" "02-10-2020"
## [21] "02-11-2020" "02-12-2020" "02-13-2020" "02-14-2020" "02-15-2020"
## [26] "02-16-2020" "02-17-2020" "02-18-2020" "02-19-2020" "02-20-2020"
## [31] "02-21-2020" "02-22-2020" "02-23-2020" "02-24-2020" "02-25-2020"
## [36] "02-26-2020" "02-27-2020" "02-28-2020" "02-29-2020" "03-01-2020"
## [41] "03-02-2020" "03-03-2020" "03-04-2020" "03-05-2020" "03-06-2020"
## [46] "03-07-2020" "03-08-2020" "03-09-2020" "03-10-2020" "03-11-2020"
## [51] "03-12-2020" "03-13-2020" "03-14-2020" "03-15-2020" "03-16-2020"
## [56] "03-17-2020" "03-18-2020" "03-19-2020" "03-20-2020" "03-21-2020"
## [61] "03-22-2020" "03-23-2020" "03-24-2020"
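
If you would rather work with actual Date objects than strings (say, for filtering or plotting), the file names follow a month-day-year pattern, so a conversion along these lines works; covid.dailies.dates.asdate is just an illustrative name:

# Convert the "MM-DD-YYYY" strings into Date objects
covid.dailies.dates.asdate <- as.Date(covid.dailies.dates, format = "%m-%d-%Y")
range(covid.dailies.dates.asdate)
## [1] "2020-01-22" "2020-03-24"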

From here, using covid.dailies.dates, we can access the data for any of the available dates. As an example:

# Pulling an example of the most recent dataset
example.input <- covid.dailies.dates[length(covid.dailies.dates)]
example.input
## [1] "03-24-2020"
# The example dataset
example.data <- read_csv(paste(covid.dailies.loc, paste0(example.input,'.csv'), sep = "/"))

And from here we could imagine example.input being assigned reactively, giving us a way of managing the data from GitHub (or any other public URL, really) in real time.

example.data %>% 
  group_by(`Country_Region`) %>% 
  summarize(Total = sum(Confirmed)) %>%
  arrange(desc(Total))
## # A tibble: 169 x 2
##    Country_Region Total
##    <chr>          <dbl>
##  1 China          81591
##  2 Italy          69176
##  3 US             53740
##  4 Spain          39885
##  5 Germany        32986
##  6 Iran           24811
##  7 France         22622
##  8 Switzerland     9877
##  9 Korea, South    9037
## 10 United Kingdom  8164
## # ... with 159 more rows
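
To make the reactive idea concrete, the read step could be wrapped in a small helper that takes a date string from covid.dailies.dates and returns the corresponding report; load_daily is a hypothetical name and this is only a sketch, not part of the workflow above:

# Hypothetical helper: read the daily report for a given "MM-DD-YYYY" string
load_daily <- function(report_date) {
  read_csv(paste(covid.dailies.loc, paste0(report_date, ".csv"), sep = "/"))
}
# In a Shiny app, report_date could come from an input control instead
latest <- load_daily(covid.dailies.dates[length(covid.dailies.dates)])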

How to access the data Post-Update

Coming back to the present, recall the updates that simplified this process. There is no longer any need to pull the individual daily files and wrangle the data manually.

Instead, one just goes to the GitHub repo, where access is as easy as pulling the files:

time_series_covid19_confirmed_global.csv
time_series_covid19_deaths_global.csv
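
For instance, read_csv() can pull one of these files directly from the repository's raw URL; the path below reflects the repo layout at the time of writing and may change:

# Read the global confirmed-cases time series straight from GitHub
confirmed_global <- read_csv(paste0(
  "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/",
  "csse_covid_19_data/csse_covid_19_time_series/",
  "time_series_covid19_confirmed_global.csv"
))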

Or, if you want to do it straight from a package hosted on CRAN and aren’t concerned about time-series data:

library(coronavirus)
head(coronavirus)
## # A tibble: 6 x 7
##   Province.State Country.Region   Lat  Long date       cases type     
##   <chr>          <chr>          <dbl> <dbl> <date>     <int> <chr>    
## 1 ""             Japan           35.7  140. 2020-01-22     2 confirmed
## 2 ""             South Korea     37.6  127. 2020-01-22     1 confirmed
## 3 ""             Thailand        13.8  101. 2020-01-22     2 confirmed
## 4 Anhui          Mainland China  31.8  117. 2020-01-22     1 confirmed
## 5 Beijing        Mainland China  40.2  116. 2020-01-22    14 confirmed
## 6 Chongqing      Mainland China  30.1  108. 2020-01-22     6 confirmed

More info about the coronavirus package can be found here.