Chapter 3 Data transformation

3.1 CPI data

The cpi data is accessed through cdc website api via googlesheets4::read_sheet. We firstly add the item names to our main data of interest by matching corresponding seriesID. Since all attributes are in the string format, it’s helpful to convert the item names into the factor type since we are interested in comparing time series for different items and cpi values into numeric values. We also create a new date attribute by concetenating year and month to get the full range of time series, which is further converted into the yearmon type. Finally, we create another data cpi_monthly.csv by pivot_wider to obtain monthly cpi values for all item names so that the number of observations are equivalent to the number of distinct year-months. This data will be used to merge with covid data below.

## # A tibble: 6 × 7
##   year  period periodName value seriesID    item_name date     
##   <chr> <chr>  <chr>      <dbl> <chr>       <fct>     <yearmon>
## 1 2021  M10    October     155. SUUR0000SA0 All items Oct 2021 
## 2 2021  M09    September   154. SUUR0000SA0 All items Sep 2021 
## 3 2021  M08    August      154. SUUR0000SA0 All items Aug 2021 
## 4 2021  M07    July        153. SUUR0000SA0 All items Jul 2021 
## 5 2021  M06    June        153. SUUR0000SA0 All items Jun 2021 
## 6 2021  M05    May         151. SUUR0000SA0 All items May 2021

3.1.1 Travel

For the monthly transportation statistics, we only extract the Date and U.S. Airline Traffic - Total - Seasonally Adjusted for our primary interest in transportation, where we convert the dates into the yearmon type and remove all NAs so that we are left with air passengers from Jan 2017 to Jul 2021.

## # A tibble: 6 × 136
##   Index Date   `Air Safety - Gen… `Highway Fatalitie… `Highway Fatali… `U.S. Airline Tr… `U.S. Airline Tra… `U.S. Airline Tra…
##   <dbl> <chr>               <dbl>               <dbl>            <dbl>             <dbl>              <dbl>              <dbl>
## 1     0 01/01…                 NA                  NA               NA                NA                 NA                 NA
## 2     1 02/01…                 NA                  NA               NA                NA                 NA                 NA
## 3     2 03/01…                 NA                  NA               NA                NA                 NA                 NA
## 4     3 04/01…                 NA                  NA               NA                NA                 NA                 NA
## 5     4 05/01…                 NA                  NA               NA                NA                 NA                 NA
## 6     5 06/01…                 NA                  NA               NA                NA                 NA                 NA
## # … with 128 more variables: Transit Ridership - Other Transit Modes - Adjusted <dbl>,
## #   Transit Ridership - Fixed Route Bus - Adjusted <dbl>, Transit Ridership - Urban Rail - Adjusted <dbl>,
## #   Freight Rail Intermodal Units <dbl>, Freight Rail Carloads <dbl>, Highway Vehicle Miles Traveled - All Systems <dbl>,
## #   Highway Vehicle Miles Traveled - Total Rural <dbl>, Highway Vehicle Miles Traveled - Other Rural <dbl>,
## #   Highway Vehicle Miles Traveled - Rural Other Arterial <dbl>, Highway Vehicle Miles Traveled - Rural Interstate <dbl>,
## #   State and Local Government Construction Spending - Breakwater/Jetty <dbl>,
## #   State and Local Government Construction Spending - Dam/Levee <dbl>, …

3.1.2 Oil

For oil data, we simply remove the observations prior to Jan 2017 to match the transportation data above and convert the dates into the yearmon type as discussed in section 2.

3.2 Covid data

3.2.1 United States COVID-19 Cases and Deaths by State overTime

Observe that covid data are daily based and cpi data are monthly based, we use dplyr to shrink covid data to monthly based by summing the number of cases and deaths for every state, where we denote this new dataset cdcmonth.csv after adjusting date into the yearmon format. We also noticed that for the attributes consent_cases andconsent_deaths some NAs are denoted as the string 'N\A', where it makes sense to replace them with actual NA since they are indeed missing values. In addition, we want to investigate cross state covid cases, where we want to focus on 50 main states. In particular, we have 60 states in total and the reported observations for New York City and Washington D.C. are separated from New York and Maryland respectively, where other states are disjoint from the 50. Hence, we adjust the observations for New York City and Washington D.C. and add them to New York and Maryland in cdcmonth.csv respectively. Furthermore, we replace the NAs in probable cases and deaths with 0s and further replace confirmed cases and deaths with total cases - probable cases and total deaths - probable deaths since some of the confirmed cases and deaths are < 0 as mentioned in section 2 and probable cases are considered to be unreliable records. We have also created 3 new datasets cdc_all.csv, cdc_monthly_all_state.csv and data_monthly.csv, where cdc_all.csv is obtained from getting the daily reported cases and deaths for the entire US by summing over the cases and death across all the states, then we further shrink the data to the national monthly cases and deaths cdc_monthly_all_state.csv by grouping the data by each month. We eventually merge cdc_monthly_all_state.csv and cpi_monthly.csv obtained above to conduct analysis between covid and cpi attributes as well as performing visualizations such as plotting us maps.

## # A tibble: 6 × 5
##   new_case new_death date    state statefull
##      <dbl>     <dbl> <chr>   <chr> <chr>    
## 1      227         3 2020-4  AK    Alaska   
## 2     4871         1 2021-4  AK    Alaska   
## 3     2297        19 2020-8  AK    Alaska   
## 4    12078        49 2021-8  AK    Alaska   
## 5    13978        76 2020-12 AK    Alaska   
## 6        0         0 2020-2  AK    Alaska

##   new_case new_death     date state statefull
## 1      227         3 Apr 2020    AK    Alaska
## 2     4871         1 Apr 2021    AK    Alaska
## 3     2297        19 Aug 2020    AK    Alaska
## 4    12078        49 Aug 2021    AK    Alaska
## 5    13978        76 Dec 2020    AK    Alaska
## 6        0         0 Feb 2020    AK    Alaska