Chapter 3 Data transformation
3.1 CPI data
The cpi data is accessed through cdc website api via googlesheets4::read_sheet
. We firstly add the item names to our main data of interest by matching corresponding seriesID. Since all attributes are in the string format, it’s helpful to convert the item names into the factor type since we are interested in comparing time series for different items and cpi values into numeric values. We also create a new date attribute by concetenating year and month to get the full range of time series, which is further converted into the yearmon type. Finally, we create another data cpi_monthly.csv
by pivot_wider
to obtain monthly cpi values for all item names so that the number of observations are equivalent to the number of distinct year-months. This data will be used to merge with covid data below.
## # A tibble: 6 × 7
## year period periodName value seriesID item_name date
## <chr> <chr> <chr> <dbl> <chr> <fct> <yearmon>
## 1 2021 M10 October 155. SUUR0000SA0 All items Oct 2021
## 2 2021 M09 September 154. SUUR0000SA0 All items Sep 2021
## 3 2021 M08 August 154. SUUR0000SA0 All items Aug 2021
## 4 2021 M07 July 153. SUUR0000SA0 All items Jul 2021
## 5 2021 M06 June 153. SUUR0000SA0 All items Jun 2021
## 6 2021 M05 May 151. SUUR0000SA0 All items May 2021
3.1.1 Travel
For the monthly transportation statistics, we only extract the Date
and U.S. Airline Traffic - Total - Seasonally Adjusted
for our primary interest in transportation, where we convert the dates into the yearmon
type and remove all NA
s so that we are left with air passengers from Jan 2017 to Jul 2021.
## # A tibble: 6 × 136
## Index Date `Air Safety - Gen… `Highway Fatalitie… `Highway Fatali… `U.S. Airline Tr… `U.S. Airline Tra… `U.S. Airline Tra…
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 01/01… NA NA NA NA NA NA
## 2 1 02/01… NA NA NA NA NA NA
## 3 2 03/01… NA NA NA NA NA NA
## 4 3 04/01… NA NA NA NA NA NA
## 5 4 05/01… NA NA NA NA NA NA
## 6 5 06/01… NA NA NA NA NA NA
## # … with 128 more variables: Transit Ridership - Other Transit Modes - Adjusted <dbl>,
## # Transit Ridership - Fixed Route Bus - Adjusted <dbl>, Transit Ridership - Urban Rail - Adjusted <dbl>,
## # Freight Rail Intermodal Units <dbl>, Freight Rail Carloads <dbl>, Highway Vehicle Miles Traveled - All Systems <dbl>,
## # Highway Vehicle Miles Traveled - Total Rural <dbl>, Highway Vehicle Miles Traveled - Other Rural <dbl>,
## # Highway Vehicle Miles Traveled - Rural Other Arterial <dbl>, Highway Vehicle Miles Traveled - Rural Interstate <dbl>,
## # State and Local Government Construction Spending - Breakwater/Jetty <dbl>,
## # State and Local Government Construction Spending - Dam/Levee <dbl>, …
3.1.2 Oil
For oil data, we simply remove the observations prior to Jan 2017 to match the transportation data above and convert the dates into the yearmon
type as discussed in section 2.
3.2 Covid data
3.2.1 United States COVID-19 Cases and Deaths by State overTime
Observe that covid data are daily based and cpi data are monthly based, we use dplyr
to shrink covid data to monthly based by summing the number of cases and deaths for every state, where we denote this new dataset cdcmonth.csv
after adjusting date into the yearmon
format. We also noticed that for the attributes consent_cases
andconsent_deaths
some NA
s are denoted as the string 'N\A'
, where it makes sense to replace them with actual NA
since they are indeed missing values. In addition, we want to investigate cross state covid cases, where we want to focus on 50 main states. In particular, we have 60 states in total and the reported observations for New York City and Washington D.C. are separated from New York and Maryland respectively, where other states are disjoint from the 50. Hence, we adjust the observations for New York City and Washington D.C. and add them to New York and Maryland in cdcmonth.csv
respectively. Furthermore, we replace the NA
s in probable cases and deaths with 0s and further replace confirmed cases and deaths with total cases - probable cases and total deaths - probable deaths since some of the confirmed cases and deaths are < 0 as mentioned in section 2 and probable cases are considered to be unreliable records. We have also created 3 new datasets cdc_all.csv
, cdc_monthly_all_state.csv
and data_monthly.csv
, where cdc_all.csv
is obtained from getting the daily reported cases and deaths for the entire US by summing over the cases and death across all the states, then we further shrink the data to the national monthly cases and deaths cdc_monthly_all_state.csv
by grouping the data by each month. We eventually merge cdc_monthly_all_state.csv
and cpi_monthly.csv
obtained above to conduct analysis between covid and cpi attributes as well as performing visualizations such as plotting us maps.
## # A tibble: 6 × 5
## new_case new_death date state statefull
## <dbl> <dbl> <chr> <chr> <chr>
## 1 227 3 2020-4 AK Alaska
## 2 4871 1 2021-4 AK Alaska
## 3 2297 19 2020-8 AK Alaska
## 4 12078 49 2021-8 AK Alaska
## 5 13978 76 2020-12 AK Alaska
## 6 0 0 2020-2 AK Alaska
## new_case new_death date state statefull
## 1 227 3 Apr 2020 AK Alaska
## 2 4871 1 Apr 2021 AK Alaska
## 3 2297 19 Aug 2020 AK Alaska
## 4 12078 49 Aug 2021 AK Alaska
## 5 13978 76 Dec 2020 AK Alaska
## 6 0 0 Feb 2020 AK Alaska