Chapter 2 Data sources

2.1 CPI

The CPI data is collected from BLS using API. BLS is the abbreviation of U.S. Bureau of Labor Statistics, where it provides public API to help people access the data. Price indexes are available for the U.S., the four Census regions, nine Census divisions, two size of city classes, eight cross-classifications of regions and size-classes, and for 23 local areas. Indexes are available for major groups of consumer expenditures (food and beverages, housing, apparel, transportation, medical care, recreation, education and communications, and other goods and services), for items within each group, and for special categories, such as services. We chose Chained CPI-ALL Urban Consumers as our indicator of living expense. It is introduced to better estimate the real living expense of people compare to classic CPI.

The data is collected using API. The item codes was used to generate API requests. After running for loop with each item codes, we saved our data to a public googlesheet so as to help others access the data more conveniently. The data we used is monthly data as mentioned in official introduction.

2.2 Bureau of Transportation Statistics

Monthly Transportation Statistics dataset is a collection of national statistical data on transportation. The Bureau of Transportation Statistics combines the latest data from across the Federal government and transportation industry. Monthly Transportation Statistics contains over 50 time series from nearly two dozen data sources. Apart from the Date attribute in the character format, it has 134 numeric transportation attributes(e.g. Highway fatalities, highway vehicle miles traveled for all systems, etc.), where we are specifically interested in U.S. Airline Traffic - Total - Seasonally Adjusted that records the the total number of passengers travelling by air from Jan 1947 to Jul 2021 in a monthly basis(We need both international and domestic passengers since they both have potential contributions to covid cases). However, lots of the attributes including our target have lots of missing values, U.S. Airline Traffic - Total - Seasonally Adjusted has no record prior to Jan 2017 in particular, which is not a major problem since we are only interested in air travel records from recent years.

The data can be directly downloaded from https://data.bts.gov/Research-and-Statistics/Monthly-Transportation-Statistics/crem-w557.

2.3 U.S. Field Production of Crude Oil

The oil data records the monthly field production of crude Oil thousand barrels from Jan 1920 to Sep 2021, where the attributes Month and U.S. Field Production of Crude Oil Thousand Barrels per Day are in the character and numeric format respectively. We would potentially trim the data before Jan 2017 in order to match the scope of the transportation statistics data above.

This data is also directly downloaded from https://www.eia.gov/dnav/pet/hist/LeafHandler.ashx?n=PET&s=MCRFPUS2&f=A

2.4 United States COVID-19 Cases and Deaths by State overTime

The covid data is collected by the centers for Disease Control and Prevention(CDC), which reports aggregate counts of daily confirmed and probable COVID-19 cases and death numbers online for different states in the US. There are 5 character variables, where the submission_date and created_at are supposed to be converted to date variable types and state contains the abbreviation of all states in the US. The other 10 are numeric variables, including confirmed total cases and death, probable total cases and death, confirmed new cases and death etc. There are only few missing reported cases for this data, so it would be reasonable to replace them with 0s given that we would combine the daily cases to monthly cases for each state via using sum. However, some attributes such as confirmed cases have negative values, which makes no sense in a real world setting and it’s definitely our concern to remove or replace them using some tactics. This is one of the main datasets of interest where we want to explore how the variation of covid cases affect CPI trends.

More details are on the website where the data is downloaded: https://data.cdc.gov/Case-Surveillance/United-States-COVID-19-Cases-and-Deaths-by-State-o/9mfq-cb36.