Chapter 4 Missing values

4.1 CPI

There are no missing values for this data. This data contains essential economic metrics that are relevant to national development so that it’s rare to observe missing data.

4.2 Air Travel

For the variables of interest in the transportation data, we can observe that there is no missing value for date and around 90% of missing rows in ``Travel```, which justifies the trimming in section 3 where all observations before 2017 are emitted.

4.3 Oil

No missing data is observed in the oil data using complete.cases().

4.4 United States COVID-19 Cases and Deaths by State overTime:

Overall, there are only 21 distinct missing pattern for nearly 40000 rows of data. In particular, less than half of the rows are complete cases. Total confirmed death, total probable death, total probable cases and total confirmed cases have the highest numbers of missing data, where the percent of missing values are roughly identical to 20 percent. These four variables frequently have synchronous missing values in terms of rows. Only around 15 percent of total rows have these four variables missing at the same time, where the percentages of missing consent cases and consent deaths are roughly identical. The percentages of missing new probable cases and new probable deaths are also roughly identical. Though missing percentages for these four variables are significant as well, they are essentially lower than the previous four attributes, where all other attributes have no missing data.

There are also some correlations between variables. Consent cases and consent death variable indicates if the state consent to disclosure the data on cases and deaths. There are three values: ‘agree’, ‘not agree’ and NA (the missing value). After understanding the variable, we found that when consent cases is missing, total probable cases and total confirmed cases are missing as well. Same thing happen to consent deaths, total probable deaths and total confirmed deaths. Therefore, after the missing value analysis, we can conclude that missing values in consent cases and consent deaths is equivalent to ‘not agree’. And now the consent cases and consent death can be translate to a logical variable having 0 and 1 as values.