October 16th, 2019
For nearly any business or organization operating across international borders, it is critical to understand trends in a country’s average life expectancy and its economic status as developing or developed. Making use of these statistics requires a predictive model. Fortunately, data on life expectancy and economic status, along with related health, social, and economic indicators, is readily available from the World Health Organization and the United Nations. This information raises two key questions for a data scientist. First, how can health, social, and economic data be used to predict a country’s life expectancy? Second, how can the same factors be used to predict a country’s economic status?
During the exploratory data analysis part of the project, we fitted a number of different models to predict life expectancy. After cross-validating the ridge, elastic net, and lasso regression models, we compared their mean squared prediction errors on the test set to find the best model. The first research question that we decided to explore further involved using lasso regression to predict the life expectancy variable from the other quantitative variables in the dataset. Before building another lasso model, we explored the distribution of the response variable. Life expectancy is left skewed, so we decided to transform it before doing further analysis. After a log transformation, life expectancy was still left skewed. We therefore used the transformTukey() function in the rcompanion package, which iteratively runs Shapiro-Wilk tests to find the lambda value that maximizes the W statistic (values of W closer to 1 indicate a closer fit to normality). This resulted in a more normally distributed variable.
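The lambda search described above can be sketched in a few lines. This is a minimal Python stand-in for the idea behind transformTukey(), not the rcompanion implementation and not the code used in the project. The synthetic sample, the search grid, and the function names tukey_transform and best_tukey_lambda are all assumptions made for illustration.

```python
import numpy as np
from scipy import stats


def tukey_transform(x, lam):
    """Tukey ladder of powers: x**lam, log(x), or -(x**lam) for positive x."""
    x = np.asarray(x, dtype=float)
    if np.isclose(lam, 0.0):
        return np.log(x)
    if lam > 0:
        return x ** lam
    return -(x ** lam)


def best_tukey_lambda(x, lambdas):
    """Return (lambda, W) for the candidate with the largest Shapiro-Wilk W."""
    scores = [(stats.shapiro(tukey_transform(x, lam))[0], lam) for lam in lambdas]
    w, lam = max(scores)
    return lam, w


# Synthetic left-skewed sample standing in for life expectancy (not the real data).
rng = np.random.default_rng(0)
life_exp = 90.0 - rng.gamma(shape=2.0, scale=5.0, size=500)

# Assumed search grid for illustration; it contains both 0 (log) and 1 (identity).
grid = np.arange(-2.0, 10.25, 0.25)
best_lam, best_w = best_tukey_lambda(life_exp, grid)
identity_w = stats.shapiro(life_exp)[0]
print(f"lambda = {best_lam:.2f}, W = {best_w:.4f} (untransformed W = {identity_w:.4f})")
```

Because the grid includes lambda = 1, the selected W can never be lower than the W of the untransformed variable. For left-skewed data the search should settle on a lambda above 1, which stretches the upper tail and compresses the lower one.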
Using the same process, the lasso model trained on the transformed life expectancy had a mean squared prediction error of 4280.47. Because this error is measured on the transformed scale, it cannot be compared directly with the mean squared prediction error of the previous model, which is on the original scale. The variables that were not included in the final lasso model were Measles, thinness among children from 5-9 years, total expenditure, infant deaths, and population. By comparison, in the previous lasso model only Measles and population had very small coefficients. The final lasso model included fewer variables, which indicates that it penalized the coefficients more heavily, shrinking more of them to exactly zero while minimizing the penalized sum of squared errors.
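One common way to put the two errors on the same footing is to back-transform the predictions to the original scale before computing the error. The sketch below illustrates this in Python with made-up numbers. The lambda value, the three test-set values, and the helper names mse and inverse_power are hypothetical and are not taken from the project.

```python
def mse(actual, predicted):
    """Mean squared error of two equal-length sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def inverse_power(values, lam):
    """Undo t = y**lam for lam > 0, returning values on the original scale."""
    return [v ** (1.0 / lam) for v in values]


lam = 3.0                           # hypothetical Tukey lambda
actual_years = [60.0, 70.0, 80.0]   # hypothetical test-set life expectancies

# Hypothetical model output on the y**lam scale (predictions of 62, 69, 79 years).
predicted_transformed = [62.0 ** lam, 69.0 ** lam, 79.0 ** lam]

# Error on the transformed scale: large, and not interpretable in years.
actual_transformed = [y ** lam for y in actual_years]
mse_transformed = mse(actual_transformed, predicted_transformed)

# Error after back-transforming the predictions: in squared years.
mse_years = mse(actual_years, inverse_power(predicted_transformed, lam))

print(f"MSE on transformed scale: {mse_transformed:.2f}")
print(f"MSE on original scale:    {mse_years:.2f}")
```

In this toy case the predictions miss by 2, 1, and 1 years, so the error on the original scale is about 2 squared years, while the same predictions give an error in the hundreds of millions on the cubed scale. The size of an error on the transformed scale therefore says little on its own about how the model compares with one fitted to the untransformed response.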