The purpose of our project is to effectively predict the test scores of high schoolers in 1986 from a large number of predictor variables, including gender, ethnicity, and family income. Our ultimate goal is to reason through these predictors to find the model that best predicts high school test scores.
We will be using the CollegeDistance dataset from the AER package, which can be found here. The data was obtained by the Department of Education and contains many different social and economic variables, including gender, ethnicity, whether or not the mother or father went to college, whether the family owns their home, the county unemployment rate, and more.
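Alternatively, the same dataset ships with the AER package itself, so the CSV step below could be skipped; a one-line equivalent:
# equivalent load directly from the AER package (assumes AER is installed);
# this report works from a CSV export instead
data("CollegeDistance", package = "AER")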
We first load CollegeDistance.csv into R along with some required packages (hidden):
collegedistance = read.csv("CollegeDistance.csv")
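The hidden setup chunk presumably loads the packages whose functions appear later in the report; the exact list below is our assumption, inferred from those calls:
# assumed package loads (hidden in the original report), inferred from
# the functions used later in the analysis
library(lmtest)    # bptest() for the Breusch-Pagan test
library(rsq)       # rsq() for adjusted R^2
library(car)       # vif() (faraway::vif would work equally well)
library(reshape2)  # melt() for the coefficient summary
library(ggplot2)   # for the correlation heatmap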
This dataset has a total of 4739 observations of 15 variables (14 predictors and 1 response). The variables are:

- gender: factor indicating gender
- ethnicity: factor indicating ethnicity (African-American, Hispanic, or other)
- score: base year composite test score
- fcollege: factor. Is the father a college graduate?
- mcollege: factor. Is the mother a college graduate?
- home: factor. Does the family own their home?
- urban: factor. Is the school in an urban area?
- unemp: county unemployment rate in 1980
- wage: state hourly wage in manufacturing in 1980
- distance: distance from a 4-year college (in 10 miles)
- tuition: average state 4-year college tuition (in 1000 USD)
- education: number of years of education
- income: factor. Is the family income above 25,000 USD per year?
- region: factor indicating region (West or other)

The dataset meets all of the set criteria for the project. Now let’s look for missing values as our next step in data cleaning.
We received a warning message when loading the data about an unnamed column. Taking a look, we saw that R had read the CSV’s row index in as an extra column, so we will remove it.
collegedistance = collegedistance[ , -1]
head(collegedistance)
## gender ethnicity score fcollege mcollege home urban unemp wage distance
## 1 male other 39.15 yes no yes yes 6.2 8.09 0.2
## 2 female other 48.87 no no yes yes 6.2 8.09 0.2
## 3 male other 48.74 no no yes yes 6.2 8.09 0.2
## 4 male afam 40.40 no no yes yes 6.2 8.09 0.2
## 5 female other 40.48 no no no yes 5.6 8.09 0.4
## 6 male other 54.71 no no yes yes 5.6 8.09 0.4
## tuition education income region
## 1 0.88915 12 high other
## 2 0.88915 12 low other
## 3 0.88915 12 low other
## 4 0.88915 12 low other
## 5 0.88915 13 low other
## 6 0.88915 12 low other
Next, we will see if there are any missing values within our dataset.
sum(is.na(collegedistance))
## [1] 0
Great! We see that there are no missing values in our dataset, so no additional work needs to be done there.
We also notice that many of our variables are factor variables, so we need to convert those to binary values. These factor predictors will be converted as follows:

- ethnicity: converted into two indicator variables, hispanic and afam; a value of 1 means the student is Hispanic or African-American, respectively
- gender: 1 for male, 0 for female
- fcollege: 1 for yes, 0 for no
- mcollege: 1 for yes, 0 for no
- home: 1 for yes, 0 for no
- urban: 1 for yes, 0 for no
- income: 1 for high, 0 for low
- region: 1 for west, 0 for other

We’ll use a function called cleanData() to modularize our code.
cleanData = function(data) {
  # convert the ethnicity factor into two indicators and drop the original
  data$hispanic = 1 * (data$ethnicity == "hispanic")
  data$afam = 1 * (data$ethnicity == "afam")
  data = data[-1 * which(names(data) == "ethnicity")]
  # (column, positive label) pairs for the remaining factor conversions;
  # note that c() flattens these into one vector of alternating entries
  convert_labels = c("fcollege", "yes", "mcollege", "yes", "home", "yes",
                     "urban", "yes", "income", "high", "region", "west",
                     "gender", "male")
  # loop through each (column, label) pair
  for (label_index in seq(1, length(convert_labels), 2)) {
    # get the column name and positive label name
    col_name = convert_labels[label_index]
    positive_label = convert_labels[label_index + 1]
    # convert the column to a 0/1 indicator
    data[col_name] = 1 * (data[col_name] == positive_label)
  }
  # return the cleaned data
  return(data)
}
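As a side note, the flattened pair vector could equivalently be written as a named vector, which makes the column/label pairing explicit; a sketch of the alternative (not the code we ran):
# drop-in alternative for the conversion loop inside cleanData():
# a named vector makes the column/label pairing explicit
positive_labels = c(fcollege = "yes", mcollege = "yes", home = "yes",
                    urban = "yes", income = "high", region = "west",
                    gender = "male")
for (col_name in names(positive_labels)) {
  data[col_name] = 1 * (data[col_name] == positive_labels[[col_name]])
}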
Now let’s take a look at the data with our adjusted variables
collegedistance = cleanData(collegedistance)
head(collegedistance)
## gender score fcollege mcollege home urban unemp wage distance tuition
## 1 1 39.15 1 0 1 1 6.2 8.09 0.2 0.88915
## 2 0 48.87 0 0 1 1 6.2 8.09 0.2 0.88915
## 3 1 48.74 0 0 1 1 6.2 8.09 0.2 0.88915
## 4 1 40.40 0 0 1 1 6.2 8.09 0.2 0.88915
## 5 0 40.48 0 0 0 1 5.6 8.09 0.4 0.88915
## 6 1 54.71 0 0 1 1 5.6 8.09 0.4 0.88915
## education income region hispanic afam
## 1 12 1 0 0 0
## 2 12 0 0 0 0
## 3 12 0 0 0 0
## 4 12 0 0 0 1
## 5 13 0 0 0 0
## 6 12 0 0 0 0
It appears that none of the other variables need to be changed.
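If we want to verify this, a quick check (output omitted) confirms that every column is now numeric, which also means we can compute correlations directly:
# confirm that every column is now numeric
all(sapply(collegedistance, is.numeric))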
Before we start creating our models, we’ll take a look at our variables to ensure that collinearity will not affect our model. To do this, we’ll need a correlation matrix.
# get the correlation matrix
college_cor = round(cor(collegedistance), 2)
# remove the NA values
college_cor[which(is.na(college_cor))] = 0
# summary of the correlation
head(college_cor)
## gender score fcollege mcollege home urban unemp wage distance
## gender 1.00 0.08 0.04 0.02 0.04 0.01 -0.03 0.03 0.00
## score 0.08 1.00 0.25 0.19 0.13 -0.09 -0.03 0.12 -0.07
## fcollege 0.04 0.25 1.00 0.43 0.08 -0.05 -0.10 0.03 -0.11
## mcollege 0.02 0.19 0.43 1.00 0.06 -0.03 -0.09 0.02 -0.08
## home 0.04 0.13 0.08 0.06 1.00 -0.10 0.01 0.07 0.02
## urban 0.01 -0.09 -0.05 -0.03 -0.10 1.00 -0.05 -0.03 -0.29
## tuition education income region hispanic afam
## gender 0.01 0.01 0.06 -0.01 0.02 -0.04
## score 0.13 0.47 0.18 -0.03 -0.16 -0.28
## fcollege 0.03 0.28 0.35 0.03 -0.11 -0.10
## mcollege 0.04 0.23 0.25 -0.01 -0.11 -0.02
## home 0.00 0.10 0.14 0.00 -0.06 -0.13
## urban -0.02 -0.02 -0.07 -0.05 0.08 0.18
This table is a little difficult to analyze, so we’ll convert it to a visual heatmap
# print our heatmap (code hidden above)
ggheatmap
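The construction of ggheatmap is hidden above; here is a minimal sketch of how such a heatmap could be built, assuming ggplot2 and reshape2:
# minimal sketch of the hidden heatmap code (assumes ggplot2 + reshape2)
melted_cor = melt(college_cor)
ggheatmap = ggplot(melted_cor, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "dodgerblue", mid = "white", high = "darkorange",
                       midpoint = 0, limits = c(-1, 1), name = "Correlation") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))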
From the above heatmap, we see that there are only a few variables that could be problematic. To further validate our findings, we’ll explore the variance inflation factors for a simple additive model
simple_add = lm(score ~ ., data = collegedistance)
vif(simple_add)
## gender fcollege mcollege home urban unemp wage distance
## 1.009046 1.399584 1.263883 1.051489 1.158750 1.264288 1.225818 1.256151
## tuition education income region hispanic afam
## 1.869172 1.135972 1.214674 1.578257 1.236430 1.180550
We can see from this model that there are no values greater than 5, so there appears to be no collinearity between our predictor variables.
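Rather than scanning the table by eye, the check can be made explicit with a one-liner:
# count predictors whose VIF exceeds the common threshold of 5
sum(vif(simple_add) > 5)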
We will now do a brief summary of all of our variables, just to see if there are any more that we can remove.
colMeans(collegedistance)
## gender score fcollege mcollege home urban unemp
## 0.4513610 50.8890293 0.2080608 0.1373708 0.8202152 0.2329605 7.5972146
## wage distance tuition education income region hispanic
## 9.5005065 1.8028698 0.8146082 13.8077654 0.2880355 0.1989871 0.1905465
## afam
## 0.1658578
Since all of these arithmetic means look good (in particular, none of the indicator variables have a mean of exactly 0 or 1, which would make them constant columns), we can start to build our model.
First, we’ll take a look at our response variable, score, to see if any transformations are necessary. We expect to see a normal distribution. The function buildHistogram() will help us see this visually.
buildHistogram = function(values, title) {
  # create the histogram of the standardized values
  freq_hist = hist((values - mean(values)) / sd(values),
                   col = "darkorange",
                   xlab = "Score",
                   main = title)
  # counts / density is constant across equal-width bins, so the first bin
  # gives the scale factor from the density scale up to the count scale
  multiplier = freq_hist$counts[1] / freq_hist$density[1]
  # overlay the scaled standard normal curve
  curve(multiplier * dnorm(x), col = "dodgerblue", lwd = 2, add = TRUE)
}
buildHistogram(collegedistance$score, "Frequency of Normalized Score")
The above plot looks fairly normal, so we conclude that a response transformation is not needed in our model. We’ll now see whether any predictor transformations, specifically for our numeric variables, might be necessary, using scatter plots and our own statistical judgment and intuition as a starting point. To start, we’ll create a helper function to visualize our variables plotted against score.
buildScatterPlot = function(dep_label, ind_label, color1, color2 = color1,
                            interaction = TRUE) {
  plot(y = unlist(collegedistance[tolower(dep_label)]),
       x = unlist(collegedistance[tolower(ind_label)]),
       xlab = ind_label,
       ylab = dep_label,
       main = paste(dep_label, " vs ", ind_label),
       # when a 0/1 vector is passed as `interaction`, color points by group
       col = ifelse(interaction, color1, color2),
       pch = 20,
       cex = 1,
       cex.lab = 2,
       cex.axis = 2,
       cex.main = 2)
}
Now we’ll make the plots for our numeric variables
par(mfrow = c(2, 2))
buildScatterPlot("Score", "Wage", "dodgerblue")
buildScatterPlot("Score", "Unemp", "darkorange")
buildScatterPlot("Score", "Distance", "firebrick")
buildScatterPlot("Score", "Tuition", "mediumpurple")
There appear to be no transformations needed, and unfortunately our data looks fairly random, possibly indicating that there is no strong trend between our variables. We also created charts to analyze simple interaction variables, but we were unsuccessful in finding trends between these interactions and score. This code can be found in the appendix of our project, under Interaction Visualization. Hence, it’s difficult for us to visually extract trends, and we must now rely on statistical metrics to continue.
To make it easier for us to narrow down our options when selecting a model, we will check our assumptions using a variety of functions/tests.
# performs the Breusch-Pagan test
get_bp = function(model, alpha = 0.01) {
bptest_res = bptest(model)
decision = bptest_res$p.value < alpha
return(list(decision = ifelse(decision, "Reject", "Fail To Reject"),
stat = bptest_res$statistic, pvalue = bptest_res$p.value))
}
# performs the Shapiro-Wilk test
get_shapiro = function(model, alpha = 0.01) {
shapiro_res = shapiro.test(resid(model))
decision = shapiro_res$p.value < alpha
return(list(decision = ifelse(decision, "Reject", "Fail to Reject"),
stat = shapiro_res$statistic, pvalue = shapiro_res$p.value))
}
# finds the LOOCV-RMSE of our model
get_rmse = function(model) {
sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}
# combines the above helper methods into a readable format
evaluate = function(name, model, response_func = I, data = collegedistance) {
set.seed(21)
data.frame(Model = name,
rmse = get_rmse(model),
adj_r2 = rsq(model, adj = TRUE),
aic = AIC(model),
bic = BIC(model),
coeff = length(coef(model)),
shapiro_dec = get_shapiro(model)$decision,
bp_dec = get_bp(model)$decision)
}
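Note that get_rmse() computes the LOOCV-RMSE with the leave-one-out shortcut for linear models, \(\text{RMSE}_{\text{LOOCV}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{e_i}{1-h_i}\right)^2}\), where \(e_i\) are the residuals and \(h_i\) the hat values, so the model never has to be refit.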
Next, let’s start creating some basic models.
# Start with a model with all predictors.
full_score_model = lm(score ~ ., data = collegedistance)
full_score_eval = evaluate("All Additive Model", full_score_model)
full_score_eval
## Model rmse adj_r2 aic bic coeff shapiro_dec
## BP All Additive Model 7.107605 0.334981 32038.51 32141.92 15 Reject
## bp_dec
## BP Reject
Now that we have a baseline quality criterion, let’s compare this to a model with the predictors chosen to be the most significant from the heat map.
# Comparing with a model using selected predictors.
smaller_add_model = lm(score ~ education + fcollege + mcollege + wage + tuition +
gender + home + income, data = collegedistance)
smaller_add_eval = evaluate("Smaller Add Model", smaller_add_model)
smaller_add_eval
## Model rmse adj_r2 aic bic coeff shapiro_dec
## BP Smaller Add Model 7.499351 0.2586984 32547.14 32611.77 9 Reject
## bp_dec
## BP Reject
We see that the additive model with chosen predictors has a lower adjusted \(R^2\) and a higher BIC, meaning the model wasn’t improved by removing those predictors. Let’s try comparing the full additive model to the interaction model.
full_int_mod = lm(score ~ .^2, data = collegedistance)
full_int_eval = evaluate("All Interaction Model", full_int_mod)
full_int_eval
## Model rmse adj_r2 aic bic coeff shapiro_dec
## BP All Interaction Model 7.130664 0.3437096 32064.74 32749.88 106 Reject
## bp_dec
## BP Reject
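Since evaluate() returns a one-row data frame, the stored results can also be stacked for a side-by-side comparison:
# stack the stored evaluations for easier comparison
rbind(full_score_eval, smaller_add_eval, full_int_eval)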
We see that the interaction model has a higher adjusted \(R^2\) than both the full additive and the smaller models. Let’s use an ANOVA \(F\)-test to choose between the All Additive Model and the All Interaction Model.
anova(full_score_model, full_int_mod)
## Analysis of Variance Table
##
## Model 1: score ~ gender + fcollege + mcollege + home + urban + unemp +
## wage + distance + tuition + education + income + region +
## hispanic + afam
## Model 2: score ~ (gender + fcollege + mcollege + home + urban + unemp +
## wage + distance + tuition + education + income + region +
## hispanic + afam)^2
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4724 237888
## 2 4634 230293 90 7595 1.6981 5.067e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is extremely low, \(5.067 \times 10^{-5}\), so between the two, we choose the interaction model to begin narrowing down our predictors. Also, since AIC penalizes model size less harshly than BIC, it suits a search over this large interaction space, so we will move forward with using backwards AIC to search for and narrow down parameters. We begin the backwards parameter search:
full_int_mod = lm(score ~ . ^ 2, data = collegedistance)
distance_aic = step(full_int_mod, direction = "backward", trace = 0)
coef(distance_aic)
## (Intercept) gender fcollege mcollege
## 27.51578466 -8.45366574 0.81384297 3.99964110
## home urban unemp wage
## 0.18519034 -2.03672698 -0.06018998 -0.10246390
## distance tuition education income
## -0.44747309 1.37291771 1.80866472 -0.97737050
## region hispanic afam gender:home
## 2.97861345 -5.12992820 -13.38382092 -0.86720151
## gender:urban gender:wage gender:education gender:hispanic
## 0.77074554 0.38898937 0.43096415 1.14001836
## gender:afam fcollege:income mcollege:education mcollege:income
## 1.27223670 0.84145397 -0.25943771 1.72731246
## home:urban home:unemp home:distance home:tuition
## 1.48638416 -0.19985687 0.18255690 1.69659493
## home:region unemp:distance wage:region wage:afam
## 1.36196167 0.03505865 0.42560855 0.76479061
## distance:hispanic distance:afam tuition:region tuition:hispanic
## -0.18692067 -0.34256326 -5.12187834 1.54255097
## education:region income:hispanic
## -0.35646928 1.40776578
Now that we have our significant coefficients, we can attempt to create a larger model and run backwards AIC once again.
distance_larger = lm(score ~ .^2 + I(gender^2) + I(fcollege^2) + I(mcollege^2) +
I(home^2) + I(urban^2) + I(unemp^2) + I(wage^2) +
I(distance^2) + I(tuition^2) + I(education^2) +
I(hispanic^2) + I(afam^2),
data = collegedistance)
distance_aic2 = step(distance_larger, direction = "backward", trace = 0)
coef(distance_aic2)
## (Intercept) gender fcollege mcollege
## -6.92257098 -8.81842187 0.75287995 0.24920943
## home urban unemp wage
## 0.16661740 -0.70409981 0.03838828 -0.08707851
## distance tuition education income
## -0.41932121 1.02692976 6.77157103 -0.99153378
## region hispanic afam I(distance^2)
## 1.36351911 -5.84466691 -13.55114414 0.02622444
## I(education^2) gender:home gender:urban gender:wage
## -0.17663289 -0.79186265 0.76748589 0.28024332
## gender:tuition gender:education gender:region gender:hispanic
## 1.25447229 0.43641796 1.05006853 1.25172546
## gender:afam fcollege:income mcollege:income home:urban
## 1.31295054 0.89990994 1.65491986 1.45738018
## home:unemp home:distance home:tuition home:region
## -0.20758448 0.22831047 1.64485020 1.29236644
## urban:unemp unemp:hispanic wage:region wage:afam
## -0.19133902 0.22898793 0.54209956 0.76591562
## distance:hispanic distance:afam tuition:region education:region
## -0.26260115 -0.33830105 -5.28140769 -0.35341008
## income:hispanic
## 1.55536706
Now we’ll evaluate our first AIC model, which has 38 coefficients
evaluate("All Interactions", distance_aic)
## Model rmse adj_r2 aic bic coeff shapiro_dec
## BP All Interactions 7.055426 0.3478309 31968.91 32220.99 38 Reject
## bp_dec
## BP Reject
And our second AIC model, which has 41 coefficients
evaluate("Some interactions", distance_aic2)
## Model rmse adj_r2 aic bic coeff shapiro_dec
## BP Some interactions 7.042325 0.3505196 31952.31 32223.78 41 Reject
## bp_dec
## BP Reject
By looking at the parameters returned by backwards AIC, we can narrow them down to create a ‘better’ model. Note: this model does have many parameters, but we will factor in the number of parameters used when looking at metrics like the adjusted \(R^2\) and ANOVA test statistics.
# create the model with large parameters
large_param_model = lm(score ~ gender + fcollege + fcollege:mcollege + home +
urban + wage + distance + I(distance^(0.25)) +
tuition + education + income + hispanic + afam +
I(distance^2) + I(tuition^2) + I(education^2) +
gender:home + gender:wage + gender:education +
gender:hispanic + gender:afam + fcollege:income +
mcollege:income + home:urban + home:unemp +
home:distance + unemp:distance +
unemp:hispanic + wage:hispanic + wage:afam +
distance:hispanic + distance:afam +
tuition:education + income:hispanic + I(tuition^3) +
log(tuition) + I(education^3)+ log(education) +
I(tuition^0.25) + I(education^0.25) + I(wage^2) +
gender:I(wage^2) + gender:I(log(wage)) +
hispanic:I(wage^2) + hispanic:I(log(wage)) +
home:I(unemp^2) + home:I(unemp^0.25) +
home:I(log(unemp)) + home:I(distance^2) +
home:I(distance^0.25) + unemp:I(distance^2) +
unemp:I(distance^0.25) + hispanic:I(distance^2) +
hispanic:I(distance^0.25) + afam:I(distance^2) +
afam:I(distance^0.25) + region,
data = collegedistance)
# call the helper function
evaluate("Large Parameter Model", large_param_model)
## Model rmse adj_r2 aic bic coeff shapiro_dec
## BP Large Parameter Model 7.045638 0.351877 31959.21 32340.56 58 Reject
## bp_dec
## BP Reject
By running this model through our quality criterion function, we see that our LOOCV-RMSE, adjusted \(R^2\), AIC, and BIC numbers have all improved noticeably over the baseline additive model. However, the assumption tests still fail.
Finally, we will perform ANOVA tests on all three of our contenders to erase any doubt when choosing the best model to move forward with.
anova(distance_aic, distance_aic2, test = "F")
## Analysis of Variance Table
##
## Model 1: score ~ gender + fcollege + mcollege + home + urban + unemp +
## wage + distance + tuition + education + income + region +
## hispanic + afam + gender:home + gender:urban + gender:wage +
## gender:education + gender:hispanic + gender:afam + fcollege:income +
## mcollege:education + mcollege:income + home:urban + home:unemp +
## home:distance + home:tuition + home:region + unemp:distance +
## wage:region + wage:afam + distance:hispanic + distance:afam +
## tuition:region + tuition:hispanic + education:region + income:hispanic
## Model 2: score ~ gender + fcollege + mcollege + home + urban + unemp +
## wage + distance + tuition + education + income + region +
## hispanic + afam + I(distance^2) + I(education^2) + gender:home +
## gender:urban + gender:wage + gender:tuition + gender:education +
## gender:region + gender:hispanic + gender:afam + fcollege:income +
## mcollege:income + home:urban + home:unemp + home:distance +
## home:tuition + home:region + urban:unemp + unemp:hispanic +
## wage:region + wage:afam + distance:hispanic + distance:afam +
## tuition:region + education:region + income:hispanic
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4701 232156
## 2 4698 231051 3 1104.7 7.487 5.36e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
With a p-value of \(5.36 \times 10^{-5}\), the test prefers the second model, distance_aic2. Now, let’s compare distance_aic2 with large_param_model.
anova(distance_aic2, large_param_model, test = "F")
## Analysis of Variance Table
##
## Model 1: score ~ gender + fcollege + mcollege + home + urban + unemp +
## wage + distance + tuition + education + income + region +
## hispanic + afam + I(distance^2) + I(education^2) + gender:home +
## gender:urban + gender:wage + gender:tuition + gender:education +
## gender:region + gender:hispanic + gender:afam + fcollege:income +
## mcollege:income + home:urban + home:unemp + home:distance +
## home:tuition + home:region + urban:unemp + unemp:hispanic +
## wage:region + wage:afam + distance:hispanic + distance:afam +
## tuition:region + education:region + income:hispanic
## Model 2: score ~ gender + fcollege + fcollege:mcollege + home + urban +
## wage + distance + I(distance^(0.25)) + tuition + education +
## income + hispanic + afam + I(distance^2) + I(tuition^2) +
## I(education^2) + gender:home + gender:wage + gender:education +
## gender:hispanic + gender:afam + fcollege:income + mcollege:income +
## home:urban + home:unemp + home:distance + unemp:distance +
## unemp:hispanic + wage:hispanic + wage:afam + distance:hispanic +
## distance:afam + tuition:education + income:hispanic + I(tuition^3) +
## log(tuition) + I(education^3) + log(education) + I(tuition^0.25) +
## I(education^0.25) + I(wage^2) + gender:I(wage^2) + gender:I(log(wage)) +
## hispanic:I(wage^2) + hispanic:I(log(wage)) + home:I(unemp^2) +
## home:I(unemp^0.25) + home:I(log(unemp)) + home:I(distance^2) +
## home:I(distance^0.25) + unemp:I(distance^2) + unemp:I(distance^0.25) +
## hispanic:I(distance^2) + hispanic:I(distance^0.25) + afam:I(distance^2) +
## afam:I(distance^0.25) + region
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4698 231051
## 2 4681 229734 17 1317.2 1.5788 0.06098 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here the p-value is 0.06098, so at a significance level of \(\alpha = 0.1\) the \(F\)-test prefers the larger model. Great! We’ve confirmed that large_param_model is indeed the preferred model of all three, and now we will move on to finalizing the model.
Before we finalize our model, we also need to make sure that we remove all of the unusual points in our data set. The function getUnusualPoints() below will help us streamline the process of finding these points.
# finds the unusual points (leverage, residual, influential) of a certain model
getUnusualPoints = function(model) {
# calculate leverage
num_high_leverage = sum(hatvalues(model) > 2 * mean(hatvalues(model)))
# calculate large residuals
num_large_residual = sum(abs(rstandard(model)) > 2)
# calculate influential
cooks_dist = cooks.distance(model)
num_influential = sum(cooks_dist > 4 / length(cooks_dist))
# return the values
return(list(leverage = num_high_leverage,
residual = num_large_residual,
influential = num_influential))
}
Let’s run the above function with our current best model
unusual_point_data = getUnusualPoints(large_param_model)
print(unusual_point_data)
## $leverage
## [1] 267
##
## $residual
## [1] 182
##
## $influential
## [1] 206
We find that we have 206 high influence points and 182 large residual points, so we will remove them and create our final model. We will repeat this process as long as the number of observations in our data set stays above the required minimum of 2000, or until there is nothing left to remove.
# initialize our final model and cleaned dataset
final_model = large_param_model
collegedistance_cleaned = collegedistance
while(nrow(collegedistance_cleaned) > 2000) {
# calculate influence indexes
infl_indexes = which(cooks.distance(final_model) > 4 /
                       length(cooks.distance(final_model)))
# calculate residual indexes
resid_indexes = which(abs(rstandard(final_model)) > 2)
# get all unique subset indexes
subset_indexes = unique(c(infl_indexes, resid_indexes))
# break cases
if (nrow(collegedistance_cleaned) - length(subset_indexes) < 2000) {
break
}
if (length(subset_indexes) == 0) {
break
}
# adjust the data set
collegedistance_cleaned = collegedistance_cleaned[-1 * subset_indexes,]
# recalculate the final model
final_model = lm(score ~ gender + fcollege + fcollege:mcollege + home +
urban + wage + distance + I(distance^(0.25)) +
tuition + education + income + hispanic + afam +
I(distance^2) + I(tuition^2) + I(education^2) +
gender:home + gender:wage + gender:education +
gender:hispanic + gender:afam + fcollege:income +
mcollege:income + home:urban + home:unemp +
home:distance + unemp:distance +
unemp:hispanic + wage:hispanic + wage:afam +
distance:hispanic + distance:afam +
tuition:education + income:hispanic + I(tuition^3) +
log(tuition) + I(education^3)+ log(education) +
I(tuition^0.25) + I(education^0.25) + I(wage^2) +
gender:I(wage^2) + gender:I(log(wage)) +
hispanic:I(wage^2) + hispanic:I(log(wage)) +
home:I(unemp^2) + home:I(unemp^0.25) +
home:I(log(unemp)) + home:I(distance^2) +
home:I(distance^0.25) + unemp:I(distance^2) +
unemp:I(distance^0.25) + hispanic:I(distance^2) +
hispanic:I(distance^0.25) + afam:I(distance^2) +
afam:I(distance^0.25) + region,
data = collegedistance_cleaned)
}
After performing this final step, we obtain the following test results:
final_eval_data = evaluate("Final Model", final_model)
print(final_eval_data)
## Model rmse adj_r2 aic bic coeff shapiro_dec bp_dec
## BP Final Model 3.235253 0.8361153 13925.24 14272.98 58 Reject Reject
We are left with a model with a LOOCV-RMSE of 3.2352525 and an adjusted \(R^2\) of 0.8361153. Removing these volatile entries from our data set greatly improved the fit of our model on the remaining data.
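For reference, the number of observations that survived the cleaning loop can be checked directly (the exact count is not reported here):
# observations remaining after the removal loop (must stay above 2000)
nrow(collegedistance_cleaned)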
We will now test the various diagnostics and assumptions of our final model
The helper functions below will help us create the visuals we need:
# create the fitted versus residuals scatter plot
createFittedResidualsPlot = function(model) {
plot(fitted(model), resid(model), col = "grey", pch = 20,
xlab = "Fitted", ylab = "Residuals",
main = "Fitted vs Residuals of Final Model")
abline(h = 0, col = "darkorange", lwd = 2)
}
# creates a histogram of residuals
createResidualHistogram = function(model) {
hist(resid(model),
xlab = "Residuals",
main = "Histogram of Residuals for Final Model",
col = "darkorange",
border = "dodgerblue",
breaks = 20)
}
# creates a Q-Q plot
createQQPlot = function(model) {
qqnorm(resid(model), main = "Normal Q-Q Plot", col = "darkgrey",
cex.main = 2, cex.axis = 2, cex.lab = 2)
qqline(resid(model), col = "dodgerblue", lwd = 2)
}
First, we will take a look at the residuals of our model
par(mfrow = c(1, 2))
createResidualHistogram(final_model)
createFittedResidualsPlot(final_model)
Our residuals have a fairly normal distribution, albeit slightly closer to a t-distribution, and there doesn’t seem to be any general trend in the scatter plot. We can double-check the constant-variance assumption with the Breusch-Pagan test.
# obtain our B-P test results
final_bp = get_bp(final_model)
print(final_bp)
## $decision
## BP
## "Reject"
##
## $stat
## BP
## 144.7343
##
## $pvalue
## BP
## 1.419751e-09
Unfortunately, when we run this test, we obtain a test statistic of 144.7342608, giving us a p-value of \(1.4197509 \times 10^{-9}\), so we reject the null hypothesis of homoscedasticity. It makes sense that the plot looked homoscedastic while the test detects heteroscedasticity, as we have already shown that our data is difficult to model.
Now we’ll take a look at the Q-Q plot for our final model
createQQPlot(final_model)
We notice that, visually, our Q-Q plot does not fit our data perfectly. We can verify that the normality assumption is indeed violated with the Shapiro-Wilk test.
final_shapiro = get_shapiro(final_model)
print(final_shapiro)
## $decision
## [1] "Reject"
##
## $stat
## W
## 0.9790732
##
## $pvalue
## [1] 1.957788e-19
This test gives us a test statistic of 0.9790732 and a p-value of \(1.9577882 \times 10^{-19}\), meaning that we reject \(H_0\). Hence, the normality assumption is violated.
Finally, we’ll look at the number of unusual points in our data
final_unusual = getUnusualPoints(final_model)
print(final_unusual)
## $leverage
## [1] 144
##
## $residual
## [1] 0
##
## $influential
## [1] 0
Our data has 144 points of high leverage, 0 points with large residuals, and 0 influential points. Even after intensive data cleaning, these results support our suspicion that the selected data is highly volatile.
We began with a dataset of 4739 observations of 15 variables. We first cleaned our data and checked for any missing values. Fortunately, we did not have any and moved on to creating additional indicator variables. Then we took a look at the correlation between our predictors and the response in order to narrow down our method of choice in the ‘Method Exploration’ section. We decided upon using backwards AIC as our method of choice to narrow down predictors and successfully ended up with a promising model (a decision reinforced by our ANOVA testing). To further improve upon our promising model, we applied a variety of response and predictor transformations and removed unusual points, ultimately giving us our “best” model.
After going through the process of choosing a “best” model, the model that performed the best was the polished large_param_model, which modeled score against a number of quadratic, interaction, and additive predictors chosen using backwards AIC. However, even after utilizing backwards AIC, the model still had a relatively low \(R^2\) value. This was significantly improved upon by removing the majority of the unusual points. With the final polished large_param_model, the relatively low cross-validated RMSE and high adjusted \(R^2\) suggest that the model does an excellent job of predicting test scores. Ultimately, we believe that we succeeded in building a model that accurately predicts high schoolers’ test scores based on a number of carefully chosen predictor variables.
The main purpose of this study was to see how different social and economic factors can affect a student’s ability to perform in school. In this section, we will interpret the coefficients of our model
First we’ll create the dataframe with the data we need
# initialize the dataframe
coef_df = data.frame(matrix(ncol = length(names(collegedistance)) - 1, nrow = 1))
score_index = which(names(collegedistance) == "score")
colnames(coef_df) = names(collegedistance)[-1 * score_index]
# iterate through each name
for (name in names(collegedistance)[-1 * score_index]) {
# get the sum of coefficients
coef_indexes = grepl(name, names(coef(final_model)))
coef_avg = mean(coef(final_model)[coef_indexes])
# add to the dataframe
coef_df[name] = coef_avg
}
# melt into our data
coef_data = melt(coef_df)
## No id variables; using all as measure variables
# sort by value
coef_data = coef_data[order(abs(coef_data$value)),]
Now we’ll present this data in a visual form.
createWeightPlot = function(start, end) {
barplot(height = coef_data[start:end,2],
names.arg = coef_data[start:end,1],
col = ifelse(coef_data[start:end,2] < 0, "tomato", "skyblue2"),
border = ifelse(coef_data[start:end,2] < 0, "darkred", "darkblue"),
xlab = "Variable Name",
ylab = "Average Weight",
main = "Average Weight of Variables",
cex.lab = 1.5,
cex.axis = 1.5,
cex.main = 1.5)
abline(h = 0)
}
par(mfrow = c(2, 2))
createWeightPlot(1, 6)
createWeightPlot(7, 9)
createWeightPlot(10, 13)
createWeightPlot(14, 14)
We can visually see the combined weight of every variable (the plot is split into four sections to account for the differing coefficient magnitudes). The blue bars represent positive average coefficients, while the red bars represent negative average coefficients. With this data, we can now interpret our variables.
Note: these values are cumulative average weights, so they are not completely precise, but this was the most straightforward way for us to interpret them. More information could be extracted by looking at each individual interaction.
- income: the average weight for income is essentially zero, meaning that, interestingly, coming from a high- or low-income family did not affect the test scores.
- distance: the closer the student’s four-year college was, the worse they would tend to do on the exam. This makes sense, as many of the colleges closest to the students would be smaller, less-attended colleges.
- mcollege: the results show that if the student’s mother is a college graduate, he or she would have more of an advantage when it comes to test score.
- fcollege: the same as mcollege, but to an even greater extent.
- urban: students that live in more urbanized areas are more likely to do worse on the standardized test.
- gender: on average, men tended to do worse on the exam.
- region: regions other than the West did better on the standardized test.
- afam: the African-American students did worse on the test overall compared to the non-Hispanic and non-African-American students, but this could be correlated with many different factors.
- hispanic: the Hispanic students did better on the test overall compared to the non-Hispanic and non-African-American students, but this could be correlated with many different factors.
- home: students whose families own their home tended to do worse on the standardized tests.
- wage: the higher the hourly wage in the student’s location, the better they would do on the exam. This makes sense, as a higher wage could result in better-funded education.
- unemp: the higher the unemployment rate in the county, the better the students would do on the exam. This result is pretty interesting, and given more time, we would have liked to investigate it further.
- tuition: the students paying more tuition had higher test scores on average. This makes sense for the same reasoning as wage.
- education: the students that pursued higher levels of education had much higher test scores. This is not surprising, as we expect students that pursued more education to be more interested in school, and as a result to do better on the exam.

Overall, many of our results matched our predictions, but it was interesting to see some outliers, such as unemp!
Now we can use our model to make some predictions
Suppose we have a Hispanic 12th-grade male with a wage of $11.50. He lives in the suburbs with his college-educated parents, who make $120k annually and own their home, which is 20 miles from M.I.T.; the average state tuition is valued at 0.2.
# create the test data
student1_data = data.frame(gender = "male", urban = "no", distance = 2,
tuition = 0.2, ethnicity = "hispanic",
unemp = 2.3, region = "other", wage = 11.50,
fcollege = "yes", mcollege = "yes",
income = "high", home = "yes", education = 12)
student1_data = cleanData(student1_data)
# make the prediction
student1_pred = predict(final_model, newdata = student1_data)
print(student1_pred)
## 1
## 68.97454
Student 1 is predicted to have a test score of 68.9745411.
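As an extension not in the original analysis, predict() can also return a 95% prediction interval to quantify the uncertainty around this point estimate:
# hypothetical extension: 95% prediction interval for student 1
predict(final_model, newdata = student1_data, interval = "prediction")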
Let’s test some of the other variables in the model with a different set of sample data.
Suppose we have an African-American female who goes to school in an urban area in the West, with a county unemployment rate of 20.3%. Her parents, neither of whom went to college, make $20k a year, and she’s in 10th grade. The average wage in her hometown is $7.50, and the closest college, at distance 0.2, has tuition valued at 2.
# create the test data
student2_data = data.frame(gender = "female", urban = "yes", distance = 0.2,
tuition = 2, ethnicity = "afam",
unemp = 20.3, region = "west", wage = 7.50,
fcollege = "no", mcollege = "no",
income = "low", home = "no", education = 10)
student2_data = cleanData(student2_data)
# make the prediction
student2_pred = predict(final_model, newdata = student2_data)
print(student2_pred)
## 1
## 57.02749
Student 2 is predicted to have a test score of 57.0274908.
Let’s visualize where these students would be in the sample of all test scores
hist(collegedistance$score,
breaks = 20,
main = "Score Distribution",
xlab = "Test Score",
col = "skyblue")
legend("topleft", c("Student 1", "Student 2"),
col = c("springgreen4", "firebrick"), lwd = 3)
abline(v = student1_pred, col = "springgreen4", lwd = 3)
abline(v = student2_pred, col = "firebrick", lwd = 3)
We outlined two very plausible scenarios, and it was interesting to see how those different parameters played out compared to the rest of the population of scores!
Below is the code we used to visualize possible interaction variables and their effect on score. The function createInteractionPlots() will save us from needing to repeat our code.
createInteractionPlots = function(interaction) {
par(mfrow = c(2, 2))
buildScatterPlot("Score", "Wage", "dodgerblue" , "cadetblue1",
unlist(collegedistance[interaction]))
buildScatterPlot("Score", "Unemp", "seagreen1", "palegreen4",
unlist(collegedistance[interaction]))
buildScatterPlot("Score", "Distance", "firebrick", "tomato",
unlist(collegedistance[interaction]))
buildScatterPlot("Score", "Tuition", "mediumpurple", "orchid2",
unlist(collegedistance[interaction]))
}
createInteractionPlots("gender")
createInteractionPlots("fcollege")
createInteractionPlots("mcollege")
createInteractionPlots("home")
createInteractionPlots("urban")
createInteractionPlots("income")
createInteractionPlots("region")
createInteractionPlots("hispanic")
createInteractionPlots("afam")
This project was created as a final project for STAT 420. We would like to give a special thank you to Professor David Unger and the STAT 420 grading team for making this semester possible, despite the challenges of being online
Authors