Introduction

The purpose of our project is to predict the test scores of high schoolers in 1986 from a number of predictor variables, such as gender, ethnicity, and family income.

Our ultimate goal for this project is to reason through the different predictor variables to find the model that can best predict high school test scores.


Dataset

We will be using the CollegeDistance dataset from the AER package, which can be found here.

The data were obtained by the Department of Education and contain many different social and economic variables, including gender, ethnicity, whether or not the mother/father went to college, whether the family owns their home, the county unemployment rate, and more.

We first load CollegeDistance.csv into R along with some required packages (hidden):

collegedistance = read.csv("CollegeDistance.csv")
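
Alternatively, the same data ship with the AER package itself, so a short sketch like the one below (assuming AER is installed) would load them without the CSV. Note that this version lacks the extra index column the CSV adds, so the column removal step later on would be unnecessary:

# alternative: load the dataset directly from the AER package
# (assumes the AER package is installed)
library(AER)
data("CollegeDistance")
collegedistance = CollegeDistance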

This dataset has a total of 4739 observations of 15 variables (14 predictors and 1 response). These variables are:

  • gender: a factor indicating gender
  • ethnicity: factor indicating ethnicity (African-American, Hispanic or other)
  • score: base year composite test score
  • fcollege: factor. Is the father a college graduate?
  • mcollege: factor. Is the mother a college graduate?
  • home: factor. Does the family own their home?
  • urban: factor. Is the school in an urban area?
  • unemp: county unemployment rate in 1980
  • wage: state hourly wage in manufacturing in 1980
  • distance: distance from 4-year college (in 10 miles)
  • tuition: average state 4-year college tuition (in 1000 USD)
  • education: number of years of education
  • income: factor. Is the family income above 25,000 USD per year?
  • region: factor indicating region (West or other)

The dataset meets all of the set criteria for the project. Now let’s look for missing values as our next step in data cleaning.


Data Cleaning

Loading in the Dataset

We received a warning message when loading the data that there was an unnamed column. Taking a look, we saw that R had read the saved row indices in as an extra column (named X). We will get rid of it.

collegedistance = collegedistance[ , -1]
head(collegedistance)
##   gender ethnicity score fcollege mcollege home urban unemp wage distance
## 1   male     other 39.15      yes       no  yes   yes   6.2 8.09      0.2
## 2 female     other 48.87       no       no  yes   yes   6.2 8.09      0.2
## 3   male     other 48.74       no       no  yes   yes   6.2 8.09      0.2
## 4   male      afam 40.40       no       no  yes   yes   6.2 8.09      0.2
## 5 female     other 40.48       no       no   no   yes   5.6 8.09      0.4
## 6   male     other 54.71       no       no  yes   yes   5.6 8.09      0.4
##   tuition education income region
## 1 0.88915        12   high  other
## 2 0.88915        12    low  other
## 3 0.88915        12    low  other
## 4 0.88915        12    low  other
## 5 0.88915        13    low  other
## 6 0.88915        12    low  other

Next, we will see if there are any missing values within our dataset.

sum(is.na(collegedistance))
## [1] 0

Great! We see that there are no missing values in our dataset, so no additional work needs to be done there.
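
For completeness, a per-column breakdown (output omitted) would confirm where any missing values live:

# per-column count of missing values; every entry here is 0
colSums(is.na(collegedistance))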


Converting Factor Variables

We also notice that many of our variables are factor variables, so we need to convert them to binary values. These predictor variables will be converted as follows:

  • ethnicity: converted into two variables, hispanic and afam. A value of 1 means the student is Hispanic or African-American, respectively
  • gender: 1 for male, 0 for female
  • fcollege: 1 for yes, 0 for no
  • mcollege: 1 for yes, 0 for no
  • home: 1 for yes, 0 for no
  • urban: 1 for yes, 0 for no
  • income: 1 for high, 0 for low
  • region: 1 for west, 0 for other

We’ll use a function called cleanData() to modularize our code

cleanData = function(data) {
  # convert the ethnicity label and remove the old one
  data$hispanic = 1 * (data$ethnicity == "hispanic")
  data$afam = 1 * (data$ethnicity == "afam")
  data = data[-1 * which(names(data) == "ethnicity")]
  
  # convert the rest of the labels with automation
  convert_labels = c(c("fcollege", "yes"), c("mcollege", "yes"), c("home", "yes"),
                     c("urban", "yes"), c("income", "high"), c("region", "west"),
                     c("gender", "male"))
  
  # loop through each label
  for (label_index in seq(1, length(convert_labels), 2)) {
    
    # get the column name and positive label name
    col_name = convert_labels[label_index]
    positive_label = convert_labels[label_index + 1]
  
    # convert the label appropriately
    data[col_name] = 1 * (data[col_name] == positive_label)
  }
  
  # return the data
  return(data)
}

Now let’s take a look at the data with our adjusted variables

collegedistance = cleanData(collegedistance)
head(collegedistance)
##   gender score fcollege mcollege home urban unemp wage distance tuition
## 1      1 39.15        1        0    1     1   6.2 8.09      0.2 0.88915
## 2      0 48.87        0        0    1     1   6.2 8.09      0.2 0.88915
## 3      1 48.74        0        0    1     1   6.2 8.09      0.2 0.88915
## 4      1 40.40        0        0    1     1   6.2 8.09      0.2 0.88915
## 5      0 40.48        0        0    0     1   5.6 8.09      0.4 0.88915
## 6      1 54.71        0        0    1     1   5.6 8.09      0.4 0.88915
##   education income region hispanic afam
## 1        12      1      0        0    0
## 2        12      0      0        0    0
## 3        12      0      0        0    0
## 4        12      0      0        0    1
## 5        13      0      0        0    0
## 6        12      0      0        0    0

It appears that none of the other variables need to be changed
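
As a quick sanity check, the one-liner below confirms that every remaining column is now numeric:

# verify that every column is numeric after cleaning
all(sapply(collegedistance, is.numeric))
## [1] TRUE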


Identifying Correlation

Before we start creating our models, we’ll take a look at our variables to ensure that collinearity among the predictors won’t affect our model.

To do this, we’ll need a correlation matrix:

# get the correlation matrix
college_cor = round(cor(collegedistance), 2)

# remove the NA values
college_cor[which(is.na(college_cor))] = 0

# summary of the correlation
head(college_cor)
##          gender score fcollege mcollege  home urban unemp  wage distance
## gender     1.00  0.08     0.04     0.02  0.04  0.01 -0.03  0.03     0.00
## score      0.08  1.00     0.25     0.19  0.13 -0.09 -0.03  0.12    -0.07
## fcollege   0.04  0.25     1.00     0.43  0.08 -0.05 -0.10  0.03    -0.11
## mcollege   0.02  0.19     0.43     1.00  0.06 -0.03 -0.09  0.02    -0.08
## home       0.04  0.13     0.08     0.06  1.00 -0.10  0.01  0.07     0.02
## urban      0.01 -0.09    -0.05    -0.03 -0.10  1.00 -0.05 -0.03    -0.29
##          tuition education income region hispanic  afam
## gender      0.01      0.01   0.06  -0.01     0.02 -0.04
## score       0.13      0.47   0.18  -0.03    -0.16 -0.28
## fcollege    0.03      0.28   0.35   0.03    -0.11 -0.10
## mcollege    0.04      0.23   0.25  -0.01    -0.11 -0.02
## home        0.00      0.10   0.14   0.00    -0.06 -0.13
## urban      -0.02     -0.02  -0.07  -0.05     0.08  0.18

This table is a little difficult to analyze, so we’ll convert it to a visual heatmap
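
The chunk that builds the heatmap is hidden, but it has roughly the following shape (a sketch assuming ggplot2 and reshape2 are available; the object name ggheatmap matches the one printed below, but the hidden code may differ):

# a sketch of the hidden heatmap code, assuming ggplot2 and reshape2
library(ggplot2)
library(reshape2)
melted_cor = melt(college_cor)
ggheatmap = ggplot(melted_cor, aes(Var1, Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "dodgerblue", mid = "white", high = "darkorange",
                       midpoint = 0, limits = c(-1, 1)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))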

# print our heatmap (code hidden above)
ggheatmap

From the above heatmap, we see that there are only a few variables that could be problematic. To further validate our findings, we’ll explore the variance inflation factors for a simple additive model

simple_add = lm(score ~ ., data = collegedistance)
vif(simple_add)
##    gender  fcollege  mcollege      home     urban     unemp      wage  distance 
##  1.009046  1.399584  1.263883  1.051489  1.158750  1.264288  1.225818  1.256151 
##   tuition education    income    region  hispanic      afam 
##  1.869172  1.135972  1.214674  1.578257  1.236430  1.180550

We can see from this model that there are no values greater than 5, so there appears to be no problematic collinearity among our predictor variables.
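
As a one-line confirmation of that claim:

# count the predictors whose VIF exceeds the common threshold of 5
sum(vif(simple_add) > 5)
## [1] 0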


Variable Intuition

We will now do a brief summary of all of our variables, just to see if there are any more that we can remove.

colMeans(collegedistance)
##     gender      score   fcollege   mcollege       home      urban      unemp 
##  0.4513610 50.8890293  0.2080608  0.1373708  0.8202152  0.2329605  7.5972146 
##       wage   distance    tuition  education     income     region   hispanic 
##  9.5005065  1.8028698  0.8146082 13.8077654  0.2880355  0.1989871  0.1905465 
##       afam 
##  0.1658578

Since all of these arithmetic means look good (especially since none of the factor variables have a mean of 0 or 1), we can start to build our model


Method Exploration

Dependent Variable Transformation

First, we’ll take a look at our dependent variable, score, to see if any transformations are necessary. We expect to see a roughly normal distribution.

The function buildHistogram() will help us check this visually:

buildHistogram = function(values, title) {
  # standardize the values and create the histogram
  std_values = (values - mean(values)) / sd(values)
  freq_hist = hist(std_values,
                   col = "darkorange",
                   xlab = "Score",
                   main = title)
  
  # overlay the normal curve, scaled from density to counts
  # (for equal-width bins, counts = n * bin width * density)
  multiplier = length(values) * diff(freq_hist$breaks[1:2])
  curve(multiplier * dnorm(x), col = "dodgerblue", lwd = 2, add = TRUE)
}
buildHistogram(collegedistance$score, "Frequency of Normalized Score")

The above plot looks fairly normal, so we concluded that a dependent variable transformation was not needed in our model


Independent Variable Transformation

We’ll now check whether any independent variable transformations, specifically for our numeric variables, might be necessary, using scatter plots and our own statistical judgement as a starting point.

To start, we’ll create a helper function to visualize our variables plotted against score:

buildScatterPlot = function(dep_label, ind_label, color1, color2 = color1,
                            interaction = TRUE) {
  # when `interaction` is a 0/1 vector, points are colored color1/color2
  # by group; the default (TRUE) simply colors every point with color1
  plot(y = unlist(collegedistance[tolower(dep_label)]),
       x = unlist(collegedistance[tolower(ind_label)]),
       xlab = ind_label,
       ylab = dep_label,
       main = paste(dep_label, " vs ", ind_label),
       col = ifelse(interaction, color1, color2),
       pch = 20,
       cex = 1,
       cex.lab = 2,
       cex.axis = 2,
       cex.main = 2)
}

Now we’ll make the plots for our numeric variables

par(mfrow = c(2, 2))
buildScatterPlot("Score", "Wage", "dodgerblue")
buildScatterPlot("Score", "Unemp", "darkorange")
buildScatterPlot("Score", "Distance", "firebrick")
buildScatterPlot("Score", "Tuition", "mediumpurple")

There appear to be no transformations needed, and unfortunately our data look fairly random, possibly indicating that there is no strong trend between our variables.

We also created charts to analyze simple interaction variables, but we were unsuccessful in finding trends between these interactions and score. This code can be found in the appendix of our project, under Interaction Visualization. Hence, it’s difficult for us to visually extract trends, and we must now rely on statistical metrics to continue.


Assumption Functions

To make it easier for us to narrow down our options when selecting a model, we will check our assumptions using a variety of functions/tests.

# performs the Breusch-Pagan test (bptest() comes from the lmtest package)
get_bp = function(model, alpha = 0.01) {
  bptest_res = bptest(model)
  decision = bptest_res$p.value < alpha
  return(list(decision = ifelse(decision, "Reject", "Fail To Reject"),
              stat = bptest_res$statistic, pvalue = bptest_res$p.value))
}

# performs the Shapiro-Wilk test
get_shapiro = function(model, alpha = 0.01) {
  shapiro_res = shapiro.test(resid(model))
  decision = shapiro_res$p.value < alpha
  return(list(decision = ifelse(decision, "Reject", "Fail to Reject"),
              stat = shapiro_res$statistic, pvalue = shapiro_res$p.value))
}

# finds the LOOCV-RMSE of our model using the leverage (hat value) shortcut
get_rmse = function(model) {
  sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}

# combines the above helper methods into a readable format
# (rsq() comes from the rsq package)
evaluate = function(name, model, response_func = I, data = collegedistance) {
  set.seed(21)
  data.frame(Model = name,
             rmse = get_rmse(model),
             adj_r2 = rsq(model, adj = TRUE),
             aic = AIC(model),
             bic = BIC(model),
             coeff = length(coef(model)),
             shapiro_dec = get_shapiro(model)$decision,
             bp_dec = get_bp(model)$decision)
}
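
For reference, get_rmse() uses the standard leave-one-out shortcut for linear models,

\[ \text{LOOCV-RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \frac{e_i}{1 - h_{ii}} \right)^2}, \]

where \(e_i\) are the residuals and \(h_{ii}\) the leverages (hat values) of the fitted model.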


Model Building + Testing

Basic Model Building

Next, let’s start creating some basic models.

# Start with a model with all predictors.
full_score_model = lm(score ~ ., data = collegedistance)
full_score_eval = evaluate("All Additive Model", full_score_model)
full_score_eval
##                 Model     rmse   adj_r2      aic      bic coeff shapiro_dec
## BP All Additive Model 7.107605 0.334981 32038.51 32141.92    15      Reject
##    bp_dec
## BP Reject

Now that we have a baseline quality criterion, let’s compare this to a model using only the predictors that appeared most strongly related to score in the heatmap.

# Comparing with a model using selected predictors.
smaller_add_model = lm(score ~ education + fcollege + mcollege + wage + tuition +
                               gender + home + income, data = collegedistance)
smaller_add_eval = evaluate("Smaller Add Model", smaller_add_model)
smaller_add_eval
##                Model     rmse    adj_r2      aic      bic coeff shapiro_dec
## BP Smaller Add Model 7.499351 0.2586984 32547.14 32611.77     9      Reject
##    bp_dec
## BP Reject

We see that the additive model with chosen predictors has a lower adjusted \(R^2\) and a higher BIC, meaning the model wasn’t improved by removing those predictors. Let’s try comparing it to the interaction model.

full_int_mod = lm(score ~ .^2, data = collegedistance)
full_int_eval = evaluate("All Interaction Model", full_int_mod)
full_int_eval
##                    Model     rmse    adj_r2      aic      bic coeff shapiro_dec
## BP All Interaction Model 7.130664 0.3437096 32064.74 32749.88   106      Reject
##    bp_dec
## BP Reject

We see that the interaction model has a higher adjusted \(R^2\) than both the full and smaller models. Let’s use an ANOVA \(F\)-test to choose between the All Additive Model and All Interaction Model.

anova(full_score_model, full_int_mod)
## Analysis of Variance Table
## 
## Model 1: score ~ gender + fcollege + mcollege + home + urban + unemp + 
##     wage + distance + tuition + education + income + region + 
##     hispanic + afam
## Model 2: score ~ (gender + fcollege + mcollege + home + urban + unemp + 
##     wage + distance + tuition + education + income + region + 
##     hispanic + afam)^2
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   4724 237888                                  
## 2   4634 230293 90      7595 1.6981 5.067e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value is extremely low (\(5.07 \times 10^{-5}\)), so between the two, we choose the interaction model to begin narrowing down our predictors.


Information Criteria Selection

We can also see that the interaction model’s AIC is smaller than its BIC, so we will move forward with backward AIC to search for and narrow down parameters.

We begin with a backward parameter search:

full_int_mod = lm(score ~ . ^ 2, data = collegedistance)
distance_aic = step(full_int_mod, direction = "backward", trace = 0)
coef(distance_aic)
##        (Intercept)             gender           fcollege           mcollege 
##        27.51578466        -8.45366574         0.81384297         3.99964110 
##               home              urban              unemp               wage 
##         0.18519034        -2.03672698        -0.06018998        -0.10246390 
##           distance            tuition          education             income 
##        -0.44747309         1.37291771         1.80866472        -0.97737050 
##             region           hispanic               afam        gender:home 
##         2.97861345        -5.12992820       -13.38382092        -0.86720151 
##       gender:urban        gender:wage   gender:education    gender:hispanic 
##         0.77074554         0.38898937         0.43096415         1.14001836 
##        gender:afam    fcollege:income mcollege:education    mcollege:income 
##         1.27223670         0.84145397        -0.25943771         1.72731246 
##         home:urban         home:unemp      home:distance       home:tuition 
##         1.48638416        -0.19985687         0.18255690         1.69659493 
##        home:region     unemp:distance        wage:region          wage:afam 
##         1.36196167         0.03505865         0.42560855         0.76479061 
##  distance:hispanic      distance:afam     tuition:region   tuition:hispanic 
##        -0.18692067        -0.34256326        -5.12187834         1.54255097 
##   education:region    income:hispanic 
##        -0.35646928         1.40776578

Now that we have our significant coefficients, we can attempt to create a larger model, and run AIC once again

distance_larger = lm(score ~ .^2 + I(gender^2) + I(fcollege^2) + I(mcollege^2) +
                             I(home^2) + I(urban^2) + I(unemp^2) + I(wage^2) +
                             I(distance^2) + I(tuition^2) + I(education^2) +
                             I(hispanic^2) + I(afam^2), 
                     data = collegedistance)
distance_aic2 = step(distance_larger, direction = "backward", trace = 0)
coef(distance_aic2)
##       (Intercept)            gender          fcollege          mcollege 
##       -6.92257098       -8.81842187        0.75287995        0.24920943 
##              home             urban             unemp              wage 
##        0.16661740       -0.70409981        0.03838828       -0.08707851 
##          distance           tuition         education            income 
##       -0.41932121        1.02692976        6.77157103       -0.99153378 
##            region          hispanic              afam     I(distance^2) 
##        1.36351911       -5.84466691      -13.55114414        0.02622444 
##    I(education^2)       gender:home      gender:urban       gender:wage 
##       -0.17663289       -0.79186265        0.76748589        0.28024332 
##    gender:tuition  gender:education     gender:region   gender:hispanic 
##        1.25447229        0.43641796        1.05006853        1.25172546 
##       gender:afam   fcollege:income   mcollege:income        home:urban 
##        1.31295054        0.89990994        1.65491986        1.45738018 
##        home:unemp     home:distance      home:tuition       home:region 
##       -0.20758448        0.22831047        1.64485020        1.29236644 
##       urban:unemp    unemp:hispanic       wage:region         wage:afam 
##       -0.19133902        0.22898793        0.54209956        0.76591562 
## distance:hispanic     distance:afam    tuition:region  education:region 
##       -0.26260115       -0.33830105       -5.28140769       -0.35341008 
##   income:hispanic 
##        1.55536706

Now we’ll evaluate our first AIC model, which has 38 coefficients

evaluate("All Interactions", distance_aic)
##               Model     rmse    adj_r2      aic      bic coeff shapiro_dec
## BP All Interactions 7.055426 0.3478309 31968.91 32220.99    38      Reject
##    bp_dec
## BP Reject

And our second AIC model, which has 41 coefficients

evaluate("Some interactions", distance_aic2)
##                Model     rmse    adj_r2      aic      bic coeff shapiro_dec
## BP Some interactions 7.042325 0.3505196 31952.31 32223.78    41      Reject
##    bp_dec
## BP Reject


Creating a Larger Model

By looking at the parameters returned by backwards AIC, we can narrow them down to create a ‘better’ model

Note: This model does have too many parameters, but we will factor in the number of parameters used when looking at metrics like Adjusted R Squared and ANOVA test statistics

# create the model with large parameters
large_param_model = lm(score ~ gender + fcollege + fcollege:mcollege + home + 
                               urban + wage + distance + I(distance^(0.25)) +
                               tuition + education + income + hispanic + afam +
                               I(distance^2) + I(tuition^2) + I(education^2) +
                               gender:home + gender:wage + gender:education +
                               gender:hispanic + gender:afam + fcollege:income +
                               mcollege:income + home:urban + home:unemp +
                               home:distance + unemp:distance +
                               unemp:hispanic + wage:hispanic + wage:afam +
                               distance:hispanic + distance:afam + 
                               tuition:education + income:hispanic + I(tuition^3) +
                               log(tuition) + I(education^3)+ log(education) +
                               I(tuition^0.25) + I(education^0.25) + I(wage^2) +
                               gender:I(wage^2) + gender:I(log(wage)) +
                               hispanic:I(wage^2) + hispanic:I(log(wage)) +
                               home:I(unemp^2) + home:I(unemp^0.25) +
                               home:I(log(unemp)) + home:I(distance^2) +
                               home:I(distance^0.25) +  unemp:I(distance^2) +
                               unemp:I(distance^0.25) + hispanic:I(distance^2) +
                               hispanic:I(distance^0.25) + afam:I(distance^2) +
                               afam:I(distance^0.25) + region, 
                       data = collegedistance)

# call the helper function
evaluate("Large Parameter Model", large_param_model)
##                    Model     rmse   adj_r2      aic      bic coeff shapiro_dec
## BP Large Parameter Model 7.045638 0.351877 31959.21 32340.56    58      Reject
##    bp_dec
## BP Reject

By running this model through our quality criterion function, we see that our LOOCV-RMSE and adjusted \(R^2\) improve on the first backward-AIC model and are comparable to the second, although its BIC is noticeably worse. However, the assumption tests still fail.

Finally, we will perform ANOVA tests on all three of our contenders to erase any doubt when choosing the best model to move forward with.

anova(distance_aic, distance_aic2, test = "F")
## Analysis of Variance Table
## 
## Model 1: score ~ gender + fcollege + mcollege + home + urban + unemp + 
##     wage + distance + tuition + education + income + region + 
##     hispanic + afam + gender:home + gender:urban + gender:wage + 
##     gender:education + gender:hispanic + gender:afam + fcollege:income + 
##     mcollege:education + mcollege:income + home:urban + home:unemp + 
##     home:distance + home:tuition + home:region + unemp:distance + 
##     wage:region + wage:afam + distance:hispanic + distance:afam + 
##     tuition:region + tuition:hispanic + education:region + income:hispanic
## Model 2: score ~ gender + fcollege + mcollege + home + urban + unemp + 
##     wage + distance + tuition + education + income + region + 
##     hispanic + afam + I(distance^2) + I(education^2) + gender:home + 
##     gender:urban + gender:wage + gender:tuition + gender:education + 
##     gender:region + gender:hispanic + gender:afam + fcollege:income + 
##     mcollege:income + home:urban + home:unemp + home:distance + 
##     home:tuition + home:region + urban:unemp + unemp:hispanic + 
##     wage:region + wage:afam + distance:hispanic + distance:afam + 
##     tuition:region + education:region + income:hispanic
##   Res.Df    RSS Df Sum of Sq     F   Pr(>F)    
## 1   4701 232156                                
## 2   4698 231051  3    1104.7 7.487 5.36e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • \(H_0\): The smaller model (distance_aic) fits as well as the larger model; the extra terms contribute no significant reduction in RSS.
  • The test statistic: 7.487
  • The distribution of the test statistic under the null hypothesis: F-distribution with 3 and 4698 DF
  • The p-value: \(5.36 \times 10^{-5}\)
  • A decision: We reject the null hypothesis at \(\alpha = 0.10\)
  • Preferred model: distance_aic2
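
As a sanity check, the reported p-value can be recomputed directly from the F distribution, using the degrees of freedom from the table above:

# recompute the p-value of the ANOVA comparison by hand
pf(7.487, df1 = 3, df2 = 4698, lower.tail = FALSE)  # approximately 5.36e-05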

Now, let’s compare distance_aic2 with large_param_model.

anova(distance_aic2, large_param_model, test = "F")
## Analysis of Variance Table
## 
## Model 1: score ~ gender + fcollege + mcollege + home + urban + unemp + 
##     wage + distance + tuition + education + income + region + 
##     hispanic + afam + I(distance^2) + I(education^2) + gender:home + 
##     gender:urban + gender:wage + gender:tuition + gender:education + 
##     gender:region + gender:hispanic + gender:afam + fcollege:income + 
##     mcollege:income + home:urban + home:unemp + home:distance + 
##     home:tuition + home:region + urban:unemp + unemp:hispanic + 
##     wage:region + wage:afam + distance:hispanic + distance:afam + 
##     tuition:region + education:region + income:hispanic
## Model 2: score ~ gender + fcollege + fcollege:mcollege + home + urban + 
##     wage + distance + I(distance^(0.25)) + tuition + education + 
##     income + hispanic + afam + I(distance^2) + I(tuition^2) + 
##     I(education^2) + gender:home + gender:wage + gender:education + 
##     gender:hispanic + gender:afam + fcollege:income + mcollege:income + 
##     home:urban + home:unemp + home:distance + unemp:distance + 
##     unemp:hispanic + wage:hispanic + wage:afam + distance:hispanic + 
##     distance:afam + tuition:education + income:hispanic + I(tuition^3) + 
##     log(tuition) + I(education^3) + log(education) + I(tuition^0.25) + 
##     I(education^0.25) + I(wage^2) + gender:I(wage^2) + gender:I(log(wage)) + 
##     hispanic:I(wage^2) + hispanic:I(log(wage)) + home:I(unemp^2) + 
##     home:I(unemp^0.25) + home:I(log(unemp)) + home:I(distance^2) + 
##     home:I(distance^0.25) + unemp:I(distance^2) + unemp:I(distance^0.25) + 
##     hispanic:I(distance^2) + hispanic:I(distance^0.25) + afam:I(distance^2) + 
##     afam:I(distance^0.25) + region
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1   4698 231051                              
## 2   4681 229734 17    1317.2 1.5788 0.06098 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • \(H_0\): The smaller model (distance_aic2) fits as well as the larger model; the extra terms contribute no significant reduction in RSS.
  • The test statistic: 1.5788
  • The distribution of the test statistic under the null hypothesis: F-distribution with 17 and 4681 DF
  • The p-value: 0.0610
  • A decision: We reject the null hypothesis at \(\alpha = 0.10\)
  • Preferred model: large_param_model

Great! We’ve confirmed that large_param_model is indeed the preferred model of all three, and now we will move on to finalizing the model.


Finalizing the Model

Before we finalize our model, we also need to make sure that we remove the unusual points in our data set.

The function getUnusualPoints() below will help us streamline the process of finding these points.

# finds the unusual points (leverage, residual, influential) of a certain model
getUnusualPoints = function(model) {
  # calculate leverage
  num_high_leverage = sum(hatvalues(model) > 2 * mean(hatvalues(model)))
  
  # calculate large residuals
  num_large_residual = sum(abs(rstandard(model)) > 2)
  
  # calculate influential
  cooks_dist = cooks.distance(model)
  num_influential = sum(cooks_dist > 4 / length(cooks_dist))
  
  # return the values
  return(list(leverage = num_high_leverage,
              residual = num_large_residual,
              influential = num_influential))
}

Let’s run the above function with our current best model

unusual_point_data = getUnusualPoints(large_param_model)
print(unusual_point_data)
## $leverage
## [1] 267
## 
## $residual
## [1] 182
## 
## $influential
## [1] 206

We find that we have 206 high-influence points and 182 large-residual points, so we will remove them and refit our model.

We will repeat this process until removing more points would take the number of observations in our data set below the required 2000, or until there is nothing left to remove.

# initialize our final model and cleaned dataset
final_model = large_param_model
collegedistance_cleaned = collegedistance

while(nrow(collegedistance_cleaned) > 2000) {
  
  # calculate influence indexes
  infl_indexes = which(cooks.distance(final_model) > 4 /
                         length(cooks.distance((final_model))))
  
  # calculate residual indexes
  resid_indexes = which(abs(rstandard(final_model)) > 2)
  
  # get all unique subset indexes
  subset_indexes = unique(c(infl_indexes, resid_indexes))
  
  # break cases
  if (nrow(collegedistance_cleaned) - length(subset_indexes) < 2000) {
    break
  }
  if (length(subset_indexes) == 0) {
    break
  }
  
  # adjust the data set
  collegedistance_cleaned = collegedistance_cleaned[-1 * subset_indexes,]
  
  # recalculate the final model
  final_model = lm(score ~ gender + fcollege + fcollege:mcollege + home + 
                               urban + wage + distance + I(distance^(0.25)) +
                               tuition + education + income + hispanic + afam +
                               I(distance^2) + I(tuition^2) + I(education^2) +
                               gender:home + gender:wage + gender:education +
                               gender:hispanic + gender:afam + fcollege:income +
                               mcollege:income + home:urban + home:unemp +
                               home:distance + unemp:distance +
                               unemp:hispanic + wage:hispanic + wage:afam +
                               distance:hispanic + distance:afam + 
                               tuition:education + income:hispanic + I(tuition^3) +
                               log(tuition) + I(education^3)+ log(education) +
                               I(tuition^0.25) + I(education^0.25) + I(wage^2) +
                               gender:I(wage^2) + gender:I(log(wage)) +
                               hispanic:I(wage^2) + hispanic:I(log(wage)) +
                               home:I(unemp^2) + home:I(unemp^0.25) +
                               home:I(log(unemp)) + home:I(distance^2) +
                               home:I(distance^0.25) +  unemp:I(distance^2) +
                               unemp:I(distance^0.25) + hispanic:I(distance^2) +
                               hispanic:I(distance^0.25) + afam:I(distance^2) +
                               afam:I(distance^0.25) + region, 
                       data = collegedistance_cleaned)
}

After performing this final step, we obtain the following test results:

final_eval_data = evaluate("Final Model", final_model)
print(final_eval_data)
##          Model     rmse    adj_r2      aic      bic coeff shapiro_dec bp_dec
## BP Final Model 3.235253 0.8361153 13925.24 14272.98    58      Reject Reject

We are left with a model with a LOOCV-RMSE of 3.24 and an adjusted \(R^2\) of 0.836. Removing these volatile entries from our data set greatly improved the fit of our model.


Model Results

Model Diagnostics

We will now test the various diagnostics and assumptions of our final model

The helper functions below will help us create the visuals we need:

# create the fitted versus residuals scatter plot
createFittedResidualsPlot = function(model) {
  plot(fitted(model), resid(model), col = "grey", pch = 20,
       xlab = "Fitted", ylab = "Residuals", 
       main = "Fitted vs Residuals of Final Model")
  abline(h = 0, col = "darkorange", lwd = 2)
}

# creates a histogram of residuals
createResidualHistogram = function(model) {
  hist(resid(model),
     xlab   = "Residuals",
     main   = "Histogram of Residuals for Final Model",
     col    = "darkorange",
     border = "dodgerblue",
     breaks = 20)
}

# creates a Q-Q plot
createQQPlot = function(model) {
  qqnorm(resid(model), main = "Normal Q-Q Plot", col = "darkgrey", 
         cex.main = 2, cex.axis = 2, cex.lab = 2)
  qqline(resid(model), col = "dodgerblue", lwd = 2)
}

First, we will take a look at the residuals of our model

par(mfrow = c(1, 2))
createResidualHistogram(final_model)
createFittedResidualsPlot(final_model)

Our residuals have a fairly normal distribution, albeit with slightly heavier tails than normal (closer to a t-distribution), and there doesn’t seem to be any general trend in the fitted-versus-residuals plot.

We can double-check the constant-variance assumption with the Breusch-Pagan test.

# obtain our B-P test results
final_bp = get_bp(final_model)
print(final_bp)
## $decision
##       BP 
## "Reject" 
## 
## $stat
##       BP 
## 144.7343 
## 
## $pvalue
##           BP 
## 1.419751e-09

Unfortunately, when we run this test, we obtain a test statistic of 144.73, giving us a p-value of \(1.42 \times 10^{-9}\), so we reject the null hypothesis of homoscedasticity. It makes sense that the plot looks roughly homoscedastic while the test detects heteroscedasticity: with a sample this large the test is very sensitive, and we have already seen that our data are difficult to model.
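
For intuition, the (studentized) Breusch-Pagan statistic is \(n R^2\) from an auxiliary regression of the squared residuals on the predictors. A manual sketch (our own construction, separate from the get_bp() helper above):

# manual sketch of the studentized Breusch-Pagan statistic:
# n * R^2 from regressing the squared residuals on the model matrix
aux_fit = lm(resid(final_model) ^ 2 ~ model.matrix(final_model))
nobs(final_model) * summary(aux_fit)$r.squared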


Now we’ll take a look at the Q-Q plot for our final model

createQQPlot(final_model)

We notice that, visually, the tails of our Q-Q plot stray from the reference line. We can verify that the normality assumption is indeed violated with the Shapiro-Wilk test.

final_shapiro = get_shapiro(final_model)
print(final_shapiro)
## $decision
## [1] "Reject"
## 
## $stat
##         W 
## 0.9790732 
## 
## $pvalue
## [1] 1.957788e-19

This test gives us a test statistic of \(W = 0.9791\) and a p-value of \(1.96 \times 10^{-19}\), meaning that we reject \(H_0\). Hence, the normality assumption is violated.


Finally, we’ll look at the number of unusual points in our data

final_unusual = getUnusualPoints(final_model)
print(final_unusual)
## $leverage
## [1] 144
## 
## $residual
## [1] 0
## 
## $influential
## [1] 0

Our data have 144 high-leverage points, 0 points with large residuals, and 0 influential points. Even after intensive data cleaning, these results support our suspicion that the data are highly volatile.


Discussion

Process of choosing “Best” Model

We began with a dataset of 4739 observations of 15 variables. We first cleaned our data and checked for any missing values. Fortunately, we did not have any, and we moved on to creating additional factor variables. We then took a look at the correlation between our predictors and the response in order to narrow down our method of choice in the ‘Method Exploration’ section. We decided upon backward AIC as our method of choice to narrow down predictors and successfully ended up with a promising model (a decision reinforced by our ANOVA testing). To further improve upon this promising model, we applied a variety of response and predictor transformations and removed unusual points, ultimately giving us our “best” model.


Practicality of “Best” Model

After going through the process of choosing a “best” model, the model that performed the best was the polished large_param_model, which modeled score against a number of polynomial, interaction, and additive predictors chosen using backward AIC. However, even after utilizing backward AIC, the model still had a relatively low \(R^2\) value. This was significantly improved by removing the majority of the unusual points. With the final polished large_param_model, the relatively low cross-validated RMSE and high adjusted \(R^2\) suggest that the model does an excellent job of predicting test scores. Ultimately, we believe that we succeeded in building a model that is accurate in predicting high schoolers’ test scores based on a number of carefully chosen predictor variables.


Discussion of Coefficients

The main purpose of this study was to see how different social and economic factors can affect a student’s ability to perform in school. In this section, we will interpret the coefficients of our model

First we’ll create the dataframe with the data we need

# initialize the dataframe
coef_df = data.frame(matrix(ncol = length(names(collegedistance)) - 1, nrow = 1))
score_index = which(names(collegedistance) == "score")
colnames(coef_df) = names(collegedistance)[-1 * score_index]

# iterate through each name
for (name in names(collegedistance)[-1 * score_index]) {
  
  # average every coefficient whose name mentions this variable
  # (grepl also picks up interaction and transformed terms)
  coef_indexes = grepl(name, names(coef(final_model)))
  coef_avg = mean(coef(final_model)[coef_indexes])
  
  # add to the dataframe
  coef_df[name] = coef_avg
}

# melt into long format (melt() comes from the reshape2 package)
coef_data = melt(coef_df)
## No id variables; using all as measure variables
# sort by value
coef_data = coef_data[order(abs(coef_data$value)),]

Now we’ll present this data in a visual form.

createWeightPlot = function(start, end) {
  barplot(height = coef_data[start:end,2],
        names.arg = coef_data[start:end,1],
        col = ifelse(coef_data[start:end,2] < 0, "tomato", "skyblue2"),
        border = ifelse(coef_data[start:end,2] < 0, "darkred", "darkblue"),
        xlab = "Variable Name",
        ylab = "Average Weight",
        main = "Average Weight of Variables",
        cex.lab = 1.5,
        cex.axis = 1.5,
        cex.main = 1.5)
  abline(h = 0)
}
par(mfrow = c(2, 2))
createWeightPlot(1, 6)
createWeightPlot(7, 9)
createWeightPlot(10, 13)
createWeightPlot(14, 14)

We can visually see the combined weight of every variable (the plot is split into 4 panels to account for the differing magnitudes of the coefficient weights). The blue bars represent positive average coefficients, while the red bars represent negative average coefficients. With this data, we can now interpret our variables:

Note: These values are cumulative average weights, so they are not completely precise, but this was the easiest way for us to interpret them. More information could be extracted by looking at each individual interaction, as the sketch below shows.
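
For example, the individual terms involving a single variable can be pulled out directly:

# example: every fitted coefficient whose name involves distance
coef(final_model)[grepl("distance", names(coef(final_model)))]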

  • income: The average weight for income is essentially zero, meaning that, interestingly, coming from a high- or low-income family did not affect the predicted test score.

  • distance: The closer the student’s four-year college was, the worse they tended to do on the exam. This makes sense, as many of the colleges closest to a student would be smaller, less-attended colleges.

  • mcollege: The results show that if the student’s mother is a college graduate, he or she would have more of an advantage when it comes to test score.

  • fcollege: The same as mcollege, but to an even greater extent.

  • urban: Students that live in more urbanized areas are more likely to do worse on the standardized test.

  • gender: On average, men tended to do worse on the exam.

  • region: The regions other than the West did better on the standardized test.

  • afam: The African-American students did worse on the test overall compared to the non-Hispanic and non-African-American students, but this could be correlated with many different factors.

  • hispanic: The Hispanic students did better on the test overall compared to the non-Hispanic and non-African-American students, but this could be correlated with many different factors.

  • home: Students whose families own their home tended to do worse on the standardized tests.

  • wage: The higher the hourly manufacturing wage in the student’s state, the better they would do on the exam. This makes sense, as a higher wage could result in better-funded education.

  • unemp: The higher the unemployment rate in the county, the better the students would do on the exam. This result is pretty interesting, and given the time, we would have liked to investigate it further.

  • tuition: The students paying more tuition had higher test scores on average. This makes sense for the same reasoning as wage.

  • education: The students that pursued higher levels of education had much higher test scores. This is not surprising, as we expect students that pursued more education to be more interested in school and, as a result, to have done better on the exam.

Overall, many of our results matched our predictions, but it was interesting to see some outliers, such as unemp!


Sample Predictions

Now we can use our model to make some predictions


Student 1 Prediction

Suppose we have a Hispanic 12th-grade male in a state with an hourly manufacturing wage of $11.50 and a county unemployment rate of 2.3%. He lives in the suburbs with his college-educated parents, who make 120k annually and own their home, which is 20 miles from M.I.T. (distance = 2), where average state tuition is valued at 0.2.

# create the test data
student1_data = data.frame(gender = "male", urban = "no", distance = 2, 
                           tuition = 0.2, ethnicity = "hispanic", 
                           unemp = 2.3, region = "other", wage = 11.50, 
                           fcollege = "yes", mcollege = "yes", 
                           income = "high", home = "yes", education = 12)
student1_data = cleanData(student1_data)

# make the prediction
student1_pred = predict(final_model, newdata = student1_data)
print(student1_pred)
##        1 
## 68.97454

Student 1 is predicted to have a test score of about 68.97.


Student 2 Prediction

Let’s test some of the other variables in the model with a different set of sample data.

Suppose we have an African-American female who goes to school in an urban area in the West, where the county unemployment rate is 20.3%. Her parents, neither of whom went to college, make 20k a year, and she is in 10th grade. The average manufacturing wage in her state is 7.50, and the closest college, at distance 0.2 (2 miles), has tuition valued at 2.

# create the test data
student2_data = data.frame(gender = "female", urban = "yes", distance = 0.2, 
                           tuition = 2, ethnicity = "afam", 
                           unemp = 20.3, region = "west", wage = 7.50, 
                           fcollege = "no", mcollege = "no", 
                           income = "low", home = "no", education = 10)
student2_data = cleanData(student2_data)

# make the prediction
student2_pred = predict(final_model, newdata = student2_data)
print(student2_pred)
##        1 
## 57.02749

Student 2 is predicted to have a test score of about 57.03.


Visualization

Let’s visualize where these students would be in the sample of all test scores

hist(collegedistance$score,
     breaks = 20,
     main = "Score Distribution",
     xlab = "Test Score",
     col = "skyblue")

legend("topleft", c("Student 1", "Student 2"), 
       col = c("springgreen4", "firebrick"), lwd = 3)


abline(v = student1_pred, col = "springgreen4", lwd = 3)
abline(v = student2_pred, col = "firebrick", lwd = 3)

We outlined two very plausible scenarios, and it was interesting to see how those different parameters played out compared to the rest of the population of scores!


Appendix

Interaction Visualization

Below is the code we used to visualize possible interaction variables and their effect on score. The function createInteractionPlots() will save us from needing to repeat our code

createInteractionPlots = function(interaction) {
    par(mfrow = c(2, 2))
    buildScatterPlot("Score", "Wage", "dodgerblue" , "cadetblue1",
                     unlist(collegedistance[interaction]))
    buildScatterPlot("Score", "Unemp", "seagreen1", "palegreen4",
                     unlist(collegedistance[interaction]))
    buildScatterPlot("Score", "Distance", "firebrick", "tomato",
                     unlist(collegedistance[interaction]))
    buildScatterPlot("Score", "Tuition", "mediumpurple", "orchid2",
                     unlist(collegedistance[interaction]))
}


Gender

createInteractionPlots("gender")


Father Went to College?

createInteractionPlots("fcollege")


Mother Went to College?

createInteractionPlots("mcollege")


Home

createInteractionPlots("home")


Urban

createInteractionPlots("urban")


Income

createInteractionPlots("income")


Region

createInteractionPlots("region")


Hispanic

createInteractionPlots("hispanic")


African American

createInteractionPlots("afam")


Acknowledgements

This project was created as a final project for STAT 420. We would like to give a special thank you to Professor David Unger and the STAT 420 grading team for making this semester possible, despite the challenges of being online

Authors

  • Allison Zhang
  • Kobe Dela Cruz
  • Nishant Balepur