Classification problems occur often, perhaps even more often than regression problems. Consider the Cortez student maths attainment data discussed in previous posts. The response variable, the final grade of the year G3 (range 0-20), can be classified into a binary pass or fail variable called final, based on a threshold mark. We previously modelled this data with a decision tree, which gave 95% accuracy and had the benefit of interpretability. We will now model it using logistic regression so we can attach probabilities to our student pass or fail predictions.
Make the final grade binary (pass and fail)
G3 is approximately normally distributed, despite the dodgy tail. To simplify matters we converted G3 marks below 10 to a fail and marks of 10 or above to a pass. Often a school is judged by whether students meet a critical boundary; in the UK it is a C grade at GCSE, for example. Rather than modelling this response Y directly, logistic regression models the probability that Y belongs to a particular category.
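The recoding described above can be sketched as follows. The marks here are simulated stand-ins, not the Cortez data, so the counts are illustrative only.

```r
# Simulated stand-in for the G3 marks (an assumption, not the real data)
set.seed(1)
G3 <- pmin(pmax(round(rnorm(200, mean = 10.5, sd = 4)), 0), 20)

# Marks below 10 are a fail; marks of 10 or above are a pass
final <- factor(ifelse(G3 >= 10, "pass", "fail"), levels = c("fail", "pass"))

# Count how many students fall into each class
table(final)
```

Setting the levels explicitly makes fail the reference level, so a fitted model describes the probability of a pass.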
From our learnings of the decision tree we can include the variables that were shown to be important predictors in this multiple logistic regression.
- Using the training data estimate the regression coefficients using maximum likelihood.
- Use these coefficients to predict the test data and compare with reality.
- Evaluate the binary classifier with a receiver operating characteristic (ROC) curve.
- Evaluate the logistic regression performance with the resampling method cross-validation.
Training and test datasets.
We need to split the data so we can build the model and then test it, to see if it generalises well. The data arrived in a random order.
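A minimal sketch of one way to make such a split, assuming a hypothetical data frame called students. The test set size of 45 matches the number of test students mentioned later in the post; the total number of rows and the use of sample() are assumptions for illustration.

```r
# Hypothetical stand-in data frame; only the row count matters here
set.seed(42)
students <- data.frame(id = 1:200)
n <- nrow(students)

# Hold out 45 rows for testing and keep the rest for training
test_idx <- sample(n, size = 45)
train <- students[-test_idx, , drop = FALSE]
test <- students[test_idx, , drop = FALSE]

c(train = nrow(train), test = nrow(test))
```

If the rows really are in random order, taking the last 45 rows as the test set would be an equally simple alternative.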
Now we need to train the model using the training data. From our decision tree we know that the prior attainment variables (G2) are important, as are the reason variables. We fit a logistic regression model to predict final using these variables.
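The fitting step might look like the sketch below. The data are simulated (the pass probability is made to depend on G2 only), the level names for reason are assumed, and the formula is a reduced illustration rather than the exact model from the post.

```r
# Simulated training data (an assumption, not the Cortez data)
set.seed(1)
n <- 300
train <- data.frame(
  G2 = round(runif(n, 0, 20)),
  reason = factor(sample(c("course", "home", "other", "reputation"),
                         n, replace = TRUE))
)
p_pass <- plogis(-6 + 0.6 * train$G2)
train$final <- factor(ifelse(runif(n) < p_pass, "pass", "fail"),
                      levels = c("fail", "pass"))

# Logistic regression: coefficients are estimated by maximum likelihood
fit <- glm(final ~ G2 + reason, data = train, family = binomial)

# Estimates, standard errors, z values and p-values
summary(fit)$coefficients
```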
The model does appear to suffer from overdispersion. The p-values associated with reason are all non-significant. Following Crawley’s recommendation, we change the model family argument to family = quasibinomial and then attempt model simplification by removing this term from the model.
We use the more conservative “F-test” to compare models due to the quasibinomial error distribution, after Crawley.
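A sketch of this comparison on simulated data, in which reason has no real effect by construction. The variable names follow the post; the data and the reduced formula are assumptions.

```r
# Simulated training data (an assumption, not the Cortez data)
set.seed(1)
n <- 300
train <- data.frame(
  G2 = round(runif(n, 0, 20)),
  reason = factor(sample(c("course", "home", "other", "reputation"),
                         n, replace = TRUE))
)
train$final <- factor(
  ifelse(runif(n) < plogis(-6 + 0.6 * train$G2), "pass", "fail"),
  levels = c("fail", "pass")
)

# Refit with a quasibinomial family, then drop the reason term
fit_full <- glm(final ~ G2 + reason, data = train, family = quasibinomial)
fit_reduced <- update(fit_full, . ~ . - reason)

# F-test comparing the simpler model with the fuller one
cmp <- anova(fit_reduced, fit_full, test = "F")
cmp
```

A large p-value in the second row suggests the dropped term adds no explanatory power.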
There is no difference in explanatory power between the models, so there is no evidence that reason is associated with a student’s pass or fail in their end of year maths exam. We continue model simplification after inspecting summary() (not shown).
We don’t need the earlier G1 exam result as we have G2 in the model already. What happens if we remove Fjob? We lose explanatory power, so we need to keep Fjob in the model. This gives us our minimal adequate model.
Fjob is a useful predictor but perhaps we could reduce the number of levels by recoding the variable as only some of the jobs seem useful as predictors.
For a better understanding of how R dealt with the categorical variables, we can use the contrasts() function. This function will show us how the variables have been dummy coded by R and how to interpret them in a model. Note that by default R orders factor levels alphabetically.
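A small illustration of the dummy coding, using a toy factor rather than the real data:

```r
# A toy factor; levels are sorted alphabetically, so "fail" comes first
final <- factor(c("pass", "fail", "pass"))
levels(final)

# The reference level (fail) is coded 0 and pass is coded 1
coding <- contrasts(final)
coding
```

Because fail is the reference level, the model coefficients describe the log odds of a pass.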
The smallest p-value here is associated with G2. The positive coefficient for this predictor suggests that an increase in G2 is associated with an increase in the probability of final = pass. To be precise, a one-unit increase in G2 is associated with an increase in the log odds of pass of 2.0357671.
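To make the log odds easier to interpret, the coefficient can be exponentiated to give an odds ratio. The sketch below uses the coefficient reported above; the baseline probabilities are hypothetical values chosen for illustration.

```r
# Coefficient for G2 reported in the post (on the log-odds scale)
log_odds_G2 <- 2.0357671

# Multiplicative change in the odds of a pass per extra G2 mark
odds_ratio <- exp(log_odds_G2)
odds_ratio  # roughly 7.66

# The change in probability depends on where the student starts
baseline_p <- c(0.1, 0.5, 0.9)  # hypothetical starting probabilities
new_p <- plogis(qlogis(baseline_p) + log_odds_G2)
round(new_p, 3)
```

So each additional mark in G2 multiplies the odds of passing by roughly 7.7, holding the other predictors fixed.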
The first command predicts, for each student in the test set, the probability of a pass based on the glm() built using the training data. The second and third commands create a vector of 45 fails and then convert those with probabilities greater than 50% into pass. The predicted passes and failures are compared with the real ones in a table, giving a test error of 4.444%.
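The three commands described above might look like the sketch below. The data are simulated and the model is reduced to a single predictor, so the resulting error will not match the figure quoted in the post.

```r
# Simulated data and split (assumptions, not the Cortez data)
set.seed(1)
n <- 300
students <- data.frame(G2 = round(runif(n, 0, 20)))
students$final <- factor(
  ifelse(runif(n) < plogis(-6 + 0.6 * students$G2), "pass", "fail"),
  levels = c("fail", "pass")
)
test_idx <- sample(n, 45)
train <- students[-test_idx, ]
test <- students[test_idx, ]
fit <- glm(final ~ G2, data = train, family = binomial)

# 1. Predicted probability of a pass for each test student
probs <- predict(fit, newdata = test, type = "response")

# 2. Start with everyone as a fail; 3. switch to pass above 50%
pred <- rep("fail", nrow(test))
pred[probs > 0.5] <- "pass"

# Compare predictions with reality and compute the test error
confusion <- table(predicted = pred, actual = test$final)
test_error <- mean(pred != test$final)
confusion
test_error
```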
As a last step, we are going to plot the ROC curve and calculate the AUC (area under the curve) which are typical performance measurements for a binary classifier. The ROC is a curve generated by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings while the AUC is the area under the ROC curve. As a rule of thumb, a model with good predictive ability should have an AUC closer to 1 (1 is ideal) than to 0.5.
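A minimal base R sketch of these quantities. The function names roc_point and auc are hypothetical helpers written for this illustration, not part of any package, and the scores are toy values. The AUC is computed from ranks, which is equivalent to the area under the ROC curve.

```r
# One point on the ROC curve: FPR and TPR at a given threshold
roc_point <- function(scores, is_pass, threshold) {
  pred_pass <- scores > threshold
  c(fpr = sum(pred_pass & !is_pass) / sum(!is_pass),
    tpr = sum(pred_pass & is_pass) / sum(is_pass))
}

# AUC: the probability that a random pass is scored above a random fail
auc <- function(scores, is_pass) {
  r <- rank(scores)
  n_pos <- sum(is_pass)
  n_neg <- sum(!is_pass)
  (sum(r[is_pass]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example: predicted pass probabilities and true outcomes
scores <- c(0.1, 0.2, 0.8, 0.9)
is_pass <- c(FALSE, FALSE, TRUE, TRUE)

roc_point(scores, is_pass, threshold = 0.5)
auc(scores, is_pass)
```

In this toy example the scores separate the two classes perfectly, so the AUC is 1.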
The accuracy of about 0.95 on the test set is quite a good result, as is the AUC of 0.9884454. However, keep in mind that this result is somewhat dependent on the manual split of the data that I made earlier; if you wish for a more reliable estimate, you would be better off running some kind of cross-validation such as k-fold cross-validation. The logistic regression also provides coefficients, allowing a quantitative understanding of the association between a variable and the odds of success, which can be useful.
Leave-one-out cross-validation for Generalized Linear Models
As mentioned above, let’s conduct a cross-validation using the cv.glm() function from the boot package. This function calculates the estimated K-fold cross-validation prediction error for generalized linear models. We produce our model glm.fit based on our earlier learnings. We follow the guidance of the cross-validation lab session in Chapter 5.3.2 of James et al., 2014.
The cv.glm() function produces a list with several components. The two numbers in the delta vector contain the cross-validation results: the first is the raw estimate of prediction error and the second is a bias-adjusted version. Our cross-validation estimate for the test error is approximately 0.056.
The cv.glm() function can also be used to implement k-fold cross-validation. Below we use k = 10, a common choice for k, on our data.
On this data set, using this model, the two estimates are very close for leave-one-out cross-validation (K = n) and K = 10. The error estimates are small, suggesting the model may perform OK if applied to predict future student final pass or fail.
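Both cross-validation runs can be sketched as below. The data are simulated, the model is reduced to one predictor, and the fitted object is called fit_cv here (glm.fit in the post). The misclassification cost function is an assumption; by default cv.glm() uses average squared error instead.

```r
library(boot)

# Simulated data (an assumption, not the Cortez data); 1 = pass, 0 = fail
set.seed(1)
n <- 150
students <- data.frame(G2 = round(runif(n, 0, 20)))
students$final <- as.integer(runif(n) < plogis(-6 + 0.6 * students$G2))

fit_cv <- glm(final ~ G2, data = students, family = binomial)

# Cost: proportion of students misclassified at a 0.5 threshold
cost <- function(y, p) mean(abs(y - p) > 0.5)

# Leave-one-out cross-validation (K defaults to the number of rows)
cv_loo <- cv.glm(students, fit_cv, cost = cost)

# 10-fold cross-validation
cv_10 <- cv.glm(students, fit_cv, cost = cost, K = 10)

cv_loo$delta
cv_10$delta
```

The 10-fold estimate depends on the random assignment of rows to folds, so it changes slightly with the seed, whereas the leave-one-out estimate does not.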
- Cortez and Silva (2008). Using data mining to predict secondary school student performance.
- Crawley (2004). Statistics: an introduction using R.
- James et al. (2014). An introduction to statistical learning with applications in R. Springer.