# Logistic regression for student performance prediction

## Introduction

Classification problems occur often, perhaps even more often than regression problems. Consider the Cortez student maths attainment data discussed in previous posts. The response variable G3, the final grade of the year (range 0-20), can be recoded as a binary pass or fail variable called final, based on a threshold mark. We previously modelled this data with a decision tree, which gave 95% accuracy and had the benefit of interpretability. We will now model it using logistic regression so that we can attach probabilities to our student pass or fail predictions.

## Make the final grade binary (pass and fail)

G3 is pretty normally distributed, despite the dodgy tail. To simplify matters we converted G3 marks below 10 to a fail and marks of 10 or above to a pass. Often a school is judged by whether students meet a critical boundary; in the UK it is a C grade at GCSE, for example. Rather than modelling the response Y directly, logistic regression models the probability that Y belongs to a particular category.
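The thresholding step can be sketched in a line or two of R. The marks below are made up for illustration, and the factor level order is chosen here so that fail is the baseline.

```r
# Hypothetical G3 marks (0-20), for illustration only; the real values
# come from the Cortez data.
G3 <- c(0, 5, 9, 10, 11, 15, 20)

# Marks of 10 or more are a pass, anything lower is a fail.
# Listing "fail" first makes it the reference level of the factor.
final <- factor(ifelse(G3 >= 10, "pass", "fail"), levels = c("fail", "pass"))

table(final)
```

Note that a mark of exactly 10 lands in the pass group, matching the rule described above.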

Drawing on what we learned from the decision tree, we can include the variables that were shown to be important predictors in this multiple logistic regression.

## Objective

• Using the training data, estimate the regression coefficients by maximum likelihood.
• Use these coefficients to predict the test data and compare with reality.
• Evaluate the binary classifier with a receiver operating characteristic (ROC) curve.
• Evaluate the logistic regression performance with the resampling method cross-validation.

## Training and test datasets

We need to split the data so we can build the model on one part and then test it on the other, to see if it generalises well. The data arrived in a random order, so a split by row position should not introduce any ordering bias.
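A minimal sketch of such a split is shown below. It assumes the last 45 rows are held back for testing (45 is the test-set size used later in this post); the 395-row stand-in data frame and the names students, train and test are assumptions for illustration, not the original code.

```r
set.seed(42)

# Stand-in for the student data: 395 rows is an assumed size, and the
# columns are placeholders rather than the real Cortez variables.
students <- data.frame(id = seq_len(395),
                       G2 = sample(0:20, 395, replace = TRUE))

# Hold back the last 45 rows as the test set. No shuffling is done here
# because the rows are assumed to be in random order already.
n_test   <- 45
test_idx <- seq(nrow(students) - n_test + 1, nrow(students))

train <- students[-test_idx, ]
test  <- students[test_idx, ]

c(train = nrow(train), test = nrow(test))
```

If the row order were not random, sampling the test indices with sample() would be the safer choice.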

Now we need to train the model using the training data. From our decision tree we know that the prior attainment variables G1 and G2 are important, as are the Fjob and reason variables. We fit a logistic regression model to predict final from these four variables.
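The fit described above might look like the following sketch. The data are simulated so the block runs on its own; the simulation recipe and the object names train and fit_full are assumptions, and the coefficients it prints will not match those of the real data.

```r
set.seed(1)
n <- 350  # assumed training-set size

# Simulated stand-in for the training data (not the real Cortez records).
G1     <- sample(3:19, n, replace = TRUE)
G2     <- pmin(pmax(G1 + sample(-2:2, n, replace = TRUE), 0), 20)
Fjob   <- factor(sample(c("at_home", "health", "other", "services", "teacher"),
                        n, replace = TRUE))
reason <- factor(sample(c("course", "home", "other", "reputation"),
                        n, replace = TRUE))
G3     <- G2 + round(rnorm(n, sd = 1.5))
final  <- factor(ifelse(G3 >= 10, "pass", "fail"), levels = c("fail", "pass"))
train  <- data.frame(G1, G2, Fjob, reason, final)

# family = binomial gives logistic regression. With a factor response the
# first level ("fail") counts as failure, so the model describes P(pass).
fit_full <- glm(final ~ G1 + G2 + Fjob + reason, data = train, family = binomial)

summary(fit_full)$coefficients
```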

The model does appear to suffer from overdispersion, and the p-values associated with reason are all non-significant. Following Crawley’s recommendation, we change the model family argument to family = quasibinomial and then attempt model simplification by removing reason from the model.

We use the more conservative “F-test” to compare models due to the quasibinomial error distribution, after Crawley.
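As a sketch of this comparison, the block below refits the full model with a quasibinomial family, drops reason with update(), and compares the two fits with anova(..., test = "F"). The data are simulated and the object names are assumptions, so the F statistic it prints says nothing about the real data.

```r
set.seed(1)
n <- 350  # assumed training-set size

# Simulated stand-in for the training data (not the real Cortez records).
G1     <- sample(3:19, n, replace = TRUE)
G2     <- pmin(pmax(G1 + sample(-2:2, n, replace = TRUE), 0), 20)
Fjob   <- factor(sample(c("at_home", "health", "other", "services", "teacher"),
                        n, replace = TRUE))
reason <- factor(sample(c("course", "home", "other", "reputation"),
                        n, replace = TRUE))
G3     <- G2 + round(rnorm(n, sd = 1.5))
final  <- factor(ifelse(G3 >= 10, "pass", "fail"), levels = c("fail", "pass"))
train  <- data.frame(G1, G2, Fjob, reason, final)

# Quasibinomial family: same mean model, but the dispersion is estimated.
fit_quasi <- glm(final ~ G1 + G2 + Fjob + reason, data = train,
                 family = quasibinomial)

# Simplified model without the reason term.
fit_no_reason <- update(fit_quasi, . ~ . - reason)

# F-test comparing the simpler model with the fuller one.
cmp <- anova(fit_no_reason, fit_quasi, test = "F")
cmp
```

A large p-value in the Pr(>F) column would suggest the dropped term adds little explanatory power.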

There is no difference in explanatory power between the models, so there is no evidence that reason is associated with a student’s pass or fail in their end of year maths exam. We continue model simplification after using summary() (not shown).

We don’t need the earlier G1 exam result as we have G2 in the model already. What happens if we remove Fjob?

We lose explanatory power, so we need to keep Fjob in the model. This gives us our minimal adequate model. Fjob is a useful predictor, but perhaps we could reduce the number of levels by recoding the variable, as only some of the jobs seem useful as predictors.

## Contrasts

For a better understanding of how R dealt with the categorical variables, we can use the contrasts() function. This function shows how the variables have been dummy coded by R and how to interpret them in a model. Note that by default R orders factor levels alphabetically and treats the first level as the reference category.
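A small self-contained illustration of contrasts() follows. The factors are built by hand here rather than taken from the fitted model, using the five job categories as level names.

```r
# A factor with the five father's-job categories, entered in arbitrary order.
Fjob <- factor(c("teacher", "health", "services", "at_home", "other"))

# Levels are sorted alphabetically, so "at_home" comes first.
levels(Fjob)

# Treatment (dummy) coding: one 0/1 column per non-reference level.
# The reference level "at_home" is the row of zeros.
contrasts(Fjob)

# Same idea for the binary response: fail is coded 0, pass is coded 1.
final <- factor(c("pass", "fail"))
contrasts(final)
```

Each Fjob coefficient in the model is therefore a comparison with the at_home group.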

## Model interpretation

The smallest p-value here is associated with G2. The positive coefficient for this predictor suggests that an increase in G2 is associated with an increase in the probability of final = pass. To be precise, a one-unit increase in G2 is associated with an increase in the log odds of pass of 2.0357671.
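Log odds are hard to read directly, so it can help to exponentiate the coefficient into an odds ratio. The sketch below uses the G2 coefficient quoted above; the 30% baseline probability is a hypothetical value chosen only to show the conversion.

```r
# G2 coefficient reported above, on the log-odds scale.
beta_G2 <- 2.0357671

# Exponentiating gives the multiplicative change in the odds of a pass
# for each extra mark in G2, holding the other predictors fixed.
odds_ratio <- exp(beta_G2)
odds_ratio  # about 7.66

# Effect on a probability: take a hypothetical student with a 30% chance
# of passing and add one G2 mark on the log-odds scale.
p_before <- 0.30
p_after  <- plogis(qlogis(p_before) + beta_G2)
p_after
```

The change in probability depends on the starting point, which is why the coefficient is usually reported on the odds scale.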

We first predict, for each student in the test set, the probability of a pass using the glm() built on the training data. We then create a vector of 45 fails and convert those with a predicted probability greater than 50% into a pass. The predicted passes and failures are compared with the real ones in a table, giving a test error of 4.444% (2 of the 45 test students misclassified).
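The three steps can be sketched as follows. The data are simulated so that the block is runnable, and the names fit, glm.probs and glm.pred are assumptions, so the table and error rate will differ from the figures quoted above.

```r
set.seed(1)
n <- 395  # assumed number of students

# Simulated stand-in for the student data (not the real Cortez records).
G2    <- sample(3:19, n, replace = TRUE)
Fjob  <- factor(sample(c("at_home", "health", "other", "services", "teacher"),
                       n, replace = TRUE))
G3    <- G2 + round(rnorm(n, sd = 1.5))
final <- factor(ifelse(G3 >= 10, "pass", "fail"), levels = c("fail", "pass"))
students <- data.frame(G2, Fjob, final)

train <- students[1:350, ]
test  <- students[351:395, ]

fit <- glm(final ~ G2 + Fjob, data = train, family = binomial)

# Step 1: predicted probability of a pass for each test student.
glm.probs <- predict(fit, newdata = test, type = "response")

# Steps 2 and 3: start with all fails, then flip to pass above 50%.
glm.pred <- rep("fail", nrow(test))
glm.pred[glm.probs > 0.5] <- "pass"

# Confusion table and test error (share of misclassified students).
table(predicted = glm.pred, actual = test$final)
test_error <- mean(glm.pred != as.character(test$final))
test_error
```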

## Model performance

As a last step, we are going to plot the ROC curve and calculate the AUC (area under the curve) which are typical performance measurements for a binary classifier. The ROC is a curve generated by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings while the AUC is the area under the ROC curve. As a rule of thumb, a model with good predictive ability should have an AUC closer to 1 (1 is ideal) than to 0.5.
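To make the definitions concrete, here is a base R sketch that builds the ROC points by sweeping the threshold and computes the AUC in two ways. It is not the code used for the figures in this post: the probabilities and labels are simulated, and dedicated packages offer the same calculation with less work.

```r
set.seed(1)

# Hypothetical test set: 15 fails (0) and 30 passes (1), with simulated
# predicted probabilities that tend to be higher for the passes.
actual <- rep(c(0, 1), times = c(15, 30))
probs  <- ifelse(actual == 1, rbeta(45, 5, 2), rbeta(45, 2, 5))

# Sweep the threshold from high to low and record TPR and FPR each time.
thresholds <- sort(unique(c(0, probs, 1)), decreasing = TRUE)
tpr <- sapply(thresholds, function(thr) mean(probs[actual == 1] >= thr))
fpr <- sapply(thresholds, function(thr) mean(probs[actual == 0] >= thr))

# AUC as the area under the ROC points (trapezoid rule).
auc_trapezoid <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

# AUC as a rank statistic: the chance that a randomly chosen pass gets a
# higher predicted probability than a randomly chosen fail.
n_pos <- sum(actual == 1)
n_neg <- sum(actual == 0)
auc_rank <- (sum(rank(probs)[actual == 1]) - n_pos * (n_pos + 1) / 2) /
  (n_pos * n_neg)

c(trapezoid = auc_trapezoid, rank = auc_rank)
```

Calling plot(fpr, tpr, type = "l") on these vectors draws the curve itself.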

## Conclusion

An accuracy of about 0.96 on the test set (a test error of 4.444%) and an AUC of 0.9884454 are quite good results. However, keep in mind that they are somewhat dependent on the manual split of the data that I made earlier; if you want a more reliable estimate, you would be better off running some kind of cross-validation, such as k-fold cross-validation. The logistic regression also provides coefficients, which allow a quantitative understanding of the association between a variable and the odds of success.

## Leave-one-out cross-validation for Generalized Linear Models

As mentioned above, let’s conduct a cross-validation using the cv.glm() function from the boot package. This function calculates the estimated K-fold cross-validation prediction error for generalized linear models; with its default setting, where K equals the number of observations, this is leave-one-out cross-validation. We produce our model glm.fit based on our earlier learnings, following the cross-validation lab in Section 5.3.2 of James et al., 2014.

The cv.glm() function produces a list with several components. The two numbers in the delta vector contain the cross-validation results: the first is the raw estimate of prediction error and the second is a bias-adjusted version. Our cross-validation estimate for the test error is approximately 0.056.
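A runnable sketch of the leave-one-out call is given below, on simulated data. The model formula final ~ G2 + Fjob follows the minimal adequate model described earlier, while the data recipe and the name cv.err are assumptions. With the default cost function, delta is the mean squared difference between the 0/1 outcome and the predicted probability.

```r
library(boot)

set.seed(1)
n <- 395  # assumed number of students

# Simulated stand-in for the student data (not the real Cortez records).
G2    <- sample(3:19, n, replace = TRUE)
Fjob  <- factor(sample(c("at_home", "health", "other", "services", "teacher"),
                       n, replace = TRUE))
G3    <- G2 + round(rnorm(n, sd = 1.5))
final <- factor(ifelse(G3 >= 10, "pass", "fail"), levels = c("fail", "pass"))
students <- data.frame(G2, Fjob, final)

glm.fit <- glm(final ~ G2 + Fjob, data = students, family = binomial)

# Leaving K at its default (the number of rows) gives leave-one-out CV:
# the model is refitted once per student, each time predicting the one
# student that was left out.
cv.err <- cv.glm(students, glm.fit)

# delta[1] is the raw CV estimate, delta[2] the bias-adjusted one.
cv.err$delta
```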

## k-fold cross-validation

The cv.glm() function can also be used to implement k-fold cross-validation. Below we use k = 10, a common choice for k, on our data.
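The 10-fold version only needs the K argument. The sketch below again uses simulated data and assumed object names. It also shows an optional misclassification cost function, which reports the share of wrong pass or fail calls at a 50% cut-off instead of the default squared error; that second call is an addition for illustration, not part of the original analysis.

```r
library(boot)

set.seed(17)
n <- 395  # assumed number of students

# Simulated stand-in for the student data (not the real Cortez records).
G2    <- sample(3:19, n, replace = TRUE)
Fjob  <- factor(sample(c("at_home", "health", "other", "services", "teacher"),
                       n, replace = TRUE))
G3    <- G2 + round(rnorm(n, sd = 1.5))
final <- factor(ifelse(G3 >= 10, "pass", "fail"), levels = c("fail", "pass"))
students <- data.frame(G2, Fjob, final)

glm.fit <- glm(final ~ G2 + Fjob, data = students, family = binomial)

# 10-fold CV: the rows are randomly split into 10 groups and each group
# is predicted by a model fitted to the other nine.
cv.err.10 <- cv.glm(students, glm.fit, K = 10)
cv.err.10$delta

# Optional: misclassification rate at a 0.5 cut-off as the cost.
miss_cost <- function(y, yhat) mean(abs(y - yhat) > 0.5)
cv.miss.10 <- cv.glm(students, glm.fit, cost = miss_cost, K = 10)
cv.miss.10$delta
```

Because the folds are drawn at random, the estimates change slightly from run to run unless the seed is fixed.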

On this data set, using this model, the two estimates are very close for leave-one-out (K = n) and K = 10. The error estimates are small, suggesting the model may perform OK if applied to predict future student final pass or fail.

## References

• Cortez and Silva (2008). Using data mining to predict secondary school student performance.
• Crawley (2004). Statistics: An Introduction Using R.
• James et al. (2014). An Introduction to Statistical Learning with Applications in R. Springer.
• http://www.r-bloggers.com/how-to-perform-a-logistic-regression-in-r/
• https://archive.ics.uci.edu/ml/datasets/Student+Performance