To understand the different layers of full-stack development, it can be useful to produce a reference deployment of your model. This can be a good way to jump-start deployment, as it allows experienced engineers (who are better suited to true production deployment) to tinker and experiment with your work, test corner cases and build acceptance tests.
We’ll work through an example using the Student Performance dataset that we have seen a few times on this blog. We are interested in predicting whether students are likely to pass or fail their end-of-year exam (G3 variable above a made-up threshold of 10). Again we use the Maths results only, reading the data off the web from our GitHub data repo.
To help with the wrangling and tidying of data, I have developed a series of data stories on GitHub, which provide some standard useful code for preparing and exploring data. We employ some of that here. Given our history with this data we don’t go into detail. See if you can follow the code.
Training and test datasets
We need to split the data so we can build the model and then test it, to see if it generalises well. This gives us confidence in the external validity of the model. The data arrived in a random order, so we don’t need to worry about sampling at random.
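As a minimal sketch of a positional split, here is one way it could look in base R. The toy data frame, its values and the 75% training fraction are all assumptions for illustration; they are not the real student data or the split used in this post.

```r
# Toy stand-in for the prepared student data (hypothetical values).
set.seed(42)
students <- data.frame(
  failures = sample(0:3, 100, replace = TRUE),
  G1       = round(runif(100), 2),
  pass     = factor(sample(c("fail", "pass"), 100, replace = TRUE))
)

# Rows are assumed to be in random order already, so we split by position.
train_fraction <- 0.75  # assumed proportion
n_train <- floor(train_fraction * nrow(students))
train <- students[seq_len(n_train), ]
test  <- students[-seq_len(n_train), ]

# Row counts of each part
c(train = nrow(train), test = nrow(test))
```

If the rows were not already shuffled, we would draw the training indices with sample() instead of taking the first rows.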
Building the model
Prior to building the model we prepare some model evaluation tools to report the model quality. As a reminder, the random forest approach is useful because it tries to de-correlate the trees in its ensemble by randomising the set of variables considered at each split. Each tree is also grown on a bootstrapped sample drawn from the training data.
We train a simple random forest classifier.
Notice the negligible fall-off from training to test performance; the default random forest provided an OK fit. However, we are more interested in exporting the model, so we move on to that now. If interested, run this code to examine variable importance (try to guess which variables are the most useful for predicting end-of-year exam performance).
Deploying models by export
Training the model is the hard part; let’s export our finished model for use by other systems. By exporting the model we let our development partners deal with the difficult parts of production development. We chose the randomForest function because its help suggests that the underlying trees are accessible using the getTree function. Our forest is big but simple.
Save the workspace
Training the model and exporting it are likely to happen at different times. We can save the workspace that includes the random forest model and load it along with the randomForest library prior to export at a later date if required. We show how to save the workspace below, or you could save the randomForest object using the saveRDS function.
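A minimal sketch of both options follows. The placeholder list stands in for the fitted model, and the file names are hypothetical; we write to a temporary directory so nothing is left behind.

```r
# Stand-in for the fitted model; any R object is saved the same way.
fmodel_standin <- list(ntree = 500, note = "placeholder for the random forest")

# Option 1: save a single object and read it back later.
model_path <- file.path(tempdir(), "fmodel.rds")  # hypothetical file name
saveRDS(fmodel_standin, model_path)
restored <- readRDS(model_path)  # can be assigned to any name

# Option 2: save every object in the global workspace.
workspace_path <- file.path(tempdir(), "workspace.RData")  # hypothetical file name
save.image(workspace_path)
# load(workspace_path) restores the objects under their original names;
# call library(randomForest) first so the model's methods are available.
```

saveRDS is usually the tidier choice when only the model needs to travel, since it does not drag the rest of the session along with it.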
A random forest model is a collection of decision trees. A decision tree is a series of tests traditionally visualised as a diagram of decision nodes. With the random forest saved as an object we can define a function that joins the tree tables from the random forest getTree method into one large table of trees. This can then be exported as a table representation of the random forest model that can be used by developers.
We look at the first decision tree from our random forest model, fmodel. We can also count the number of rows in the decision table.
We can also view the output as a matrix. We could export it like this if we want to avoid character data.
Interpreting the decision tree as a table
Read the help using ?getTree. Below we set the argument labelVar=TRUE, which returns the output as a data frame with more human-readable labels for the splitting variables and predicted class.
We demonstrate the interpretation using an example. Imagine you had a test case for the student Joe Bloggs; a non-romantic student, who has failed three times before and with first term (G1) scaled attainment score of 0.22. Joe has promised he has turned over a new leaf since hearing about the use of machine learning in his school!
We start at the first row and will proceed until we have a prediction for our student at a terminal node (a row with the status variable as -1 and left daughter and right daughter variables as zero; e.g. rows 6, 9 and 10 are terminal nodes).
Start at row one and ask: is your student’s number of failures less than or equal to the split point?
For numerical predictors, observations with values of the variable less than or equal to the splitting point go to the left daughter node. Our student failed three times, and this is greater than the split point. However, we must be careful to transform our inputs in the same way as we did when training our model. We could do this by finding the percentile that our student’s number of failures falls in and reminding ourselves of the distribution of the failures variable (in production this would be automated; we show it here for understanding).
Three failures is the maximum seen and was therefore scaled to one. One is greater than the split point therefore we proceed to the right daughter row of the decision table (row 3).
At row 3 we ask whether Joe Bloggs’s scaled G1 score is less than or equal to 0.28.
Joe scored 0.22, which is less than 0.28, thus we proceed to the left daughter (row 6).
At row 6 we notice zeroes and NA; we also notice a status of -1. We are at a terminal node! A decision has been made: Joe Bloggs is at risk of failing!
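The walk we just did by hand can be sketched as a small function. The table below is a hand-made stand-in in the same column layout as getTree with labelVar=TRUE; only the path taken by Joe (row 1 to row 3 to row 6, with the 0.28 split on G1) follows the text, and every other split point and prediction is a hypothetical value. The function name predict_tree is ours too.

```r
# Hand-made decision table (hypothetical values except Joe's path).
tree <- data.frame(
  `left daughter`  = c(2, 4, 6, 8, 10, 0, 0, 0, 0, 0, 0),
  `right daughter` = c(3, 5, 7, 9, 11, 0, 0, 0, 0, 0, 0),
  `split var`      = c("failures", "G1", "G1", "G1", "G1", NA, NA, NA, NA, NA, NA),
  `split point`    = c(0.5, 0.45, 0.28, 0.3, 0.6, 0, 0, 0, 0, 0, 0),
  status           = c(1, 1, 1, 1, 1, -1, -1, -1, -1, -1, -1),
  prediction       = c(NA, NA, NA, NA, NA, "fail", "pass", "fail", "pass", "pass", "pass"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Follow one case down the table until a terminal node (status -1) is reached.
predict_tree <- function(tree, x) {
  node <- 1
  while (tree[node, "status"] != -1) {
    variable <- tree[node, "split var"]
    # Numerical predictors: less than or equal to the split point goes left.
    go_left <- x[[variable]] <= tree[node, "split point"]
    node <- if (go_left) tree[node, "left daughter"] else tree[node, "right daughter"]
  }
  list(node = node, prediction = tree[node, "prediction"])
}

# Joe Bloggs, with inputs already scaled as they were for training.
joe <- list(failures = 1, G1 = 0.22)
result <- predict_tree(tree, joe)
```

This sketch only handles numerical splits; categorical splits would need the binary-expansion rule from the getTree help.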
Always make sure your inputs in production are bounded
What would happen if a student failed four times? Would the production model predictions be able to cope? Developers can help you to defend against such problems. This is one issue with exporting a model: you have to produce a specification of the data treatment.
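One possible defence is to clamp each scaled input to the range seen in training. The sketch below assumes min-max scaling with a training minimum of 0 and maximum of 3 for failures; the function name scale_bounded is hypothetical, and the real data treatment may differ.

```r
# Scale to [0, 1] using the training range, then clamp anything outside it.
scale_bounded <- function(x, train_min, train_max) {
  scaled <- (x - train_min) / (train_max - train_min)
  pmin(pmax(scaled, 0), 1)
}

# Zero, three and four previous failures (four was never seen in training).
scaled_failures <- scale_bounded(c(0, 3, 4), train_min = 0, train_max = 3)
```

Here a student with four failures is treated like one with three. Whether to clamp, reject or flag such cases is a decision worth writing into the specification.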
Always make sure your predictions in production are bounded
For a classification problem, the predicted class probabilities are automatically bounded between 0 and 1. If this were a regression, we would want to limit the predictions to lie between the minimum and maximum observed in the training set.
According to the help, for categorical predictors, the splitting point is represented by an integer, whose binary expansion gives the identities of the categories that goes to left or right.
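To make the binary expansion concrete, here is a small sketch that decodes such an integer. The function name left_categories is ours. The example mirrors the one we recall from the getTree help: with four categories and a split point of 13 (1*2^0 + 0*2^1 + 1*2^2 + 1*2^3), categories 1, 3 and 4 go left.

```r
# Return the indices of the categories that are sent to the left daughter.
left_categories <- function(split_point, n_categories) {
  # Bit k (counting from the least significant) belongs to category k.
  bits <- (split_point %/% 2^(0:(n_categories - 1))) %% 2
  which(bits == 1)
}

goes_left <- left_categories(13, n_categories = 4)
```

Developers would also need the order of the factor levels from the training data, since the bits refer to category positions rather than labels.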
How do I convert this into a percentage?
You can think of each decision tree in your forest as being one expert which has a slightly different life experience. It’s seen different students and might have prioritised some variables over others (sort-of). If each expert votes pass or fail then you can produce a percentage or probability of each outcome for each new student.
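As a quick sketch, the vote share is just a proportion table. The votes below are made up for illustration (320 of 500 trees voting fail); in practice each vote would come from evaluating one tree on the new student.

```r
# Hypothetical votes from a forest of 500 trees for one student.
votes <- c(rep("fail", 320), rep("pass", 180))

# Share of trees voting for each outcome, as a proportion and a percentage.
vote_share   <- prop.table(table(votes))
vote_percent <- round(100 * vote_share, 1)
```

With these made-up votes the student would be reported as having a 64% chance of failing.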
Seeing the forest of the trees
So your developer partners would need access to all the decision trees in a table, so that they can build tools to read the trees and evaluate them on new data. We simply need to define a function that joins the tree tables from the random forest getTree() method into one large table of trees. We write the table as a tab-separated values table (or whatever is easy for the developers’ software to read).
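A minimal sketch of such a function is below. The name forest_to_table and the added tree_id and node_id columns are our own choices, and the demonstration fits a tiny forest to the built-in iris data as a stand-in for fmodel. The demonstration only runs if the randomForest package is installed.

```r
# Stack the getTree() table of every tree, tagging rows with tree and node ids.
forest_to_table <- function(model) {
  trees <- lapply(seq_len(model$ntree), function(k) {
    tree <- randomForest::getTree(model, k = k, labelVar = TRUE)
    data.frame(tree_id = k, node_id = seq_len(nrow(tree)), tree,
               check.names = FALSE, stringsAsFactors = FALSE)
  })
  do.call(rbind, trees)
}

if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(1)
  # Stand-in model: five trees on iris, not the student model.
  toy_model <- randomForest::randomForest(Species ~ ., data = iris, ntree = 5)
  forest_table <- forest_to_table(toy_model)

  tsv_path <- file.path(tempdir(), "forest.tsv")  # hypothetical file name
  write.table(forest_table, tsv_path, sep = "\t", row.names = FALSE, quote = FALSE)
}
```

The node_id column matters because the daughter columns refer to row numbers within each tree, which are lost once the tables are stacked.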
Open the raw text file we produced and inspect it. You should see 500 trees of varying thickness (number of nodes). Delve into the tenebrious forest to discover insight and excellent prediction accuracy.
We can adjust our output for our colleagues as necessary, mapping between R objects and JSON using jsonlite.
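As a sketch of that mapping, here is a tiny hand-made table of tree rows converted to JSON and back. The three rows, and the tree_id and node_id columns, are hypothetical. The conversion only runs if jsonlite is installed.

```r
# A tiny stand-in for the exported table: one tree with a single split.
tree_rows <- data.frame(
  tree_id = c(1, 1, 1), node_id = 1:3,
  `left daughter` = c(2, 0, 0), `right daughter` = c(3, 0, 0),
  `split var` = c("G1", NA, NA), `split point` = c(0.28, 0, 0),
  status = c(1, -1, -1), prediction = c(NA, "fail", "pass"),
  check.names = FALSE, stringsAsFactors = FALSE
)

if (requireNamespace("jsonlite", quietly = TRUE)) {
  # One JSON object per row; missing values are written as null.
  tree_json  <- jsonlite::toJSON(tree_rows, dataframe = "rows", na = "null", pretty = TRUE)
  # Reading it back gives a data frame again.
  round_trip <- jsonlite::fromJSON(tree_json)
}
```

Row-wise objects tend to be the easiest shape for developers to consume, but the dataframe argument also offers column-wise output if that suits them better.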
You should be comfortable exporting a random forest model to others allowing model evaluation to be reimplemented in a production environment. If it were just coefficients of a linear regression it would be even easier!
Deploying models as R HTTP services
An alternative that is also quite easy to set up is to expose the R model as an HTTP service. One could copy the code from this GitHub example and modify it for our specific example. See the comments in the repo for guidance; for a more detailed tutorial, see this older blog post or this one using googleVis.
Deploy model through a simple API
We can also build a sort of “black-box” model which is accessible through a web-based API. The advantage of this is that a web call can be very easily made from (almost) any programming language, making integration of the ML model quite easy. Below we show you the structure of how you might do this based on Bart’s example. For brevity I did not complete this but you get the idea…
The result is that we now have a live web-based API. We can post data to it and get back a predicted value. We could post a query from the command line by calling the URL with curl and passing the necessary student characteristics.
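To give a flavour of what such an API could look like, here is a rough sketch written for the plumber package. This is a swapped-in technique chosen for illustration, not the approach from Bart’s example. The scorer is a single hand-written rule standing in for the trained forest, and the file name, endpoint path, port and function names are all assumptions.

```r
# Save as api.R (hypothetical file name).
# Lines starting with "#*" are plumber annotations; to plain R they are comments.

# Stand-in scorer: one hand-written rule, NOT the trained forest.
score_student <- function(failures, G1) {
  if (failures > 0.5 && G1 <= 0.28) "fail" else "pass"
}

#* Predict pass or fail for one student
#* @param failures Scaled number of previous failures (0 to 1)
#* @param G1 Scaled first-term score (0 to 1)
#* @post /predict
predict_endpoint <- function(failures, G1) {
  # Query parameters arrive as text, so convert and bound them first.
  failures <- min(max(as.numeric(failures), 0), 1)
  G1       <- min(max(as.numeric(G1), 0), 1)
  list(prediction = score_student(failures, G1))
}

# To serve it (requires the plumber package):
#   plumber::plumb("api.R")$run(port = 8000)
# and then, from a shell:
#   curl -X POST "http://localhost:8000/predict?failures=1&G1=0.22"
```

In a real service the scorer would load the saved model (for example with readRDS) once at start-up and call predict on each request.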
Show your model off; export it or set up an HTTP service or build an API.