Data Scientist

The Game of Kaggle

The Player of Games

What’s the best way to teach oneself machine learning? Is it to do an online course, write a blog or compete in online programming competitions?

This blog post describes my first game of Kaggle.

Kaggle history

In 2010, Kaggle was founded as a platform for predictive modelling and analytics competitions on which companies and researchers post their data and statisticians and data miners from all over the world compete to produce the best models. This crowdsourcing approach relies on the fact that there are countless strategies that can be applied to any predictive modelling task and it is impossible to know at the outset which technique or analyst will be most effective.

Gamification

I was interested in the gaming element and how users of Kaggle describe themselves as players. This blog post describes my first foray into Kaggle with the well-known Titanic survival problem.

A Titanic problem

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking (Rose found a door to float on), some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, Kaggle ask us to complete the analysis of what sorts of people were likely to survive. In particular, they ask us to apply the tools of machine learning to predict which passengers survived the tragedy.

Challenge accepted!

Getting the data

The data can be downloaded from here. I saved the files into a newly created directory called titanic in my R folder. We are provided with train and test datasets: we train our predictive model on the former, then assess how accurately it predicts whether each passenger in the test data is likely to survive or not based on their characteristics.

Passenger characteristics

If we compare the training and testing datasets we notice that the training set has an additional variable. This is the response variable we are trying to predict in this supervised learning task. Our machine learning algorithms need to identify characteristics which are important in determining whether a passenger is likely to survive (Survived = 1) or not (Survived = 0).

This data has a lot more to it than the Titanic dataset which comes with the datasets package in R.

Tables are not that effective for spotting patterns in the data. Plotting can be helpful, even for complicated multivariate categorical data. Let’s have a look at our training data.
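The post’s plotting code is not shown; a minimal sketch in base R, assuming the Kaggle train.csv has been saved in the titanic directory with its standard column names (Sex, Survived):

```r
# Minimal sketch: mosaic plot of sex against survival in the training data.
# Assumes train.csv from Kaggle is saved in the titanic directory.
train <- read.csv("titanic/train.csv")

# Bar widths are proportional to group sizes; tile heights show the
# survival split within each sex.
mosaicplot(table(train$Sex, train$Survived),
           main = "Survival by sex", xlab = "Sex", ylab = "Survived")
```

Base R’s mosaicplot() is one option; the ggplot2 or vcd packages offer fancier alternatives.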

“Women and children first” (or to a lesser extent, the Birkenhead Drill).

The relative width of the bars in the mosaic plot tells us that there were more men on board the Titanic (although judging such differences is a limitation of the plot, as humans struggle to discriminate between different-sized areas). It is also apparent that a randomly selected woman was more likely to survive (Survived = 1) than a randomly selected man (who was more likely to die, Survived = 0). This could be the basis of a simple predictive tool to use on the test data if we were feeling lazy. We pursue this here to outline how one would participate in a Kaggle problem.

This simplistic approach can be modelled using logistic regression, i.e. a generalised linear model with a binomial error distribution and logit link function. We can use the coef() function to access just the coefficients; the estimates and their standard errors are in logits. We can use the predict() function to predict the probability that each passenger in the test data survives or dies based purely on their sex. Given our simplistic model, the higher probabilities should pair with females and the lower with males.

However, passengers either survived (Survived = 1) or died (Survived = 0); none of this probability malarkey. Thus we convert each prediction to a 1 if it is greater than 0.5, or a 0 if it is less than or equal to 0.5. We would then tabulate against the actual survival status of the test passengers, if we had it…
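The original code is not shown, but a minimal sketch of this sex-only model might look like the following (assuming the Kaggle column names Survived, Sex and PassengerId):

```r
# Sketch, assuming train.csv / test.csv from Kaggle are in the titanic directory
train <- read.csv("titanic/train.csv")
test  <- read.csv("titanic/test.csv")

# Logistic regression of survival on sex alone
fit_sex <- glm(Survived ~ Sex, data = train, family = binomial)
coef(fit_sex)  # estimates on the logit scale

# Predicted survival probabilities for the test passengers,
# converted to hard 0/1 predictions at a 0.5 threshold
p_hat    <- predict(fit_sex, newdata = test, type = "response")
survived <- ifelse(p_hat > 0.5, 1, 0)

# Kaggle expects a two-column CSV: PassengerId and Survived
write.csv(data.frame(PassengerId = test$PassengerId, Survived = survived),
          "titanic/sex_model_submission.csv", row.names = FALSE)
```

The file written at the end is in the two-column format the submission page expects.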

Kaggle is great, no peeking at the test data! How do we go about submitting our predictions?

Submission

To submit your predictions, go to the Titanic submission page and upload your file. This got me a respectable accuracy of ~77% and ranked me in the mid-3000s (of 4000 or so), which you can easily Tweet to impress your friends. Interestingly, some players managed a prediction accuracy of 0%…

That was just a quick demo of how you might go about this problem. Let’s continue to look at the data and put a bit more effort into generating a suitable model, one that accounts for the relevant characteristics of the passengers and how they affect the probability of survival. I’m interested in how far we can improve model accuracy using simple out-of-the-box strategies, and how far we can climb the leaderboard! I don’t expect to win, but as a data scientist I am keen to learn the strategies and processes that yield the biggest gains in predictive accuracy across a variety of modelling problems; Kaggle provides just that!

Learning through gaming on Kaggle

This simplistic model aside, an important aspect of machine learning that Kaggle elucidates is how easy it is to make a decent predictive model using out-of-the-box algorithms and techniques, even if we are new to the art. There are plenty of blog posts which expand on this Titanic data set and come up with clever ways of improving model performance. Feature engineering is particularly neat.

Class effects?

Do these individual graphics miss anything? Including a main effect of Pclass might be useful.

What about class and sex effects and the interaction between the two?

There appears to be an interaction between class and sex: the effect of class on the chance of survival differs between the sexes.

Improving the model

We iterate and improve the model. Exploring the data visually identifies variables that might improve the model if incorporated. We include passenger sex, passenger class and the interaction between the two in the model.
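A sketch of this model, again assuming the standard Kaggle column names (the original code is not shown):

```r
train <- read.csv("titanic/train.csv")

# Sex, class, and their interaction; in R's formula notation
# Sex * Pclass expands to Sex + Pclass + Sex:Pclass
fit_int <- glm(Survived ~ Sex * Pclass, data = train, family = binomial)
summary(fit_int)
```

Note that Pclass is read in as an integer, so it enters the model as a numeric predictor here, matching the per-unit interpretation discussed below.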

We submit our new, improved model. It results in exactly the same predictions for each test passenger! We have not improved our position on the leaderboard. Looking closely, we see that we are still not getting enough information to distinguish any nuance: with two sexes and three classes, we only get six possible probabilities from the logistic regression based on the coefficients of our model fit to the training data. Although, if we inspect the coefficients, we actually have fewer than that…

Model Interpretation

The coefficients are from the linear predictor. They are on the transformed scale, so because we are using binomial errors they are in logits. We convert from a logit to a proportion.
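The back-transformation can be done by hand or with R’s built-in plogis(), the inverse of the logit:

```r
# The logit of a proportion p is log(p / (1 - p)); the inverse
# transformation recovers a proportion from a value on the logit scale.
inv_logit <- function(x) exp(x) / (1 + exp(x))

inv_logit(0)  # a logit of 0 corresponds to a proportion of 0.5
plogis(0)     # the built-in inverse logit gives the same answer
```

Applying this to a coefficient (or sum of coefficients) from the linear predictor converts it to a survival proportion.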

The (Intercept) term says that the mean survival rate of first-class females (female comes before male alphabetically, with Pclass a variable of class integer) was 99.8%. First-class females were unlikely to die.

What about the parameter for Sexmales? Remember that with categorical explanatory variables the parameter values are differences between means. So to get the male survival rate we add 0.00235 to the intercept before back transforming.

This suggests that the survival rate for men was almost half that of women.

What about class? For every unit change in Pclass, as you move between classes from higher (1), to middle (2) to lower (3), the log odds of survival decreases.

There is also a significant interaction term: there is compelling evidence that the sexes responded differently to their class status in terms of their likely survival. There appears to be a rough halving of survival between 1st and 3rd class for both sexes, but among women, 2nd-class passengers were not far behind their 1st-class counterparts in probability of survival.

Tree based methods

We use a decision tree approach to build the set of splitting rules used to segment the predictor space. This is to improve on the interpretation of the logistic regression model, which can be difficult due to the use of a non-intuitive but mathematically convenient logit link function. By providing a simple and useful interpretation, it may assist us in deciding how to proceed in order to improve our prediction accuracy.

We go ahead and fit a decision tree after modifying the data slightly. We specify the qualitative variables as factors, ordered where appropriate. The new variables Fare and SibSp are included in the model. Presumably Fare is the amount paid for a ticket and SibSp the number of siblings or spouses a passenger had aboard.
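The post’s tree-fitting code is not shown; a sketch using the rpart package (with rpart.plot for the node diagrams, an assumption on my part) might be:

```r
library(rpart)
library(rpart.plot)  # assumed here for plotting; not confirmed by the post

train <- read.csv("titanic/train.csv")

# Specify qualitative variables as factors, ordered where appropriate
train$Survived <- factor(train$Survived)
train$Sex      <- factor(train$Sex)
train$Pclass   <- factor(train$Pclass, ordered = TRUE)  # 1st, 2nd, 3rd

# Classification tree on sex, class, fare and number of siblings/spouses
tree <- rpart(Survived ~ Sex + Pclass + Fare + SibSp,
              data = train, method = "class")
rpart.plot(tree)
```

With method = "class", rpart builds a classification tree whose leaves vote for the majority outcome of the training passengers falling into them.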

The root node, at the top, reveals that 62% of passengers died, while 38% survived. The number above these proportions indicates the way the node is voting (at this top level it defaults to everyone dying, coded as zero and coloured grey); the number at the bottom of the node indicates the proportion of the population that resides in this node, or bucket (here at the top level it is everyone, 100%).

All the men are assigned to the second node (labelled 2). This is a bucket of death: we assign all males to die, even though only 81% of the men died. More men were on board, so this bucket contains 65% of the total passengers (of the training data).

This decision tree provides quick rules of thumb for whether your future was bright (orange) or grey (dead). If you were a woman in the lowest class, your chances were reduced compared with those of the wealthier classes. However, if you paid a high-end fare your chances improved despite being in the lowest class. Such intricacies, and the low number of samples in some of the leaves (or buckets, or nodes) near the bottom, suggest we may be in danger of overfitting.

Let’s make some predictions from this simple tree and see if we can beat the simple Sex model we built earlier. We get our test data into a similar format to our training data, with appropriate factor classes. We then use predict() to predict whether the test passengers are likely to survive given their characteristics, and write this to a file ready for submission!
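A sketch of the prediction and submission step (refitting the tree so the snippet stands alone; the post’s own code is not shown):

```r
library(rpart)

train <- read.csv("titanic/train.csv")
test  <- read.csv("titanic/test.csv")

# Same factor handling for both datasets
train$Survived <- factor(train$Survived)
train$Sex      <- factor(train$Sex)
test$Sex       <- factor(test$Sex)
train$Pclass   <- factor(train$Pclass, ordered = TRUE)
test$Pclass    <- factor(test$Pclass,  ordered = TRUE)

tree <- rpart(Survived ~ Sex + Pclass + Fare + SibSp,
              data = train, method = "class")

# type = "class" returns each leaf's majority vote rather than probabilities
pred <- predict(tree, newdata = test, type = "class")

write.csv(data.frame(PassengerId = test$PassengerId, Survived = pred),
          "titanic/tree_submission.csv", row.names = FALSE)
```

rpart handles missing predictor values in the test data via surrogate splits, so predict() should not fall over on incomplete rows.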

An improvement of 0.02392 in accuracy (about 2.4 percentage points); this moved us up the leaderboard almost 2000 places.

Pruning the tree

Inspecting the tree suggested it might be overly complex. We might want to prune the tree by recursively snipping off the least important splits based on the complexity parameter (I used trial and error for this value after reading the help file).
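Pruning in rpart looks something like the following sketch (the tree is refitted so the snippet stands alone; the cp value shown is illustrative, since the post does not give the value it settled on):

```r
library(rpart)

train <- read.csv("titanic/train.csv")
train$Survived <- factor(train$Survived)
train$Sex      <- factor(train$Sex)
train$Pclass   <- factor(train$Pclass, ordered = TRUE)

tree <- rpart(Survived ~ Sex + Pclass + Fare + SibSp,
              data = train, method = "class")

printcp(tree)  # complexity-parameter table: cross-validated error per split

# Snip off splits whose improvement falls below cp
# (cp = 0.02 is a hypothetical value for illustration)
pruned <- prune(tree, cp = 0.02)
```

The printcp() table is the usual guide here: picking the cp with the lowest cross-validated error is a common alternative to trial and error.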

Now we compare how this does to our previous models. Did the reduction in complexity avoid overfitting and a reduction in test error or are we now missing out on modelling genuine complexity or patterns in the data?

This tweaking caused no change in accuracy and failed to improve on our original decision tree! However, we did manage to prune away a bunch of nodes, or leaves, if you compare the two trees. These extra nodes and branches were not useful for making predictions and added unnecessary complication to the model. Although we did not reduce our test error, we improved our understanding: the data suggest that within third class there may have been another stratum of sub-class which influenced a female’s chance of survival. The wealthier women in third class, who paid a larger Fare, were less likely to die.

KaggleR

This is what Kaggle is all about: brute-forcing the solution to a problem by getting hundreds of top minds (teams or individuals), interacting with computers, to try a vast combination of machine learning methods and parameter settings, all working their way up the leaderboard. This combined effort will usually produce a much better solution to a predictive problem than an in-house team could generate. The gamification is also satisfying for the players, and an interesting alternative for learning predictive analytics with machine learning as a complement to the traditional textbook or online course.