There seems to be nothing the British press likes more than a good house price story. Accordingly, we use the Kaggle dataset on house prices as a demonstration of the data science workflow. With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home (every dataset has a story, see here for details). I found this dataset particularly interesting, as it tells someone new to the housing market what variables one should ask questions about when buying a house.

We start by downloading the data from Kaggle and reading the training data into R using the `readr` package, a member of the excellent package of packages that is the `tidyverse`. We check that the variables, or features, have the appropriate data class.
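A minimal sketch of this step (assuming the Kaggle CSVs have been downloaded to a `data/` folder; the paths and object names are illustrative):

```r
library(tidyverse)  # loads readr, dplyr, ggplot2, etc.

# read_csv() guesses the column types; check its report against the
# data dictionary (character columns will later need converting to factors)
train <- read_csv("data/train.csv")
test  <- read_csv("data/test.csv")

# inspect the guessed classes for each variable
glimpse(train)
```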

Inspection of some of our factors reveals that, unsurprisingly, the levels have not been ordered correctly. The ordering has not been explicitly set, as that is not the default for `factor`; R defaults to alphabetical order. This loses us information: if we compare what R thinks is the case with what should be the case, we see a discrepancy. Setting the levels correctly for each factor could improve our predictions.

`BsmtQual`: Evaluates the height of the basement

| Code | Description | Inches |
|------|-------------|--------|
| Ex | Excellent | 100+ |
| Gd | Good | 90-99 |
| TA | Typical | 80-89 |
| Fa | Fair | 70-79 |
| Po | Poor | <70 |
| NA | No Basement | NA |
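For example, the levels of `BsmtQual` can be set explicitly as an ordered factor (a sketch; each ordinal variable would need the same treatment with its own level set):

```r
# R's alphabetical default is Ex, Fa, Gd, Po, TA -- which is wrong;
# set the true quality ordering from poorest to best
train$BsmtQual <- factor(train$BsmtQual,
                         levels = c("Po", "Fa", "TA", "Gd", "Ex"),
                         ordered = TRUE)
levels(train$BsmtQual)
```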

### Missing data

Every messy dataset is messy in its own way - Hadley Wickham

There is missing data for a variety of combinations of the variables. Visualising the indicator matrix of missing values is a useful shortcut (`visna()`). The columns represent the variables with missing data and the rows the missing patterns; here we have plenty of missing patterns. The bars beneath the columns show the proportions of missingness by variable and the bars on the right show the relative frequencies of the patterns. Most of the data is complete.
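A sketch of this plot, assuming the `extracat` package (which provides `visna()`) is installed:

```r
library(extracat)

# indicator matrix of missingness: columns are variables containing NAs,
# rows are the distinct patterns of missingness across observations
visna(train)
```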

The missing data is found in about a dozen of the variables (those on the left of the plot), and a handful of them contribute the bulk of the missingness (note the heavy skew): `PoolQC`, `MiscFeature`, `Alley` and `Fence` tend to be missing. These variables warrant closer inspection of the supporting documentation to suggest why this might be the case.

### Keep it simple

For now, we will keep things simple by dropping all categorical variables. We also remove the artificial `Id` variable.
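One way to do this with `dplyr` (a sketch; `train_numeric` is an illustrative name):

```r
library(dplyr)

# keep only the numeric columns, then drop the artificial row identifier
train_numeric <- train %>%
  select_if(is.numeric) %>%
  select(-Id)
```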

### Outliers

Let’s look for any unusual outliers that may affect our fitted model during training. As `SalePrice` is our response variable and what we are trying to predict, we get a quick overview using the scatterplot matrix for numeric data. We don’t plot it here but provide the code for you to explore (it takes a minute to compute).

Let’s take a closer look at `GrLivArea`; there seem to be a few outliers. Accordingly, I would recommend removing any houses with more than 4000 square feet from the data set (which eliminates these five unusual observations), as suggested in the dataset’s documentation.
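In code (assuming the numeric training data is in `train_numeric`, as above):

```r
# drop the handful of very large houses, per the dataset documentation
train_numeric <- filter(train_numeric, GrLivArea <= 4000)
```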

## Inspecting numeric variables correlated with the response variable

This display of the correlation matrix shows the variables most strongly associated with `SalePrice`. This provides a good starting point for modelling and feature selection. For example, `OverallCond` shows poor correlation with `SalePrice`; perhaps we need to adjust this variable to improve its information content, or perhaps buyers ignore the condition and think of the property as a fixer-upper opportunity. As you can see, there is huge depth to the data and it would be easy to feel overwhelmed. Fortunately, we’re not trying to win the competition, just produce some OK predictions quickly.
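A sketch of how these correlations can be computed (assuming the cleaned numeric data is in `train_numeric`):

```r
# pairwise-complete correlations between the numeric variables
correlations <- cor(train_numeric, use = "pairwise.complete.obs")

# variables ranked by their correlation with the response
sort(correlations[, "SalePrice"], decreasing = TRUE)
```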

Premature optimization is the root of all evil - Donald Knuth

## Regression

To simplify the problem and celebrate the `mlr` package release (or at least my discovery of it), I use some of the package’s tools for regression and feature selection here. For a detailed tutorial, from which this post draws heavily, see the mlr home page. Some Kagglers have also contributed many and varied useful ideas about this problem.

### Machine Learning Tasks

Learning tasks encapsulate the data set and further relevant information about a machine learning problem, for example the name of the target variable for supervised problems, in this case `SalePrice`.

As you can see, the Task records the type of the learning problem and basic information about the data set, e.g., the types of the features (numeric vectors, factors or ordered factors), the number of observations, or whether missing values are present.
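A sketch of the task construction (assuming `train_numeric` holds our cleaned numeric training data):

```r
library(mlr)

# a regression task bundles the data with the name of the target variable
regr_task <- makeRegrTask(id = "ames",
                          data = as.data.frame(train_numeric),
                          target = "SalePrice")
regr_task  # printing shows feature types, observation count, missings
```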

### Constructing a Learner

A learner in `mlr` is generated by calling `makeLearner`. In the constructor we specify the learning method we want to use. Moreover, you can:

- Set hyperparameters.
- Control the output for later prediction, e.g., for classification whether you want a factor of predicted class labels or probabilities.
- Set an ID to name the object (some methods will later use this ID to name results or annotate plots).
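For example, a random forest learner might be constructed as follows (a sketch; any regression learner key known to `mlr`, such as `"regr.lm"`, could be substituted):

```r
# a random forest regression learner with a hyperparameter and an ID set
regr_lrn <- makeLearner("regr.randomForest",
                        id = "rf",
                        par.vals = list(ntree = 500))
```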

### Train

Training a learner means fitting a model to a given data set. In `mlr` this can be done by calling the function `train` on a Learner and a suitable Task.

Function `train` returns an object of class `WrappedModel`, which encapsulates the fitted model, i.e., the output of the underlying R learning method. Additionally, it contains some information about the Learner, the Task, the features and observations used for training, and the training time. A `WrappedModel` can subsequently be used to make predictions for new observations.
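Continuing the sketch (assuming the learner and task defined above):

```r
# fit the random forest to the Ames task; returns a WrappedModel
regr_mod <- train(regr_lrn, regr_task)

# the underlying randomForest fit is recoverable if needed
getLearnerModel(regr_mod)
```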

### Predictions
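A sketch of prediction on new data (assuming `test_numeric` holds the test set prepared in the same way as the training data):

```r
# predict() on a WrappedModel and new data returns a Prediction object
pred <- predict(regr_mod, newdata = as.data.frame(test_numeric))

# the predicted sale prices as a plain numeric vector
head(getPredictionResponse(pred))
```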

### Submission

Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
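The metric is easy to compute locally:

```r
# RMSE between the logged predictions and the logged observed prices,
# as Kaggle scores it
rmse_log <- function(predicted, observed) {
  sqrt(mean((log(predicted) - log(observed))^2))
}
```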

We then submit this on Kaggle!
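A sketch of assembling the submission file (assuming `pred` holds the Prediction object from above; the filename is illustrative):

```r
# two-column submission file: Id plus the predicted SalePrice
submission <- tibble(Id = test$Id,
                     SalePrice = getPredictionResponse(pred))
write_csv(submission, "submission.csv")
```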

## Feature engineering

We likely want to code categorical variables into dummy variables and think about how to combine or use the available variables for this regression problem to further reduce our RMSE below 0.27. There are also some tools for feature filtering in `mlr`.
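For example, features can be scored against the target with a filter method (a sketch; some filter methods require additional packages such as `FSelector`):

```r
# rank each feature by its linear correlation with the target
fv <- generateFilterValuesData(regr_task, method = "linear.correlation")
plotFilterValues(fv)
```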

For details on how to do that, see the `mlr` tutorial pages.

## Conclusion

Following house prices is a national obsession. Here we present an alternative to the traditional `MASS::Boston` suburban house values: a more contemporary, comprehensive and complicated data set. This contributes to the Kaggle learning experience by providing Kagglers with a mild introduction to machine learning with R, specifically the `mlr` package. We predict house prices with a respectable 0.27 RMSE using out-of-the-box approaches. Feature selection will help nudge the accuracy towards the dizzying heights of the Kaggle leaderboard, albeit with a high demand on the Kaggler’s time and insight.

## References

- Bischl, B., Lang, M., Richter, J., Bossek, J., Judt, L., Kuehn, T., … Kotthoff, L. (2015). mlr: Machine Learning in R, 17, 1-5. Retrieved from http://cran.r-project.org/package=mlr
- De Cock, D. (2011). Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project. Journal of Statistics Education, 19(3), 1-15.