### Matt Upson

Yo no soy marinero

# Getting to grips with 'Elements of statistical learning'

### Overview

Last week I joined a reading group for the weighty tome Elements of Statistical Learning. I really like the idea of this group; interesting as it is - it can be hard to maintain the drive to wade through a text like this. Working through it week on week with a group of like-minded people is a great way to overcome this.

# Linear models

In this post I implement in R some of the ideas that are presented in the first 2 chapters of Elements of Statistical Learning, namely: least squares. I’ve written about linear models before; whilst doing Andrew Ng’s excellent Machine learning I wrote a toy R package vlrr to implement linear regression with regularisation mainly as an exercise in ackage development.

## Prediction from linear models

Linear models are the bread and butter of statistics. They are pretty fundamental, and I don’t want to write too much about them here for that reason.

However, one thing I like about the way that Hastie et al. present the subject is in terms of matrix operations, giving a vectorized implementation that can easily be translated into code.

In simple statistics, we are used to seeing linear models represeted as:

Where our prediction $y$ is dependent on a rate or slope $b$, and a constant or intercept $a$. Simple.

Hastie, et al. present the more general form of this equation1:

Here, the prediction $Y$ is given by the addition of the intercept (or bias) $\beta_0$ and the sum of the dot product of $X_{1..m}$ and $\beta_{1..m}$ where $X$ is an $n$ by $m$ dimensional matrix ($X\in\mathbb{R}^{n \times m}$), and $\beta$ is an $n$ dimensional vector (or later as we shall see: $\beta\in\mathbb{R}^{k \times n}$ - where $k$ is the number of models we wish to apply).

By including a constant of 1 as the first column vector of the matrix $X$, it is possible to greatly simplify this equation into an matrix inner product:

This can be a bit of a leap, so I break this down more simply here. Let us create an input vector $\vec{x}$ where $\vec{x}\in\mathbb{R}^{m}$: i.e. there are $m$ training examples, and in this case: $m=10$.

To get my prediction out of $\vec{x}$, I need to supply coefficients $a$ and $b$ (or $\beta$ if we use Hastie et al’s notation; $\theta$ if we use Andrew Ng’s).

In this case, I’m going to use a very simple model by setting $a = 1$, and $b = 10$. I’m not worrying about how we get these coefficients here, and nor should you be - we’re just interested in the predictions we are going to make using these our input $\vec{x}$ and our coefficients $\beta$

So this is how I calculate that prediction in R:

Of course, R has vectorised the operation here, so every element of our vector $\vec{x}$ was multiplied by $a$. We can express this as:

## Thinking in matrices

### Matrix operations recap

A quick reminder of matrix multiplication (for my benefit if nothing else). To multiply a matrix $A\in\mathbb{R}^{3 \times 2}$ with matrix $B\in\mathbb{R}^{2 \times 3}$, we multiply $A$ with each of the columns of $B$ to give $C\in\mathbb{R}^{3 \times 3}$. Remember that the number of columns of $A$ must equal the number of rows of $B$, e.g:

So, for instance:

### Linear model by matrix inner product

First we place our coefficients $a$ and $b$ into the column vector $\beta$, and add the constant 1 to the input vector $x$ to give the $n + 1$ (one input vector, and the constant 1) by $m$ (the number of training examples) matrix $X$.

So applying those quick recaps to our equation $Y=X^T\beta$, we get2:

In R we can do this simply with:

We can check this against the earlier calculation of $y$:

# Least squares

Ok, this is all well and good, we can make a prediction $Y=X^T\beta$, but how on earth do we get $\beta$. In previous posts, I variously used gradient descent, the BFGS algorithm, and the ‘normal equation’ or least squares method, which is what I will reproduce here.

This method provides an exact solution for a given linear model, which is handy, but there are situations where this method may not be appropriate. The main issue with the normal equation, is that when dealing with very large amounts of data i.e. $n>10,000$ then the imperative to solve the matrix inverse $(X^TX)^{-1}$ means that it can be computationally expensive. In addition, there are cases when the matrix given by $(X^TX)$ will not be invertible, and so will simply not work. This typically occurs when feaures $X_i$ and $X_j$ are linearly dependent, or when there are too many input features, i.e. $X$ is wider than it is long, i.e. $p»n$ problems.

To calculate $\beta$ we can simply solve the equations:

This is the notation that Hastie, et al. use, and RSS stands for the residual sum of squares. This simplifies (in matrix notation) to:

In R, this looks like solve(t(X) %*% X) %*% (t(X) %*% y), which should return 10 1:

### QR decomposition

Actually it turns out that using this solution is not the most efficient. Leisch3 counsels against it and instead, the base implemetation of lm() uses QR decomposition.

A quick Google, and you will see that QR decomposition has been considered to be one of the most influential algorithms of the 20th Century. In simple terms4, a QR decomposition is the breaking down of a matrix into two product matrices with specific properties. If we start with a matrix $M$, QR decomposition would give us $M=QR$ where $Q$ is an orthogonal matrix, and $R$ an upper triangular matrix.

So for the matrix $X$ that we have been using so far, we can do this in R with the following:

This gives us:

I’m not going to go into any more detail here, but suffice it to say that the qr object can simply be solved in R to return our coefficients $\beta$:

The explanation for why QR decomposition is favoured over solving the normal equation rests in part in the expensive operation $(X^TX)^{-1}$. In my experiments (which were admittedly not very scientific), the QR method seemed to take as little as half the time of least squares when trying to solve $X\in\mathbb{R}^{m \times n}$ for large matrices. Furthermore, where $n$ is much larger than $m$ (say 10 times), the normal equation fails completely, and will return the following error in R:

system is computationally singular: reciprocal condition number

whilst the QR method will at least complete (see the underlying .Rmd for an example I tried).

# Linear models for classification

So now we have seen how to get the parameters $\beta$, I will use a linear model in anger. Here I reproduce the example by Hastie et al. to show a simple linear model used for two class classification.

## Generate some data

First we generate data based on two distinct normal distributions, which we will seek to separate usin gthe linear model. I’ve copied this code from my earlier post on k-means.

In the code chunk below I use Hadley’s excellent purrr package to create 10 bivariate normal distributions, which are then plotted together. The reason for this will become apparent when I move onto nearest neighbour methods in my next post.

I set $G\in{0,1}$ (having divided the bivariate distributions roughly by eye); so now we can train a model based on $G$ to find a decision boundary.

Now lets plot the data to see what we have.

## Applying the linear model

Now lets try to apply a linear model to the data, using $G$ as the explanatory variable.

For the sake of argument, I use both the normal equation for least squares, but also use the $QR$ decomposition method.

And we can check that these match…

Great. So how does our model do?

So, most of the time, this very simple model is sufficient to make this binary classification. Only in 20/300 cases do we get a Type II error (a false negative), whilst in 2/300 cases we get a Type I error (a false positive).

To plot the decision boundary, we need to create a grid of predictions which we can then divide by running the linear algorithm on. The following function does this, then we can include the boundary on a plot with geom_contour().

And, plotting it out using shapes to indicate the predictions, we can see that the decision boundary runs a little above the top of the actual $0$ class. Anything above this line, our model has predicted to be $G=1$, and below it $G=0$. The two pink triangles are our false positives, whilst the blue circles below the line are our false negatives.

I look forward to delving deeper into The Elements of Statistical Learning.

1. From this equation onwards, I drop Hastie et al’s convention of denoting matrices with $\hat{}$.

2. One of the confusing parts of this notation is that we don’t actually want to transpose $X\in\mathbb{R}^{10 \times 2}$ into $X^{T}\in\mathbb{R}^{2 \times 10}$, as $X^T$ will not be conformable with $\beta\in\mathbb{R}^{2 \times 1}$. Instead, we want an inner product which is $X \cdot \beta$ or each row of $X$ multiplied by each column of $\beta$; in R this is X %*% beta, not t(X) %*% beta

3. https://cran.r-project.org/doc/contrib/Leisch-CreatingPackages.pdf