Matt Gregory bio photo

Matt Gregory

Data Scientist

Twitter Github

Sometimes one gets handed so much data you don’t know where to begin! You might not even have an associated response variable to complement the hundreds of explanatory variables provided. Instead of attempting to make predictions we can try to make sense of the data using unsupervised learning techniques.

This is challenging in contrast to supervised learning, where there is a simple goal for the analysis, here there is no way to check our work because we don’t know the true answer (we have no training set to compare our predictions against).

It can be hard staying on top of all this esoteric knowledge and having a full understanding. Instead it may be preferred to have a “sorta” understanding of what Principal Components are and when we should use them (Dennett, 2013).

Principal Component Analysis (PCA) is a common technique for finding patterns in data of high dimension. For the application of PCA to genetic data, take a look at the paper by Reich et al 2008. The authors argue, more generally, for a careful use of the analysis tool when interpreting data. Supporting the notion, that the knowledge of how a method works, helps to avoid misinterpretation and strengthens the conclusions one draws (see here for a more detailed discussion of PCA).

I remember using PCA for the first time as a Zoology undergraduate on the characteristics of Lavender plants either hosting a Crab Spider or not. I then went on to use the principal component score vectors as features for regression, this is another use for PCA. At the time I didn’t really understand the methodology so seek to address that retrospective learning opportunity here.

What are Principal Components?

Principal components allow us to reduce the number of dimensions (representative variables) of a data set and make it more manageable or interpretable while still explaining most of the original variability in the data.

We paraphrase the excellent stack overflow answer:

Imagine the explanatory variables or features of your data contain some redundant information (information that is captured by another feature) then there is the potential to summarise the same data with fewer new features. PCA does this by finding features that show as much variation across observations or, put another way, PCA seeks a small number of dimensions that are as interesting as possible, where the concept of interesting is determined by how much the observations vary along each dimension.

Given that we are only interested in variance and not interested in the scale, the variable should be centered to have mean zero prior to PCA.

Visualising PCA

We demonstrate PCA by comparing the outcomes of the methodology on:

  • two uncorrelated variables
  • two correlated variables

Get some data

Let’s keep it simple and generate our own data; we use just two dimensions so that we can visualise what’s happening.

We draw two variables $x$ and $y$ from different random normal distributions. Thus, we might expect there to be little to no correlation between the variables; they are uncorrelated.

x <- rnorm(n = 30, mean = 10, sd = 3)
y <- rnorm(n = 30, mean = 35, sd = 12)
df <- tibble::as_tibble(list("x" = x, "y" = y))
## # A tibble: 30 x 2
##            x        y
##        <dbl>    <dbl>
##  1 10.577476 50.15179
##  2  5.659895 48.19939
##  3  9.030458 21.51734
##  4 14.866888 76.35295
##  5  7.932928 26.36333
##  6 16.126367 31.07775
##  7 12.831337 28.64271
##  8 16.245781 41.28983
##  9 15.751352 22.04468
## 10  8.755563 34.33867
## # ... with 20 more rows

Centre the data

For PCA to work properly, you have to subtract the mean from each of the data dimensions. The mean subtracted is the average across each dimension. So, all the $x$ values have $\bar{x}$ (the mean of the $x$ values of all the data points) subtracted, and all the $y$ values have $\bar{y}$ subtracted from them. This produces a data set whose mean is zero.

# df adjusted, centred on zero
df_adj <- dplyr::mutate(df, 
                        x = x - mean(x),
                        y = y - mean(y))
## # A tibble: 30 x 2
##             x          y
##         <dbl>      <dbl>
##  1 -0.4749589  13.663036
##  2 -5.3925400  11.710634
##  3 -2.0219762 -14.971419
##  4  3.8144537  39.864188
##  5 -3.1195070 -10.125424
##  6  5.0739320  -5.411007
##  7  1.7789027  -7.846046
##  8  5.1933460   4.801075
##  9  4.6989172 -14.444074
## 10 -2.2968713  -2.150090
## # ... with 20 more rows

Or plotted, where unadjusted are black and adjusted are red (zero on each dimension shown by blue dashed line). Note how we centred the mean on zero but didn’t scale the standard deviation of each variable to one.

plot(x = df$x, y = df$y, 
       xlim = c(-10, 20),
       ylim = c(-40, 80),
       xlab = "x",
       ylab = "y")
abline(h = 0, col = "blue", lty = "dashed")
abline(v = 0, col = "blue", lty = "dashed")
# add the points of df_adj
points(x = df_adj$x,
       y = df_adj$y,
       col = "red", 
       pch = 19)

plot of chunk 2017-09-24_scatter_df

Scaling the Standard Deviation

Page 381 in “An Introduction to Statistical Learning” argues that we should typically scale the standard deviation of each variables to performing PCA. In our example, given the scales of the variables are similar, and we are lazy, we chose not to. We accept that you should normally, especially if the standard deviation of the variables differs considerably.

Calculate the covariance matrix

Covariance is always measured between two dimensions. If you have a $p$ dimensional data set, then the matrix has $p$ rows and columns (so is square) and each entry in the matrix is the result of calculating the covariance between two separate dimensions. E.g.. the entry on row 2, column 3, is the covariance value calculated between the 2nd dimension and the 3rd dimension.

Since the data is 2-dimensional, the covariance matrix will be (2 by 2).

xy_cov <- cov(df_adj)
##          x         y
## x 11.29917  18.31917
## y 18.31917 198.31912

What this means is that the variance of the $x$ variable is 11.2991702, the variance of the $y$ variable is 198.3191235, and the covariance between them is 18.3191701. As it is a square symmetric matrix, it can be diagonalized by choosing a new orthogonal coordinate system, given by its eigenvectors (incidentally, this is called spectral theorem); corresponding eigenvalues will then be located on the diagonal.

Calculate the eigenvectors and eigenvalues of the covariance matrix

Given the covariance matrix is square we can calculate the eigenvectors and eigenvalues which tell us some useful information about our data (they allow us to decompose matrices using algebra).

We can calculate the determinant first if you like that sort of thing. It can be viewed as the scaling factor of the transformation described by the matrix and is commonly used in mathematics to represent the coefficients in a system of linear equations.

## [1] 1905.25

Eigenvectors and eigenvalues of the covariance matrix

The determinant allows the eigendecomposition of the covariance matrix (xy_cov) for the eigenvalues and eigenvectors (of length one).


xy_eig <- eigen(xy_cov)
## eigen() decomposition
## $values
## [1] 200.096647   9.521646
## $vectors
##            [,1]        [,2]
## [1,] 0.09657723 -0.99532549
## [2,] 0.99532549  0.09657723

That all sounds very impressive, but what does it mean? We plot the normalised data (mean subtracted) with the eigenvectors of the covariance matrix overlayed on top. We can visualise the eigenvectors as we know they go through the origin (as we centred every dimension on zero by subtracting the mean) and the eigenvector providing the gradient (note how there are two unique gradients). We colour them green here.

plot(x = df_adj$x,
       y = df_adj$y,
       xlab = "x",
       ylab = "y",
       col = "red",
     pch = 19,
     cex = 0.8)
# origin axes
abline(h = 0, col = "blue", lty = "dashed")
abline(v = 0, col = "blue", lty = "dashed")
# eigenvectors
abline(a = 0, b = xy_eig$vectors[1, 1],
       col = "green", lty = "dashed")
abline(a = 0, b = xy_eig$vectors[1, 2],
       col = "green",
       lty = "dashed")

plot of chunk 2017-09-24_scatter_eig

As expected from the covariance matrix, the two variables produce a “sky at night” or random distribution suggesting weak correlation (remember how we generated the data). On top of the data we have plotted both the eigenvectors as well. They are almost parallel to each other. The eigenvectors of the covariance matrix allow us to characterise the data.

At this stage it is important to remind ourselves that our data we produced from two random normal distributions, and on average, are likely to be uncorrelated. We demonstrated this so that when you’re doing PCA you might appreciate what the different values for the eigenvectors and eigenvalues might mean in terms of how related two variables might be. This could be extended by running this many times and due to random sampling you might expect to see some strongly correlated variables through chance.

Correlated variables

Let’s repeat the process with two correlated variables. To make the process a bit more compact we’ll use a function to define a class we call “eigenised”.

eigeniser <- function(x, y) {
  # as dataframe
  df <- tibble::as_tibble(list("x" = x, "y" = y))
  # Scale
  df_adj <- dplyr::mutate(df,
                          x = x - mean(x),
                          y = y - mean(y))
  # Calc covariance matrix
  covar <- cov(df_adj)
  # eigenise
  eig <- eigen(covar)
  # Define the class here
  output <- structure(list("df" = 
                           "df_adj" = 
                           "covar" = covar,
                           "eig" = eig
                      class = "eigenised"

We provide our function with some new data that is more strongly correlated.

two_correlated_variables <- eigeniser(x = rnorm(n = 30, mean = 10, sd = 3),
                                      y = (x + runif(n = 30, min = -3, max = 3)) )

plot of chunk 2017-09-24_scatter_df_two

We plot our object as before (we could write a method to do this) which is easier using the “eigenised” class as everything is stored in one object.

plot(x = two_correlated_variables$df_adj$x,
       y = two_correlated_variables$df_adj$y,
       xlab = "x",
       ylab = "y",
       col = "red",
     pch = 19,
     cex = 0.8)
# origin axes
abline(h = 0, col = "blue", lty = "dashed")
abline(v = 0, col = "blue", lty = "dashed")
# eigenvectors
abline(a = 0, b = two_correlated_variables$eig$vectors[1, 1],
       col = "green", lty = "dashed")
abline(a = 0, b = two_correlated_variables$eig$vectors[1, 2],
       col = "green",
       lty = "dashed")

plot of chunk 2017-09-24_eig_df_two

The eigenvectors appear as green diagonal dotted lines perpendicular to one another on the plot. They provide us with information about the patterns in the data. See how one of the eigenvectors goes through the middle of the points, like drawing a line of best fit? That eigenvector is showing us how these two data sets are related along that line. The second eigenvector gives us the other, less important, pattern in the data, that all the points follow the main eigenvector, but vary about the main line by some amount.

We can see how much more important one eigenvector is than the other by looking at the size of the eigenvalues. As we simulated this data to be highly correlated we can see that one eigenvalue is much larger than the other, compare this to our previous example when the two variables were uncorrelated. We also note how our data appear to be more strongly correlated as characterised by taking the eigenvectors of the covariance matrix.

## eigen() decomposition
## $values
## [1] 24.819142  1.597739
## $vectors
##           [,1]       [,2]
## [1,] 0.6463588 -0.7630336
## [2,] 0.7630336  0.6463588

Now that we have the eigenvectors ranked by eigenvalue order we can decide to ignore the principal components with lower eigenvalues (those that characterise the data poorly). This in effect reduces the number of dimensions required to represent the data. Here we could go from the original two dimension of $x$ and $y$ to using just the first principal component.

two_correlated_variables$eig$vectors[ , 1]
## [1] 0.6463588 0.7630336

We can check all this by comparing it to the inbuilt functions for PCA, prcomp.

pr_out <- prcomp(two_correlated_variables$df, scale = FALSE)
##         PC1        PC2
## x 0.6463588 -0.7630336
## y 0.7630336  0.6463588

Woo, they’re the same! To compute the proportion of variance explained by each of the principal components, we simply divide the variance (standard deviation squared) explained by each principal component by the total variance explained by both prinicpal components.

pr_var <- pr_out$sdev ^ 2
## [1] 24.819142  1.597739

Recognise these values? That’s right, they’re our eigenvalues; a measure of the variance in the data explained by each of the principal components.

## [1] 24.819142  1.597739

And as a proportion of the total variance.

pve <- pr_var / sum(pr_var)
## [1] 0.93951828 0.06048172

Given the high correlation between the two variables (and the limited noise in that correlation) our first principal component captures most of the signal. That message is reinforced below, where the PC1 captures most of the variation. The data varies little about PC2.

biplot(pr_out, scale = 0)

plot of chunk 2017-09-24_biplot

Derive the new data

Paraphrasing from here.

After choosing the principal components (eigenvectors) that we wish to keep in our data and formed a feature vector, we simply take the transpose of the vector and multiply it on the left of the original data set, transposed.

# tranpose just makes it easier to work with
final_data <- 
  t(two_correlated_variables$eig$vectors[, 1]) %*%
# we re-transpose it to produce our 30 observations reduced to 1 dimension
final_nice <- tibble::as_tibble(list(PCA1 = t(final_data)[, 1]))
## # A tibble: 30 x 1
##         PCA1
##        <dbl>
##  1 16.723723
##  2  6.850355
##  3 14.395403
##  4 22.930774
##  5  9.489425
##  6 21.041405
##  7 20.372188
##  8 23.022007
##  9 20.990495
## 10 12.571902
## # ... with 20 more rows

Thus we have reduced the number of dimensions in our data by using the first principal component.

So what?

Basically we have transformed our data so that it is expressed in terms of the patterns between them, where the patterns are the lines or eigenvectors that most closely describe the relationships between the data. This is helpful because we have now classified our data point as a combination of the contributions from each of those lines and in fewer dimensions.

Initially we had the simple $x$ and $y$ axes. This is fine, but the $x$ and $y$ values of each data point don’t really tell us exactly how that point relates to the rest of the data. Now, the values of the data points tell us exactly where (i.e. above or below) the trend lines the data point sits. The single-eigenvector decomposition removed the contribution due to the smaller eigenvector (PC2) and left us with data that is only in terms of the other (PC1).

Take home message

PCA is just a method of summarising data. We might have a whole bunch of variables describing many different observational units. Many of these variables will measure related properties and so will be redundant (or another way, they are highly correlated with one another). If so, we should be able to summarize each observational unit with fewer features. This is what PCA is all about!



  • Dennett, D. C. (2013). Intuition pumps and other tools for thinking. New York: W.W. Norton & Co..
  • James, G., Witten, D., Hastie, T., Tibshirani, R., (2013). Chapter 10 of An Introduction to Statistical Learning.
  • Reich, D., Price, A., and Patterson (2008). Principal component analysis of genetic data. Nat Genet, 40(5):491–2
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: OS X El Capitan 10.11.6
## Matrix products: default
## BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## other attached packages:
## [1] bayesAB_0.7.0 Matrix_1.2-9  ggplot2_2.2.1 simecol_0.8-7 deSolve_1.14 
## [6] bindrcpp_0.2 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.12     knitr_1.16       bindr_0.1        magrittr_1.5    
##  [5] munsell_0.4.3    lattice_0.20-35  colorspace_1.3-2 R6_2.2.2        
##  [9] rlang_0.1.2      minqa_1.2.4      highr_0.6        stringr_1.2.0   
## [13] plyr_1.8.4       dplyr_0.7.3      tools_3.4.0      grid_3.4.0      
## [17] rmd2md_0.1.1     gtable_0.2.0     lazyeval_0.2.0   yaml_2.1.14     
## [21] checkpoint_0.4.0 assertthat_0.2.0 digest_0.6.12    tibble_1.3.4    
## [25] codetools_0.2-15 glue_1.1.1       evaluate_0.10    stringi_1.1.5   
## [29] compiler_3.4.0   scales_0.4.1     pkgconfig_2.0.1