Last time we looked at the classic approach of PCA, this time we look at a relatively modern method called t-Distributed Stochastic Neighbour Embedding (t-SNE). The paper is fairly accessible so we work through it here and attempt to use the method in R on a new data set (there’s also a video talk).
The data science process often starts with visualisation; we want to see the data in an attempt to make sense of it. This problem of making visual sense of the data has become more problematic in recent years due to the size and dimensionality of data sets that scientists encounter. As our ability to measure the world improves, we find our old methods for interpreting the data are inadequate, thus we must iterate. Simply providing the user with a tool to view high dimensionality data in two-dimensions is not enough, the user needs help in interpreting the data also which can be achieved through dimensionality reduction methods such as t-SNE. These methods convert high dimensional data into two or three dimensions appropriate for scatterplotting.
The aim of dimensionality reduction is to preserve as much of the significant structure of the high-dimensional data as possible in the low-dimensional map. In our last blog post we looked at a linear method, this time we consider non-linear method which is superior at keeping datapoints that are similar on a low dimensional manifold closer together (a manifold is a shape that does not self intersect e.g. a line or a circle; a figure eight is not a manifold).
The aim of dimensionality reduction is to preserve as much of the significant structure of the high-dimensional data as possible in the low-dimensional map. Rather than keeping dissimiliar data points apart (like linear methods i.e. PCA), t-SNE keeps the low-dimensional representations of very similar datapoints close together on a low-dimensional, non-linear manifold (the map).
This concept can be made more interpretable by considering a Swiss roll (a non-linear manifold):
- PCA is mainly concerned with preserving large pairwise differences in the map (the squared error).
- Are these measures reliable?
- Which of the lines (solid or dashed) better captures the similarity between the two data connected points?
PCA is not very useful here as it preserves the unreliable large pairwise distances (dashed red-line Euclidean distance is unrepresentative of similarity of the two points). A better representation of the distance between these two points is captured by the blue line, as this considers the structure of the manifold.
What is reliable here are the small distances between points. Thus a method that focuses on preserving these distances between nearest neighbours in the mapping should be superior for these types of data.
Reading the paper some new concepts (represented by hyperparameters when tuning the model) arise, such as perplexity. We’ll consider these later when implementing the method using code. The paper breaks down the method into a couple of steps which is also explained clearly on Wikipedia:
- Use PCA to reduce the dimensions to a manageable number for pair-wise similarity calculations (i.e. 30) (the function we use later does this by default).
- Convert the high-dimensional data into a matrix of pair-wise similarities map.
a. t-SNE constructs a probability distribution over pairs of high-dimensional objects in such a way that similar objects have a high probability of being picked, whilst dissimilar points have an extremely small probability of being picked. b. t-SNE defines a similar probability distribution over the points in the low-dimensional map, and it minimizes the Kullback–Leibler divergence between the two distributions with respect to the locations of the points in the map.
- Visualise the low dimensional (2-D or 3-D) map using a scatterplot.
Walk before you can run
We develop a toy example to help us get our head around what’s happening and the effect of tuning of the hyperparamaters on the tSNE visualisation. We did this as when learning a new techniques it’s tempting to deploy it straight away on your data without getting to grips with how it works and what the output will look like when you have input simulated data (e.g. a simple signal, or random noise). Eventually you may develop a sort of intuition for this useful tool in your arsenal.
Generate some toy data
Let’s start with a simple 2-D problem of two widely seperated clusters. We generate two clusters in the two dimensions
y by drawing from two different normal distributions far apart on both dimensions.
We can consider
df our training set. We drop the class or
id dimension from the T-SNE. The class information is not used to determine the spatial coordinates of the map points. The coloring thus provides a way of evaluating how well the map preserves the similarities within each class.
The perplexity is a hyperparamater we must tune and it sorta says how to balance attention between local and global aspects of your data. The parameter is, in a sense, a guess about the number of close neighbors each point has. The perplexity value has a complex effect which we explore in the following figures. Maaten & Hinton (2008) recommend a cost function parameter for perplexity of between 5-50. The perplexity should also be lower than the number of points otherwise weird stuff happens. We suggest you read the
?Rtsne help file to make yourself aware of all the arguments therein.
With a perplexity of two our tool has performed poorly. There’s no sign of two nicely seperated clusters! Let’s investigate the impact of adjusting the perplexity by writing a custom function for this specific case to do the legwork.
Better, but still one out of place.
If we increase the perplexity any further the function protects us by erroring and reminding us that the perplexity is too large given the small number of data points.
These series of figures have warned us against just drawing one t-SNE plot. It’s important to try a variety of perplexities as in real data setting you won’t know what the generative distributions looked like. For me, this caveat makes t-SNE a dangerous magic box, as you could use it to confirm what you want to see. That’s why it’s doubly important to be aware of all the common misuses and misconceptions when using this method (protect yourself by reading the paper and a few different blogs on the topic).
There’s some stochasticity involved which can result in different conclusions. There’s some nice videos demonstrating this here. We run the t-SNE several times with different random seeds.
Note how the first and second run are similar whereas the third looks like there could be more than two clusters of data. This might be overcome by checking we have set the
max_iter-ations to high enough to allow the structure to stabilise, or running multiple plots with different seeds.
Number of steps
max_iter defaults to 1000 but there is no best number (we used 500 above in our custom function). This wil vary from data set to data set thus requiring more than one t-SNE plot before settling on what you think is stability.
The MNIST data
Prior to attempting this methodlogy on a novel data set, let’s test it on the famous MNIST digits data minimising our time spent on pre-processing and foramtting. The MNIST data set contains 60,000 grayscale images of handwritten digits. In the paper’s experiments, they randomly selected 6,000 of the images for computational reasons. The digit images have 28×28 = 784 pixels (i.e., dimensions).
We’re going to use the MNIST data shared by this blog as it’s ready to go.
Looking at the data reveals how the image data is stored.
MNIST data intro from Kaggle
The first column, called “label”, is the digit that was drawn by the user. The rest of the columns contain the pixel-values of the associated image.
Each pixel column in the training set has a name like
x is an integer between 0 and 783, inclusive. To locate this pixel on the image, suppose that we have decomposed
x as x = i * 28 + j, where i and j are integers between 0 and 27, inclusive. Then pixelx is located on row i and column j of a 28 x 28 matrix, (indexing by zero).
pixel31 indicates the pixel that is in the fourth column from the left, and the second row from the top, as in the ascii-diagram below.
Visually, if we omit the “pixel” prefix, the pixels make up the image like this:
000 001 002 003 … 026 027 028 029 030 031 … 054 055 056 057 058 059 … 082 083 | | | | … | | 728 729 730 731 … 754 755 756 757 758 759 … 782 783
Given the above, we use the arguments
as.matrix to rearrange the variables to match the pixels position for imaging.
Reflect on how you are good at identifying digits. Are any of these digits ambiguous to you? What about to a young child? We then proceed to use the
Rtsne function which takes a while to run. Prior to optimisation, we reduce the data size to make things run faster for our convenience and to reduce plotting density.
Given our reduced training set and our removal of the labels it’s difficult to make out ten distinct clusters. How many can you make out? How many would you think there were without prior knowledge? This is why t-SNE can be a bit dangerous in that you can squint and see patterns that might not be there or reinforce already held beliefs. Bare in mind the common fallacies introduced above and a few extras ( described by this blog ):
- Cluster sizes in any t-SNE plot must NOT be evaluated for standard deviation, dispersion or any other similar measures. This is because t-SNE expands denser clusters and contracts sparser clusters to even out cluster sizes. This is one of the reasons for the crisp and clear plots it produces.
However, if we do add labels we can imagine what features of the digits, in terms of their pixels, are affecting the clustering.
Those digits that are clustered closer together share similar charcteristics. Think about in your life which digits cause you confusion when discerning which is which and why that might be the case. An obvious example is the number seven which can be written with a cross-bar or not, this might explain the groupings close to some fours above. A few sevens also “look” like ones. This sort of reasoning might help you pick out features of a data set that might help a machine learning classifier with its training and prediction accuracy. We could make this easier by plotting the actual images of the digits on the t-SNE plot as our plotting characters to see what features they are being clustered by. We must also remember to do more than one plot by tweaking the parameters.
Similar story to the previous plot. Try tuning the parameters yourself and see if you can create any weird effects or get strange behaviour.
Using t-SNE for feature extraction
If you visit Kaggle you often see winning methods using dimension reduction techniques generate extra features in addition to the initial features the contestants are provided with in a competition.
We demonstrate a machine learning workflow you might follow recycling some code from this Kaggle notebook. Although alternative methods are probably better (e.g. XGBoost), we attempt to implement t-SNE into Support Vector Machines (SVM).
We run the t-SNE after removing the labels reducing the digits dimensions from all the pixels to a mapping of a two dimensional non-linear manifold. We then create a training and test data set assuming the data are already randomly ordered using these dimensionally reduced data.
svm to train a support vector machine and then test it using our test data and the
predict function. We inspect the models performance using
confusionMatrix from the caret package.
Nine and four, five and eight seem to be the worst offenders for misclassifications.
Take home message
Not all numbers are created even.
- Maaten, L. and Hinton, G. (2008). Journal of Machine Learning Research 9, 2579-2605
- Wattenberg, et al., “How to Use t-SNE Effectively”, Distill, 2016. http://doi.org/10.23915/distill.00002