t-Distributed Stochastic Neighbor Embedding (t-SNE)

t-Distributed Stochastic Neighbor Embedding (t-SNE), introduced by van der Maaten and Hinton in the original 2008 paper, is a very popular, state-of-the-art dimensionality reduction technique, usually used to map high-dimensional data to 2 or 3 dimensions in order to visualize it. It does so by computing affinities between points and trying to preserve these affinities in the new, low-dimensional space.

Figure: t-SNE algorithm workflow. The loop only stops when Y doesn't change much between iterations.

  1. Compute matrix P from the data using Equation (1) (the equations are reproduced below)
  2. Initialize Y (the embeddings) randomly
  3. Compute matrix Q from the current Y using Equation (2)
  4. Compute the cost from matrices P and Q using Equation (3)
  5. Compute the gradient of the cost with respect to Y and update Y
  6. Go back to step 3
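For reference, here are the equations referenced in the steps above, as given in the original paper (the x_i are the high-dimensional points, the y_i their low-dimensional embeddings, and each σ_i is set per point from the perplexity):

$$
p_{j\mid i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j\mid i} + p_{i\mid j}}{2n}
\tag{1}
$$

$$
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}
\tag{2}
$$

$$
C = \mathrm{KL}(P \,\Vert\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}
\tag{3}
$$

The gradient used in step 5 then has the closed form

$$
\frac{\partial C}{\partial y_i} = 4 \sum_j \left(p_{ij} - q_{ij}\right)\left(y_i - y_j\right)\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}.
$$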

Can t-SNE help feature selection?

t-SNE is mostly used to visualize high-dimensional data by embedding it in a 2D space. Since it ignores the class labels, it can't tell you which variables are important for classification. PCA also ignores the classes, so although it can tell you which variables explain the variance of the data, those may not be the same variables that best distinguish the classes. Linear Discriminant Analysis (LDA) does take the classes into account, so it may be more useful here.

LDA is often a good first approach for transforming your data so that you can visualize it with a scatter plot, but the nature of the method (it only captures linear, between-class variance) limits its usefulness.

Hence, there is a class of algorithms for visualization called manifold learning algorithms that allow for much more complex mappings and often provide better visualizations. A particularly useful one is the t-distributed Stochastic Neighbor Embedding (t-SNE).

For our Image Segmentation dataset, it is difficult to plot the original string labels directly on a scatter plot, so we map them to integer labels:

{BRICKFACE = 0, CEMENT = 1, FOLIAGE = 2, GRASS = 3, PATH = 4, SKY = 5, WINDOW = 6}
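If the labels arrive as strings, this mapping can be applied with a plain dictionary; a minimal sketch, where `raw_labels` is a hypothetical iterable of the original string labels (not defined in this excerpt):

```python
import numpy as np

# The mapping above as a plain dictionary.
label_map = {"BRICKFACE": 0, "CEMENT": 1, "FOLIAGE": 2, "GRASS": 3,
             "PATH": 4, "SKY": 5, "WINDOW": 6}

# raw_labels is a hypothetical iterable of the original string labels.
Ytrain = np.array([label_map[label] for label in raw_labels])
```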

We set n_components to 2 because it is easiest to proceed with a scatter plot of just two components from LDA.

applytsne1(Xtrain, Ytrain)
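The body of `applytsne1` is not shown in this excerpt; a minimal sketch of what it might do, assuming (as the surrounding text suggests) that it fits a 2-component LDA on the training data and scatter-plots the result:

```python
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def applytsne1(Xtrain, Ytrain):
    # Hypothetical sketch: project the data onto 2 LDA components
    # and scatter-plot them, colored by the integer class labels 0-6.
    lda = LinearDiscriminantAnalysis(n_components=2)
    X2d = lda.fit_transform(Xtrain, Ytrain)
    plt.figure(figsize=(8, 6))
    scatter = plt.scatter(X2d[:, 0], X2d[:, 1], c=Ytrain, cmap="tab10", s=10)
    plt.legend(*scatter.legend_elements(), title="class")
    plt.xlabel("LDA component 1")
    plt.ylabel("LDA component 2")
    plt.show()
```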

Figure: scatter plot of the two LDA components, colored by class label.

Data points with labels 5 (SKY) and 3 (GRASS) are well separated from the rest, but most of the other data points still overlap significantly.

Now let us apply t-SNE. But before that, what is t-SNE? The idea behind t-SNE is to find a two-dimensional representation of the data that preserves the distances between data points as well as possible. t-SNE starts with a random two-dimensional representation of each data point, and then tries to bring points that are close in the original feature space closer together, and push points that are far apart in the original feature space farther apart. t-SNE puts more emphasis on points that are close by than on preserving distances between far-apart points. In other words, it tries to preserve the information indicating which points are neighbors of each other. We will use the t-SNE implementation available in scikit-learn.

applytsne2(Xtrain, Ytrain)
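As with `applytsne1`, the body of `applytsne2` is not shown here; a minimal sketch, assuming it runs scikit-learn's TSNE on the training data and scatter-plots the two resulting components (the labels are used only for coloring, never by t-SNE itself):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def applytsne2(Xtrain, Ytrain):
    # Hypothetical sketch: embed the data in two dimensions with t-SNE
    # and scatter-plot the result. Ytrain is used only to color the
    # points; TSNE itself never sees the labels (it is unsupervised).
    tsne = TSNE(n_components=2, random_state=42)
    X2d = tsne.fit_transform(Xtrain)
    plt.figure(figsize=(8, 6))
    scatter = plt.scatter(X2d[:, 0], X2d[:, 1], c=Ytrain, cmap="tab10", s=10)
    plt.legend(*scatter.legend_elements(), title="class")
    plt.xlabel("t-SNE component 1")
    plt.ylabel("t-SNE component 2")
    plt.show()
```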

Figure: scatter plot of the two t-SNE components, colored by class label.

The result of t-SNE is quite remarkable: all the classes are clearly separated. The 3s (GRASS) and 5s (SKY) are somewhat split up, but most of the classes each form a single dense group.

Note: the t-SNE method has no knowledge of the class labels; it is completely unsupervised. Still, it can find a two-dimensional representation of the data that clearly separates the classes, based solely on how close points are to each other in the original space.

The t-SNE algorithm has some tuning parameters, though it often works well with the default settings. You can try playing with perplexity and early_exaggeration, but the effects are usually minor.
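Both knobs are exposed directly on scikit-learn's TSNE; for illustration (the values below are scikit-learn's defaults, and `Xtrain` is assumed from earlier in this section):

```python
from sklearn.manifold import TSNE

# perplexity loosely controls the effective number of neighbors each point
# considers; early_exaggeration controls how tightly clusters are packed in
# the early iterations. The values below are scikit-learn's defaults.
tsne = TSNE(n_components=2, perplexity=30.0, early_exaggeration=12.0,
            random_state=42)
X2d = tsne.fit_transform(Xtrain)  # Xtrain as defined earlier in this section
```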