The t-Distributed Stochastic Neighbor Embedding (t-SNE) is an unsupervised, non-linear technique for data exploration and visualization of high-dimensional data. In simplest terms, t-SNE provides an impression or intuition of how data is organized in a high-dimensional space.

It is a statistical method for visualizing high-dimensional data by assigning each data point a location in a two- or three-dimensional map. Using nonlinear dimensionality reduction, it represents each high-dimensional object as a point in the low-dimensional space such that, with high probability, nearby points correspond to similar objects and distant points to dissimilar ones.

t-SNE process

t-SNE looks for patterns in the data based on how similar the data points are in terms of their features. The similarity of two points is expressed as a conditional probability: the probability that point A would pick point B as its neighbour if neighbours were chosen in proportion to a probability distribution centred on A. t-SNE then minimizes the mismatch between these conditional probabilities (similarities) in the high-dimensional and low-dimensional spaces, so the low-dimensional map represents the data points as faithfully as possible.
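The neighbour-choosing probability can be made concrete with a small sketch. This is our illustration, not part of the original article's code, and it uses a hypothetical fixed bandwidth `sigma`; real t-SNE tunes the bandwidth per point to match a target perplexity.

```python
import numpy as np

def conditional_probabilities(X, sigma=1.0):
    # p_{j|i}: probability that point i picks point j as its neighbour
    # under a Gaussian centred on i (fixed bandwidth for simplicity).
    # Squared Euclidean distances between all pairs of points.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Unnormalised Gaussian affinities; a point never picks itself.
    affinities = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(affinities, 0.0)
    # Normalise each row so the probabilities over j sum to 1.
    return affinities / affinities.sum(axis=1, keepdims=True)

# Two nearby points and one far-away point: the nearby pair assign
# each other much higher neighbour probability than the outlier.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
P = conditional_probabilities(X)
```

t-SNE compares a matrix like `P` (built in the original space) against the analogous matrix built in the low-dimensional map and moves the map points to reduce the difference.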

t-SNE Python Code Implementation on MNIST Dataset

# Importing necessary modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

Now load the MNIST dataset into a pandas DataFrame. The dataset is available for download here.

# Reading the data using pandas
df = pd.read_csv('mnist_train.csv')
# Print the first five rows of df
print(df.head())
# Save the labels into a variable called labels
labels = df['label']
# Drop the label column and
# store the pixel data in data
data = df.drop("label", axis=1)

We must standardize the data before applying t-SNE. Standardization puts every pixel feature on a comparable scale, so the pairwise distances from which t-SNE builds its similarities are not dominated by features with large ranges.

# Data-preprocessing: standardizing the data
standardized_data = StandardScaler().fit_transform(data)
print(standardized_data.shape)

Let's now reduce the 784-column data to two dimensions so that we can visualize it with a scatter plot.

# Picking the first 1000 points, as t-SNE
# takes a long time on the full dataset
data_1000 = standardized_data[0:1000, :]
labels_1000 = labels[0:1000]

# Configuring the parameters:
# n_components = 2 gives a 2-D embedding; perplexity,
# learning rate, and the number of optimization iterations
# are left at their scikit-learn defaults
model = TSNE(n_components=2, random_state=0)
tsne_data = model.fit_transform(data_1000)

# Creating a new data frame, which helps us plot the result
tsne_data = np.vstack((tsne_data.T, labels_1000)).T
tsne_df = pd.DataFrame(data=tsne_data,
                       columns=("Dim_1", "Dim_2", "label"))

# Plotting the result of t-SNE
import seaborn as sns
sns.scatterplot(data=tsne_df, x='Dim_1', y='Dim_2',
                hue='label', palette="bright")
plt.show()

Python code source: Geeksforgeeks

Conclusion

Numerous fields, including genetics, computer security research, natural language processing, music analysis, cancer research, bioinformatics, geological domain interpretation, and biological signal processing, have adopted t-SNE for visualization.

Although t-SNE plots frequently appear to show clusters, these visual clusters can be heavily influenced by the chosen parameters, so a thorough grasp of the t-SNE parameters is required. Such "clusters" may be spurious conclusions: they have been shown to appear even in data with no cluster structure. At the same time, t-SNE frequently does recover well-separated clusters and, with particular parameter choices, approximates a simple form of spectral clustering.
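To make this parameter sensitivity concrete, the sketch below (our illustration, not part of the original article) embeds pure Gaussian noise at three different perplexities. Each run produces a different-looking map even though the data contain no real clusters.

```python
import numpy as np
from sklearn.manifold import TSNE

# Deliberately structure-free data: 200 points of Gaussian noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))

# Fit t-SNE at several perplexities and keep each embedding.
embeddings = {}
for perplexity in (5, 30, 50):
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    embeddings[perplexity] = tsne.fit_transform(X)
    print(perplexity, embeddings[perplexity].shape)
```

Plotting each embedding side by side (as in the scatter-plot code above) shows how much the apparent grouping depends on perplexity, which is why clusters in a single t-SNE plot should not be taken at face value.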
