t-Distributed Stochastic Neighbor Embedding (t-SNE) for Non-Linear Dimensionality Reduction
====================================================================================
Introduction
--------------
t-Distributed Stochastic Neighbor Embedding (t-SNE) is a non-linear dimensionality reduction technique used to visualize high-dimensional data in a lower-dimensional space, typically 2D or 3D. It is particularly useful for understanding the structure of complex data sets.
How t-SNE Works
-------------------
t-SNE uses a non-linear approach to map high-dimensional data points to a lower-dimensional space. The goal is to preserve the local structure of the data, where similar data points are mapped close to each other in the lower-dimensional space.
The t-SNE algorithm consists of the following steps:
1. Calculate the similarity between data points: t-SNE models the similarity between each pair of high-dimensional points with a Gaussian distribution centered on each point.
2. Compute the pairwise similarities: these similarities are computed for all pairs of points and collected into a joint similarity matrix.
3. Map to the lower-dimensional space: similarities between the low-dimensional points are modeled with a heavy-tailed t-distribution, and the embedding is optimized so that the two sets of similarities agree.
t-SNE Algorithm
-------------------
The t-SNE algorithm can be summarized as follows:
Input: a high-dimensional data set `X` with `n` samples and `d` features.
Output: a lower-dimensional representation `Y` with `n` samples and `p` features (typically 2 or 3).
The t-SNE algorithm uses the following steps:
Compute the similarity matrix `P`. Conditional similarities are modeled with a Gaussian distribution and normalized over each point's neighbors:
`P(j | i) = exp(-||x_i - x_j||^2 / (2 sigma_i^2)) / sum_{k != i} exp(-||x_i - x_k||^2 / (2 sigma_i^2))`
where `x_i` and `x_j` are data points and `sigma_i` is a per-point bandwidth, chosen so that the distribution over the neighbors of `x_i` matches a user-specified perplexity. The joint similarities are then symmetrized: `P(i, j) = (P(j | i) + P(i | j)) / (2n)`.
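The Gaussian affinity computation can be sketched in NumPy as follows. For simplicity this sketch assumes a single fixed `sigma` for all points; a full t-SNE implementation instead tunes `sigma_i` per point via binary search on the perplexity.

```python
import numpy as np

def gaussian_affinities(X, sigma=1.0):
    """Symmetrized Gaussian affinities P for t-SNE.

    Assumption: one fixed bandwidth `sigma` for all points (real t-SNE
    finds a per-point sigma_i that matches a target perplexity).
    """
    n = X.shape[0]
    # Pairwise squared Euclidean distances, shape (n, n).
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    logits = -sq_dists / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)            # exclude self-similarity
    P_cond = np.exp(logits)
    P_cond /= P_cond.sum(axis=1, keepdims=True)  # row-normalize: P(j | i)
    return (P_cond + P_cond.T) / (2.0 * n)       # symmetrize: P(i, j)
```

The result is a symmetric matrix with zero diagonal whose entries sum to 1, so it can be treated as a joint probability distribution over pairs.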
Compute the low-dimensional similarities `Q` using a Student t-distribution with one degree of freedom, normalized over all pairs:
`Q(i, j) = (1 + ||y_i - y_j||^2)^(-1) / sum_{k != l} (1 + ||y_k - y_l||^2)^(-1)`
where `y_i` and `y_j` are the lower-dimensional representations of `x_i` and `x_j`, respectively. The heavy tails of the t-distribution allow moderately distant points in the high-dimensional space to be placed far apart in the embedding, which alleviates crowding.
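A minimal sketch of the `Q` computation, following the formula above:

```python
import numpy as np

def student_t_affinities(Y):
    """Low-dimensional affinities Q using a Student-t kernel (1 d.o.f.)."""
    # Pairwise squared distances in the embedding, shape (n, n).
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    inv = 1.0 / (1.0 + sq_dists)   # heavy-tailed kernel (1 + d^2)^(-1)
    np.fill_diagonal(inv, 0.0)     # Q(i, i) = 0
    return inv / inv.sum()         # normalize over all pairs
```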
Minimize the Kullback-Leibler (KL) divergence between `P` and `Q`:
`L = KL(P || Q) = sum_{i != j} P(i, j) log(P(i, j) / Q(i, j))`
The KL divergence measures how much the low-dimensional similarities `Q` diverge from the high-dimensional similarities `P`. Because it is asymmetric, it penalizes placing similar points (large `P(i, j)`) far apart (small `Q(i, j)`) much more heavily than the reverse, which is why t-SNE preserves local structure.
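The loss can be computed directly from the two affinity matrices; the small `eps` below is a numerical guard I'm adding to avoid `log(0)`:

```python
import numpy as np

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q) summed over entries where P > 0.

    `eps` is a numerical-stability guard, not part of the formula.
    """
    mask = P > 0
    return float(np.sum(P[mask] * np.log((P[mask] + eps) / (Q[mask] + eps))))
```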
Update the lower-dimensional representations `Y` using gradient descent:
`Y = Y - alpha dL/dY`
where `alpha` is the learning rate and `dL/dY` is the gradient of the loss with respect to `Y`. In practice, implementations also add a momentum term and an early-exaggeration phase to speed up and stabilize convergence.
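A single update step can be sketched using the closed-form t-SNE gradient, `dL/dy_i = 4 sum_j (P(i,j) - Q(i,j)) (y_i - y_j) / (1 + ||y_i - y_j||^2)`. The learning rate here is illustrative only; momentum and adaptive gains are omitted for brevity.

```python
import numpy as np

def tsne_step(Y, P, alpha=1.0):
    """One plain gradient-descent update Y <- Y - alpha * dL/dY.

    Sketch under simplifying assumptions: a fixed, illustrative learning
    rate `alpha`, no momentum, no early exaggeration.
    """
    # Recompute low-dimensional affinities Q for the current embedding.
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    inv = 1.0 / (1.0 + sq_dists)
    np.fill_diagonal(inv, 0.0)
    Q = inv / inv.sum()
    # W_ij = (P_ij - Q_ij) * (1 + ||y_i - y_j||^2)^(-1)
    W = (P - Q) * inv
    # grad_i = 4 * [ (sum_j W_ij) y_i - sum_j W_ij y_j ]
    grad = 4.0 * (np.diag(W.sum(axis=1)) - W) @ Y
    return Y - alpha * grad
```

Iterating this step from a small random initialization drives `Q` toward `P`, lowering the KL divergence.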
Example Use Case: Iris Dataset
-------------------------------
The Iris dataset is a classic example of a high-dimensional data set that can be visualized using t-SNE. The dataset consists of 150 samples from three species of iris flowers (Iris setosa, Iris versicolor, and Iris virginica), each described by 4 features (sepal length, sepal width, petal length, and petal width).
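In practice one would use a library implementation rather than hand-rolled code; a minimal sketch with scikit-learn's `TSNE` on Iris (the `perplexity` and `random_state` values below are illustrative choices):

```python
# Embed the 4-feature Iris data into 2D with scikit-learn's t-SNE.
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

X, y = load_iris(return_X_y=True)  # X: (150, 4), y: species labels 0-2
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
Y = tsne.fit_transform(X)          # Y: (150, 2) embedding
print(Y.shape)
```

Plotting `Y` colored by `y` typically shows Iris setosa as a well-separated cluster, with versicolor and virginica closer together, mirroring their known overlap in feature space.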