t-SNE (t-distributed Stochastic Neighbor Embedding) Explained
t-SNE is a nonlinear dimensionality reduction algorithm that converts high-dimensional data into 2D or 3D visualizations while preserving local similarity structure.
The output space is typically:
- 2D
- 3D
It is widely used for:
- clustering visualization
- embedding visualization
- feature space analysis
- latent space exploration
Core Idea
t-SNE preserves:
- local structure
- neighborhood similarity
Points that are close in high-dimensional space remain close in lower-dimensional space.
t-SNE Visualization Example
flowchart LR
A1[Cat Images]
A2[Dog Images]
A3[Car Images]
A1 --> B[t-SNE Projection]
A2 --> B
A3 --> B
B --> C[Clustered 2D Visualization]
Why t-SNE is Needed
High-dimensional data is difficult to visualize directly.
Examples:
- Word embeddings
- Image embeddings
- Transformer hidden states
- Feature vectors
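t-SNE converts these high-dimensional vectors into 2D or 3D points that can be plotted.
In the low-dimensional space, t-SNE uses a heavy-tailed distribution (the Student t-distribution) to:
- avoid the crowding problem, where moderately distant points get squeezed together in the map
- separate distant clusters better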
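To make the heavy-tail argument concrete, the sketch below compares an unnormalized Gaussian kernel with an unnormalized Student t kernel (one degree of freedom) at a few distances. The function names and the unit bandwidth are illustrative assumptions, not part of any library.

```python
import math

def gaussian_kernel(d, sigma=1.0):
    """Unnormalized Gaussian similarity at distance d (sigma is an assumed bandwidth)."""
    return math.exp(-d * d / (2.0 * sigma * sigma))

def student_t_kernel(d):
    """Unnormalized Student t similarity (one degree of freedom) at distance d."""
    return 1.0 / (1.0 + d * d)

# Print both kernels side by side for increasing distances.
for d in (0.5, 1.0, 3.0, 5.0):
    print(f"d={d}: gaussian={gaussian_kernel(d):.6f}  student_t={student_t_kernel(d):.6f}")
```

At larger distances the Gaussian value collapses toward zero far faster than the Student t value, which is why the low-dimensional map can place dissimilar points well apart without squeezing everything into the center.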
Applications of t-SNE
- NLP Embeddings
- Image Feature Visualization
- Transformer Embeddings
- Clustering Analysis
- Latent Space Visualization
- Anomaly Detection
High-Level Workflow
flowchart TD
A[High-Dimensional Data] --> B[Compute Pairwise Similarities]
B --> C[Convert to Probability Distribution]
C --> D[t-SNE Optimization]
D --> E[2D or 3D Embedding]
E --> F[Visualization]
Example
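Suppose each image is represented by a high-dimensional feature vector:
$$x_i \in \mathbb{R}^D$$
t-SNE reduces it into a low-dimensional point:
$$y_i \in \mathbb{R}^2 \quad \text{(or } \mathbb{R}^3\text{)}$$
for visualization.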
Step 1: Similarity in High-Dimensional Space
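t-SNE converts pairwise distances into probabilities that express how similar two points are.
Probability that point $x_j$ is a neighbor of $x_i$:
$$p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}$$
The conditional probabilities are then symmetrized into joint probabilities over $N$ points:
$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}$$
Where:
- $x_i$ = data point
- $\sigma_i^2$ = variance parameter of the Gaussian centered on $x_i$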
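The high-dimensional similarity in Step 1 (a Gaussian kernel on squared distances, normalized over all other points) can be sketched in plain Python. This is a minimal illustration that assumes a single shared `sigma` for every point; t-SNE itself tunes a separate value per point. The function name and toy data are hypothetical.

```python
import math

def conditional_probabilities(points, sigma=1.0):
    """Return P where P[i][j] approximates p_{j|i} with a shared Gaussian bandwidth."""
    n = len(points)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        weights = []
        for j in range(n):
            if i == j:
                weights.append(0.0)  # a point is not counted as its own neighbor
            else:
                sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                weights.append(math.exp(-sq_dist / (2.0 * sigma ** 2)))
        total = sum(weights)
        P[i] = [w / total for w in weights]  # normalize so each row sums to 1
    return P

# Hypothetical toy data: two nearby points and one distant point.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
P = conditional_probabilities(points)
for row in P:
    print([round(p, 4) for p in row])
```

Each row is a probability distribution over the other points, and the nearby point receives almost all of the probability mass.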
Step 2: Similarity in Low-Dimensional Space
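Low-dimensional similarity uses a Student t-distribution with one degree of freedom:
$$q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}}$$
Where:
- $y_i$ = low-dimensional embedding of point $x_i$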
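The low-dimensional similarity in Step 2 can be sketched the same way: a Student t kernel (one degree of freedom) on squared distances between embedded points, normalized over all pairs at once. The function name and the toy embedding are hypothetical.

```python
def low_dim_similarities(embedding):
    """Return Q where Q[i][j] is the Student t similarity, normalized over all pairs."""
    n = len(embedding)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                sq_dist = sum((a - b) ** 2 for a, b in zip(embedding[i], embedding[j]))
                W[i][j] = 1.0 / (1.0 + sq_dist)  # heavy-tailed kernel
    total = sum(sum(row) for row in W)  # one normalizer shared by every pair
    return [[w / total for w in row] for row in W]

# Hypothetical 2D embedding of three points.
embedding = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
Q = low_dim_similarities(embedding)
for row in Q:
    print([round(q, 4) for q in row])
```

Unlike the per-point normalization used in the high-dimensional step, the whole matrix sums to 1 here, so it is a single joint distribution over pairs.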
Optimization Objective
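t-SNE minimizes the divergence between:
- the high-dimensional similarities $p_{ij}$
- the low-dimensional similarities $q_{ij}$
Using KL divergence:
$$C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$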
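A small sketch of the KL divergence objective, assuming both similarity sets are given as square matrices whose off-diagonal entries sum to 1. The matrices below are made-up toy values, and the `eps` guard is an assumption added to avoid dividing by zero.

```python
import math

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q) summed over all off-diagonal pairs; eps guards against Q entries of 0."""
    total = 0.0
    n = len(P)
    for i in range(n):
        for j in range(n):
            if i != j and P[i][j] > 0.0:
                total += P[i][j] * math.log(P[i][j] / max(Q[i][j], eps))
    return total

# Hypothetical joint similarities for 3 points: points 0 and 1 are strong neighbors.
P = [[0.00, 0.40, 0.05],
     [0.40, 0.00, 0.05],
     [0.05, 0.05, 0.00]]

# A layout that ignores the neighbor structure: every pair equally similar.
Q_uniform = [[0.0 if i == j else 1.0 / 6.0 for j in range(3)] for i in range(3)]

print("perfect match:", kl_divergence(P, P))
print("uniform layout:", kl_divergence(P, Q_uniform))
```

The loss is zero only when the low-dimensional similarities reproduce the high-dimensional ones, and it grows as the layout stops reflecting which points are neighbors.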
Optimization Flow
flowchart TD
A[High-Dimensional Similarities P] --> C[KL Divergence Loss]
B[Low-Dimensional Similarities Q] --> C
C --> D[Gradient Descent]
D --> E[Updated Embeddings]
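The loop in the flow above can be sketched with plain gradient descent. The gradient of the KL loss with respect to an embedded point is $\frac{\partial C}{\partial y_i} = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j)(1 + \|y_i - y_j\|^2)^{-1}$. This is a minimal illustration on hypothetical toy values: it assumes a fixed learning rate and leaves out the momentum and early exaggeration that practical implementations typically add.

```python
import math

def q_matrix(Y):
    """Return (Q, W): normalized and unnormalized Student t similarities."""
    n = len(Y)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(Y[i], Y[j]))
                W[i][j] = 1.0 / (1.0 + d2)
    total = sum(map(sum, W))
    return [[w / total for w in row] for row in W], W

def kl(P, Q):
    """KL(P || Q) over off-diagonal pairs."""
    n = len(P)
    return sum(P[i][j] * math.log(P[i][j] / Q[i][j])
               for i in range(n) for j in range(n)
               if i != j and P[i][j] > 0.0)

def gradient_step(P, Y, learning_rate=0.1):
    """One plain gradient descent update of every embedded point."""
    Q, W = q_matrix(Y)
    n, dim = len(Y), len(Y[0])
    new_Y = []
    for i in range(n):
        grad = [0.0] * dim
        for j in range(n):
            if i != j:
                coeff = 4.0 * (P[i][j] - Q[i][j]) * W[i][j]
                for k in range(dim):
                    grad[k] += coeff * (Y[i][k] - Y[j][k])
        new_Y.append([Y[i][k] - learning_rate * grad[k] for k in range(dim)])
    return new_Y

# Hypothetical joint similarities: (0, 1) and (2, 3) are neighbor pairs.
P = [[0.000, 0.200, 0.025, 0.025],
     [0.200, 0.000, 0.025, 0.025],
     [0.025, 0.025, 0.000, 0.200],
     [0.025, 0.025, 0.200, 0.000]]

# Arbitrary fixed starting layout, so the run is reproducible.
Y = [[0.0, 0.0], [1.0, 0.5], [0.2, 1.0], [1.2, -0.3]]

initial_loss = kl(P, q_matrix(Y)[0])
for _ in range(1000):
    Y = gradient_step(P, Y)
final_loss = kl(P, q_matrix(Y)[0])

print("loss before:", initial_loss)
print("loss after: ", final_loss)
```

Pairs with $p_{ij} > q_{ij}$ attract each other and pairs with $p_{ij} < q_{ij}$ repel, which is how neighbors are pulled together while non-neighbors are pushed apart.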
Important Hyperparameters
| Parameter | Purpose |
|---|---|
| Perplexity | Controls neighborhood size |
| Learning Rate | Optimization step size |
| Iterations | Number of optimization steps |
| Dimensions | Output dimension (2D/3D) |
Perplexity
Perplexity balances:
- local structure
- global structure
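Typical values range from 5 to 50. Lower values emphasize very local structure; higher values take more neighbors into account.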
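Perplexity is commonly defined as $2^{H}$, where $H$ is the Shannon entropy (in bits) of a point's neighbor distribution, so it can be read as an effective number of neighbors. The sketch below finds the Gaussian bandwidth that produces a target perplexity for one point by bisection. The function names, search bounds, and toy distances are assumptions for illustration.

```python
import math

def row_probabilities(sq_dists, sigma):
    """Neighbor probabilities for one point from squared distances to the other points."""
    d_min = min(sq_dists)  # shift for numerical stability; cancels out after normalizing
    weights = [math.exp(-(d - d_min) / (2.0 * sigma ** 2)) for d in sq_dists]
    total = sum(weights)
    return [w / total for w in weights]

def perplexity(probs):
    """2 ** (Shannon entropy in bits)."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0.0)
    return 2.0 ** entropy

def find_sigma(sq_dists, target, lo=1e-3, hi=1e3, iters=60):
    """Bisection in log space: a wider bandwidth gives a higher perplexity."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if perplexity(row_probabilities(sq_dists, mid)) > target:
            hi = mid  # too many effective neighbors, shrink the bandwidth
        else:
            lo = mid  # too few effective neighbors, widen the bandwidth
    return math.sqrt(lo * hi)

# Hypothetical squared distances from one point to six others.
sq_dists = [1.0, 4.0, 9.0, 16.0, 25.0, 36.0]
sigma = find_sigma(sq_dists, target=3.0)
print("sigma:", sigma)
print("perplexity:", perplexity(row_probabilities(sq_dists, sigma)))
```

Because the search runs per point, points in dense regions end up with a small bandwidth and points in sparse regions with a larger one, so every point has roughly the same effective number of neighbors.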
Advantages
- Excellent visualization quality
- Preserves local neighborhoods
- Works well with embeddings
- Reveals hidden clusters
Limitations
| Limitation | Description |
|---|---|
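| Computationally expensive | Slow on large datasets; the exact algorithm scales quadratically with the number of points |
| Not deterministic | Results vary between runs unless the random seed is fixed |
| Poor global distance preservation | Distances between far-apart clusters and relative cluster sizes are not reliable |
| Primarily a visualization tool | Not ideal as input features for downstream ML |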
t-SNE vs PCA
| PCA | t-SNE |
|---|---|
| Linear reduction | Nonlinear reduction |
| Fast | Slower |
| Preserves variance | Preserves neighborhoods |
| Good for preprocessing | Good for visualization |
PCA + t-SNE Pipeline
Common workflow:
flowchart TD
A[High-Dimensional Data]
A --> B[PCA Reduction]
B --> C[t-SNE]
C --> D[2D Visualization]
Running PCA first (often down to a few dozen components) reduces noise and dimensionality, which also speeds up the pairwise distance computations in t-SNE.
t-SNE vs UMAP
| t-SNE | UMAP |
|---|---|
| Emphasizes local structure | Often retains more global structure |
| Slower | Faster |
| Scales poorly to very large datasets | More scalable |
| Widely used historically | Increasingly popular |
