Dimensionality Reduction: PCA, t-SNE, and UMAP Explained

High-dimensional data is hard to work with. PCA, t-SNE, and UMAP each reduce it differently. Here is when to use each and how to avoid the curse of dimensionality.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 18, 2026

t-SNE: Non-Linear Visualization

t-SNE (t-distributed Stochastic Neighbor Embedding) is the standard tool for visualizing high-dimensional data in 2D or 3D. It captures non-linear structure that PCA misses.

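The algorithm works in two steps. First, it computes pairwise similarities between all points in the high-dimensional space: nearby points get high similarity, distant points get near-zero similarity. Second, it places the points in 2D (randomly, or from a PCA projection, which recent scikit-learn versions use by default) and iteratively adjusts their positions so that the low-dimensional similarities match the high-dimensional ones as closely as possible, measured by the Kullback-Leibler divergence.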

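import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# Example data (an assumption for illustration): 1,797 digit images, 64 features each
X, labels = load_digits(return_X_y=True)

# Embed into 2D; fixing random_state makes the layout repeatable
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_2d = tsne.fit_transform(X)

# Color each point by its class label to see whether clusters match classes
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap='tab10', s=5)
plt.title('t-SNE visualization')
plt.show()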
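The two steps described above can be sketched in plain NumPy. This is a simplified illustration, not scikit-learn's implementation: it assumes a single shared Gaussian bandwidth (`sigma`) instead of the per-point bandwidths that real t-SNE tunes to the perplexity, and it uses plain gradient descent with a fixed step size. The function names and the toy two-blob dataset are made up for this example.

```python
import numpy as np


def pairwise_sq_dists(Z):
    """Squared Euclidean distance between every pair of rows of Z."""
    sq_norms = np.sum(Z ** 2, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (Z @ Z.T)
    return np.maximum(d2, 0.0)  # clip tiny negatives caused by rounding


def high_dim_similarities(X, sigma=1.0):
    """Gaussian similarities over all pairs, normalized to sum to 1.

    Simplification: one shared sigma, where real t-SNE tunes a bandwidth
    per point to match the requested perplexity.
    """
    P = np.exp(-pairwise_sq_dists(X) / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    return P / P.sum()


def low_dim_kernel(Y):
    """Unnormalized Student-t (one degree of freedom) similarities."""
    W = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(W, 0.0)
    return W


def kl_divergence(P, Y):
    """KL(P || Q), where Q is the normalized low-dimensional kernel."""
    W = low_dim_kernel(Y)
    Q = W / W.sum()
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))


def kl_gradient(P, Y):
    """Gradient of KL(P || Q) with respect to the embedding Y."""
    W = low_dim_kernel(Y)
    Q = W / W.sum()
    A = (P - Q) * W
    # Row i equals 4 * sum_j A[i, j] * (Y[i] - Y[j])
    return 4.0 * (A.sum(axis=1)[:, None] * Y - A @ Y)


rng = np.random.default_rng(0)
# Toy data: two blobs of 30 points each in 5 dimensions
X = np.vstack([
    rng.normal(0.0, 0.5, size=(30, 5)),
    rng.normal(4.0, 0.5, size=(30, 5)),
])
P = high_dim_similarities(X)

Y = rng.normal(0.0, 1e-2, size=(60, 2))  # random starting layout
kl_before = kl_divergence(P, Y)
for _ in range(300):
    Y -= 0.5 * kl_gradient(P, Y)  # plain gradient descent, fixed step
kl_after = kl_divergence(P, Y)

print(f"KL before: {kl_before:.3f}, after: {kl_after:.3f}")
```

Production implementations add momentum, early exaggeration, and tree-based approximations on top of this objective, which is why the library version should be preferred for real work.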

t-SNE excels at revealing cluster structure. If your data has natural groupings, t-SNE will make them visually obvious. This is why t-SNE plots are ubiquitous in ML papers for showing that learned representations capture semantically meaningful structure.

t-SNE limitations:

Slow for large datasets: exact t-SNE has quadratic complexity, and even the Barnes-Hut approximation that scikit-learn uses by default can take minutes for 10,000 points and far longer for 100,000
Stochastic: different runs produce different layouts unless random_state is fixed
Perplexity hyperparameter (typically 5-50) strongly affects the output
Does not preserve global structure well: clusters may be positioned arbitrarily relative to each other, so distances between clusters should not be over-interpreted
Not suitable as a preprocessing step in ML pipelines: scikit-learn's TSNE has no transform method, so new data cannot be embedded without re-running on the full dataset
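Several of these limitations are easy to check directly. The sketch below is an illustration under assumed settings: it uses the first 300 samples of scikit-learn's digits dataset, random initialization, and arbitrary seeds. It embeds the same data at three perplexities, repeats one run with a different seed, and checks whether TSNE exposes a transform method.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# Small subset (an assumption, chosen to keep the runtime short)
X, _ = load_digits(return_X_y=True)
X_small = X[:300]

# Same data, three perplexity values: each produces a different layout
embeddings = {}
for perplexity in (5, 30, 50):
    tsne = TSNE(n_components=2, perplexity=perplexity,
                init="random", random_state=0)
    embeddings[perplexity] = tsne.fit_transform(X_small)

# Same perplexity, different seed: the layout changes again
other_seed = TSNE(n_components=2, perplexity=30,
                  init="random", random_state=1).fit_transform(X_small)
layouts_differ = not np.allclose(embeddings[30], other_seed)

# TSNE offers fit and fit_transform, but no transform for unseen data
has_transform = hasattr(TSNE(), "transform")

print("Layouts differ across seeds:", layouts_differ)
print("TSNE has a transform method:", has_transform)
```

Plotting the entries of `embeddings` side by side is a quick way to see how much a conclusion depends on the perplexity before putting a t-SNE figure in a report.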

When to use t-SNE:

Visualizing clusters and structure in data for exploration
Creating diagnostic plots in papers or presentations
Datasets up to ~50,000 points with compute budget for minutes of runtime

UMAP: Faster and Better Global Structure

UMAP (Uniform Manifold Approximation and Projection) is a newer method that addresses t-SNE's main weaknesses: speed and global structure preservation.

import umap

reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
X_2d = reducer.fit_transform(X)

# A fitted UMAP model can embed new data (X_new: samples with the same features as X) without re-fitting
X_new_2d = reducer.transform(X_new)

UMAP is typically much faster than t-SNE (often 10-100x on large datasets, depending on the data and implementation) because it uses approximate nearest neighbors rather than exact pairwise distances. It also tends to preserve global structure better: clusters that are genuinely far apart in high-dimensional space are more likely to end up far apart in the UMAP projection, although distances between clusters should still be read with caution.

UMAP key hyperparameters:

n_neighbors: controls local vs. global structure balance. Small (5-15): focuses on local structure, tighter clusters. Large (50-200): preserves more global structure.
min_dist: controls how tightly points cluster in the projection. Small (0.0-0.1): tight clusters. Large (0.5-1.0): spread-out, more uniform distribution.
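To build intuition for n_neighbors, it helps to look at the neighbor graph that UMAP starts from. The sketch below does not call UMAP at all: as an assumption for illustration, it uses scikit-learn's exact NearestNeighbors as a stand-in for UMAP's approximate search, and the digits dataset as example data. It measures how far each point has to reach, on average, to find its k-th neighbor.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neighbors import NearestNeighbors

X, _ = load_digits(return_X_y=True)


def mean_neighborhood_radius(X, n_neighbors):
    """Average distance from each point to its n_neighbors-th neighbor."""
    # Ask for one extra neighbor because each point is returned
    # as its own nearest neighbor at distance 0.
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X)
    distances, _ = nn.kneighbors(X)
    return float(distances[:, -1].mean())


radii = {k: mean_neighborhood_radius(X, k) for k in (5, 15, 50, 200)}
for k, radius in radii.items():
    print(f"n_neighbors={k:>3}: mean radius {radius:.2f}")
```

A larger n_neighbors means each point's neighborhood covers a wider region of the data, which is why the embedding then reflects broader structure at the cost of fine local detail.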

UMAP advantages over t-SNE:

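Often 10-100x faster, scales to millions of points
Better global structure preservation
Can transform new data without re-fitting (standard UMAP is not strictly parametric, but a fitted model exposes transform())
More reproducible (though still has randomness unless random_state is fixed)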
Also useful for preprocessing (reduce to 50 dimensions before clustering)

When to use UMAP:

Large datasets where t-SNE is too slow
When you need to transform new data after fitting (production use cases)
When global structure matters (understanding relationships between clusters)
As a preprocessing step before clustering (HDBSCAN + UMAP is a powerful combination)

Practical Decision Framework

For preprocessing before ML algorithms: Use PCA. It is fast, deterministic, preserves linear structure, and compresses data efficiently. Keep enough components to explain 90-95% of the variance, and standardize the features first if they are on different scales.
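The variance threshold can be applied in two equivalent ways, sketched here on scikit-learn's digits dataset (the dataset and the 0.95 threshold are assumptions for illustration). Passing a float between 0 and 1 as n_components asks PCA to keep just enough components to reach that fraction of variance; the manual version does the same with a cumulative sum.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)

# Option 1: let PCA choose the component count for 95% of the variance
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X)
explained = float(pca_95.explained_variance_ratio_.sum())

# Option 2: fit all components and find the cutoff manually
pca_full = PCA(svd_solver="full").fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
k_manual = int(np.searchsorted(cumulative, 0.95) + 1)

print(f"Components kept: {pca_95.n_components_} of {X.shape[1]}")
print(f"Variance explained: {explained:.3f}")
print(f"Manual cutoff: {k_manual}")
```

Plotting `cumulative` against the component index is also a useful check: a sharp elbow suggests a natural cutoff, while a slow, steady rise suggests the data has no low-dimensional linear structure to exploit.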

For 2D/3D visualization of small datasets (under ~50K): Use t-SNE. It produces the most visually clear cluster separations. Tune perplexity between 5 and 50.

For 2D/3D visualization of large datasets: Use UMAP. Faster and preserves global structure better.

For preprocessing before clustering: Use UMAP with n_components=10-50. Running HDBSCAN or k-means on UMAP-reduced data often produces better clusters than running directly on high-dimensional data.

For interpretability: Use PCA. Principal components are linear combinations of original features and can be inspected to understand what patterns they capture.
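Inspecting a component in practice means reading the rows of `components_`, where each entry is the weight of one original feature. The sketch below assumes scikit-learn's wine dataset and standardized features, both chosen for illustration, and lists the three features with the largest absolute weight in each of the first two components.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

wine = load_wine()
feature_names = list(wine.feature_names)

# Standardize so that no feature dominates just because of its units
X_scaled = StandardScaler().fit_transform(wine.data)
pca = PCA(n_components=2).fit(X_scaled)

top_features = {}
for i, component in enumerate(pca.components_):
    # Indices of the three largest absolute weights, largest first
    order = np.argsort(np.abs(component))[::-1][:3]
    top_features[i] = [(feature_names[j], float(component[j])) for j in order]

for i, pairs in top_features.items():
    summary = ", ".join(f"{name} ({weight:+.2f})" for name, weight in pairs)
    print(f"PC{i + 1}: {summary}")
```

The sign of a weight only has meaning relative to the other weights in the same component, since flipping the sign of a whole component describes the same direction.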

scikit-learn Practical Examples

from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

# PCA as preprocessing in a pipeline
pipeline = Pipeline([
    ('pca', PCA(n_components=50)),
    ('classifier', RandomForestClassifier(n_estimators=100))
])
pipeline.fit(X_train, y_train)
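Whether PCA actually helps a given model is an empirical question, and cross-validation is a simple way to check. The sketch below is an illustration under assumed settings: it uses scikit-learn's digits dataset, 30 components, and 3 folds. Because PCA sits inside the Pipeline, it is refit on each training fold, so no information from the held-out fold leaks into the projection.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)

# Same classifier with and without a PCA step in front of it
with_pca = Pipeline([
    ('pca', PCA(n_components=30)),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=0)),
])
without_pca = RandomForestClassifier(n_estimators=100, random_state=0)

scores_with = cross_val_score(with_pca, X, y, cv=3)
scores_without = cross_val_score(without_pca, X, y, cv=3)

print(f"With PCA:    {scores_with.mean():.3f}")
print(f"Without PCA: {scores_without.mean():.3f}")
```

Tree-based models are often indifferent to PCA, or slightly hurt by it, while distance-based and linear models tend to benefit more, so the comparison is worth repeating for each model family you consider.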

Dimensionality reduction is not always necessary. If your dataset has 20 features and 100,000 examples, you are unlikely to benefit from it. As a rule of thumb, the curse of dimensionality becomes practically relevant above roughly 100 features, or when the feature count exceeds the sample count. In those situations, PCA as a preprocessing step is usually worth trying.
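One concrete symptom of the curse of dimensionality is distance concentration: as the number of features grows, the nearest and farthest neighbors of a point end up almost equally far away. The sketch below illustrates this on uniformly random data, which is an assumption made for illustration (real datasets usually have more structure). The helper name `relative_contrast` is made up for this example.

```python
import numpy as np


def relative_contrast(n_points, n_dims, rng):
    """(farthest - nearest) / nearest distance from one random query point."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return float((distances.max() - distances.min()) / distances.min())


rng = np.random.default_rng(0)
contrasts = {d: relative_contrast(1000, d, rng) for d in (2, 20, 200, 2000)}

for n_dims, contrast in contrasts.items():
    print(f"{n_dims:>4} dimensions: relative contrast {contrast:.2f}")
```

When the contrast gets close to zero, "nearest neighbor" stops carrying much information, which is why distance-based methods such as k-NN and k-means tend to degrade first in high dimensions.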

Keep Reading

Machine Learning Complete Guide for Software Developers -- how dimensionality reduction fits in the broader ML pipeline
Anomaly Detection Practical Guide -- UMAP is commonly used to visualize anomaly detection results
Feature Engineering Practical Guide -- dimensionality reduction complements feature engineering for high-dimensional problems
