Dimensionality Reduction: PCA, t-SNE, and UMAP Explained
High-dimensional data is hard to work with. PCA, t-SNE, and UMAP each reduce it differently. Here is when to use each and how to avoid the curse of dimensionality.
Modern datasets often have hundreds or thousands of features. A dataset of customer behavior might have 500 numeric features. A dataset of text documents represented as word counts might have 50,000 features. An image dataset, even at modest resolution, has thousands of pixel features per image.
High-dimensional data creates problems: visualization becomes impossible, distance metrics become unreliable, algorithms slow down, and overfitting becomes more likely. Dimensionality reduction addresses these problems by finding a lower-dimensional representation that preserves the structure that matters.
The Curse of Dimensionality
Before the algorithms, the problem they solve: in high-dimensional spaces, our geometric intuitions break down completely.
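In two dimensions, most of a unit square lies well away from its edges: only about 19% of uniformly random points fall within 0.05 of the boundary. In 100 dimensions, more than 99.99% of the points in a unit hypercube do. Volume concentrates in a thin shell near the boundary, not in the core.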
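The concentration of volume near the boundary can be checked directly. The sketch below assumes a unit hypercube and a hypothetical margin `eps` of 0.05 (neither the margin nor the function names come from the article); it compares the closed-form fraction with a Monte Carlo estimate.

```python
import numpy as np

def boundary_fraction(d, eps=0.05):
    """Exact fraction of the unit hypercube [0, 1]^d within eps of its boundary."""
    # A point is in the "core" only if every coordinate lies in [eps, 1 - eps].
    return 1.0 - (1.0 - 2.0 * eps) ** d

def boundary_fraction_mc(d, eps=0.05, n_points=20_000, seed=0):
    """Monte Carlo estimate of the same fraction from uniform random points."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, d))
    near_boundary = ((points < eps) | (points > 1.0 - eps)).any(axis=1)
    return near_boundary.mean()

for d in (2, 10, 100):
    print(d, boundary_fraction(d), boundary_fraction_mc(d))
```

The exact value grows as 1 - (1 - 2*eps)^d, so even a thin margin swallows nearly all of the volume once d is large.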
Distance metrics suffer: in high dimensions, the maximum and minimum distances between random points become nearly equal. Everything is approximately equidistant from everything else. This kills algorithms that rely on distance-based similarity (k-nearest neighbors, k-means clustering, kernel SVMs) because "near" and "far" lose meaning.
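A small experiment makes the loss of contrast visible. This is a sketch under stated assumptions: points are drawn uniformly from the unit hypercube, distances are Euclidean, and `relative_contrast` is a hypothetical helper name, not a library function.

```python
import numpy as np

def relative_contrast(d, n_points=200, seed=0):
    """(max - min) / min over all pairwise Euclidean distances of random points."""
    rng = np.random.default_rng(seed)
    X = rng.random((n_points, d))
    # Squared pairwise distances from the Gram matrix: |a|^2 + |b|^2 - 2 a.b
    sq_norms = (X ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    # Keep each unordered pair once and guard against tiny negative rounding errors.
    upper = np.triu_indices(n_points, k=1)
    dists = np.sqrt(np.maximum(sq_dists[upper], 0.0))
    return (dists.max() - dists.min()) / dists.min()

for d in (2, 10, 100, 1000):
    print(d, relative_contrast(d))
```

In low dimensions the farthest pair is many times farther apart than the nearest pair. As the dimension grows the ratio shrinks toward zero, which is the sense in which "near" and "far" lose meaning.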
A related problem: with 1,000 features and 1,000 training examples, even a plain linear model has as many parameters as there are data points and can fit the training set exactly. Overfitting is nearly guaranteed without heavy regularization.
Dimensionality reduction addresses all of these by finding a low-dimensional representation (typically 2, 3, 10, or 50 dimensions) that preserves the structure relevant to your task.
Principal Component Analysis (PCA): Linear Projection
PCA is the simplest and most widely used dimensionality reduction method. It finds the directions in feature space that explain the most variance in the data, then projects the data onto those directions.
Concretely: the first principal component (PC1) is the direction along which the data varies most. PC2 is the direction of second-highest variance that is orthogonal (perpendicular) to PC1. And so on.
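To make the definition concrete, here is a minimal NumPy sketch of PCA computed from the singular value decomposition of the centered data. The helper name `pca_svd` and the toy data are assumptions for illustration; scikit-learn's PCA is the practical choice.

```python
import numpy as np

def pca_svd(X, n_components):
    """Return (scores, components, explained_variance_ratio) for centered X."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are orthonormal directions ordered by decreasing variance.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = S ** 2 / (X.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()
    return X_centered @ components.T, components, ratio[:n_components]

rng = np.random.default_rng(0)
# Toy data: independent axes with standard deviations 5, 2, 0.5, then rotated.
Z = rng.normal(size=(500, 3)) * np.array([5.0, 2.0, 0.5])
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
X_demo = Z @ Q.T

scores, components, ratio = pca_svd(X_demo, n_components=2)
print(ratio)                        # share of variance along PC1 and PC2
print(components @ components.T)    # close to the 2x2 identity: PCs are orthogonal
```

PC1 recovers (up to sign) the rotated axis with the largest spread, and PC2 the next largest, which is the ordering described above.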
from sklearn.decomposition import PCA
# X: array of shape (n_samples, n_features), assumed to be already loaded
pca = PCA(n_components=50)  # Reduce to 50 dimensions
X_reduced = pca.fit_transform(X)
# How much variance is explained?
print(pca.explained_variance_ratio_.cumsum())
# Illustrative output: [0.12, 0.23, 0.33, ..., 0.95]
# In this example, 50 components explain 95% of the variance
The cumulative explained variance ratio tells you how much of the data's variance you retained. If 50 components explain 95% of the variance in a 500-feature dataset, you have cut the feature count by 90% while retaining 95% of the variance, which is the usual proxy for information here.
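Rather than guessing the component count, scikit-learn's PCA accepts a float between 0 and 1 as `n_components` and keeps just enough components to reach that share of variance. The sketch below uses synthetic low-rank data (an assumption for illustration, as are the variable names).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Toy data: 100 observed features driven by 10 latent factors plus small noise.
latent = rng.normal(size=(1000, 10))
mixing = rng.normal(size=(10, 100))
X_toy = latent @ mixing + 0.1 * rng.normal(size=(1000, 100))

# A float in (0, 1) means "keep enough components to explain this much variance".
pca_95 = PCA(n_components=0.95, svd_solver="full")
X_toy_reduced = pca_95.fit_transform(X_toy)

print(pca_95.n_components_)                    # number of components kept
print(pca_95.explained_variance_ratio_.sum())  # at least 0.95
```

On real data, standardizing features first is often advisable, since PCA is sensitive to feature scale.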
PCA use cases:
Preprocessing before feeding data to another algorithm (reduces dimensions, speeds up training, reduces overfitting)
Visualizing data in 2D or 3D (plot PC1 vs. PC2 to see structure)
Noise reduction (low-variance components often capture noise; dropping them removes noise)
Feature compression for storage or transmission
PCA limitations: PCA is linear. It can only find linear structure. If your data lies on a curved manifold in high-dimensional space, PCA cannot capture that structure. Two clusters that overlap along every linear direction but are separable by a curved boundary will look mixed in a PCA projection.
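A toy case shows the limitation. The sketch assumes two noisy concentric rings in 2D (hypothetical data) and projects them onto the first principal component: the inner ring lands inside the range of the outer ring, while a simple non-linear feature, the distance from the origin, separates them cleanly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
angles = rng.uniform(0.0, 2.0 * np.pi, size=2 * n)
# Inner ring at radius 1, outer ring at radius 3, with a little radial noise.
radii = np.concatenate([np.full(n, 1.0), np.full(n, 3.0)])
radii = radii + rng.normal(scale=0.05, size=2 * n)
X_rings = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
ring_labels = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])

# One-dimensional PCA projection via SVD of the centered data.
X_centered = X_rings - X_rings.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
pc1 = X_centered @ Vt[0]

inner, outer = pc1[ring_labels == 0], pc1[ring_labels == 1]
# The inner ring's projected range sits entirely inside the outer ring's range.
mixed_on_pc1 = (outer.min() < inner.min()) and (inner.max() < outer.max())

# A non-linear feature (radius) separates the rings with a single threshold.
radius = np.linalg.norm(X_rings, axis=1)
separable_by_radius = radius[ring_labels == 0].max() < radius[ring_labels == 1].min()

print(mixed_on_pc1, separable_by_radius)
```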
When to use PCA:
General-purpose preprocessing for ML pipelines
When you need a fast, deterministic, reproducible reduction
When linear relationships are sufficient for your downstream task
When interpretability matters (principal components are linear combinations of original features)
t-SNE: Non-linear Visualization
t-SNE (t-distributed Stochastic Neighbor Embedding) is the standard tool for visualizing high-dimensional data in 2D or 3D. It captures non-linear structure that PCA misses.
The algorithm works by: computing pairwise similarities between all points in high-dimensional space (nearby points get high similarity, far points get near-zero similarity), then placing points randomly in 2D space and iteratively adjusting their positions to match the high-dimensional similarity structure as closely as possible.
t-SNE excels at revealing cluster structure. If your data has natural groupings, t-SNE will make them visually obvious. This is why t-SNE plots are ubiquitous in ML papers for showing that learned representations capture semantically meaningful structure.
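A minimal scikit-learn sketch of t-SNE on toy data follows. The three Gaussian blobs, the variable names, and the parameter values are assumptions chosen for illustration, not recommendations from the article.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Toy data: three well-separated Gaussian blobs in 50 dimensions.
centers = rng.normal(scale=10.0, size=(3, 50))
blob_labels = np.repeat(np.arange(3), 100)
X_blobs = centers[blob_labels] + rng.normal(size=(300, 50))

# perplexity must be smaller than the number of samples; random_state fixes the layout.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
X_blobs_2d = tsne.fit_transform(X_blobs)

print(X_blobs_2d.shape)
```

Plotting `X_blobs_2d` colored by `blob_labels` should show three distinct groups. scikit-learn's `TSNE` exposes `fit_transform` but no `transform`, so new points cannot be added to an existing embedding.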
t-SNE limitations:
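Slow for large datasets (exact t-SNE is quadratic in the number of points; the Barnes-Hut approximation that scikit-learn uses by default is O(n log n), but 10,000 points can still take minutes and 100,000 points far longer)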
Stochastic: different runs produce different layouts
Perplexity hyperparameter (typically 5-50) strongly affects the output
Does not preserve global structure well: clusters may be positioned arbitrarily relative to each other
Not suitable for preprocessing ML pipelines (non-parametric, cannot transform new data without re-running on the full dataset)
When to use t-SNE:
Visualizing clusters and structure in data for exploration
Creating diagnostic plots in papers or presentations
Datasets up to ~50,000 points with compute budget for minutes of runtime
UMAP: Faster and Better Global Structure
UMAP (Uniform Manifold Approximation and Projection) is a newer method that addresses t-SNE's main weaknesses: speed and global structure preservation.
import umap
# X and X_new: arrays with the same feature columns, assumed to be already loaded
reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
X_2d = reducer.fit_transform(X)
# A fitted UMAP model can embed new data without refitting
X_new_2d = reducer.transform(X_new)
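UMAP is often dramatically faster than t-SNE (the speedup can reach 10-100x on large datasets, depending on the implementations compared) because it builds an approximate nearest-neighbor graph rather than computing exact pairwise distances. It also tends to preserve global structure better: clusters that are genuinely far apart in high-dimensional space are more likely to end up far apart in the UMAP projection, although distances between clusters should still be read with caution.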
UMAP key hyperparameters:
n_neighbors: controls local vs. global structure balance. Small (5-15): focuses on local structure, tighter clusters. Large (50-200): preserves more global structure.
min_dist: controls how tightly points cluster in the projection. Small (0.0-0.1): tight clusters. Large (0.5-1.0): spread-out, more uniform distribution.
UMAP advantages over t-SNE:
Much faster on large datasets (often 10-100x), scales to millions of points
Better global structure preservation
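Can transform new data without re-running, via transform() (standard UMAP is not a parametric model; Parametric UMAP is a separate neural-network variant)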
More reproducible (though still has randomness)
Also useful for preprocessing (reduce to 50 dimensions before clustering)
When to use UMAP:
Large datasets where t-SNE is too slow
When you need to transform new data after fitting (production use cases)
When global structure matters (understanding relationships between clusters)
As a preprocessing step before clustering (HDBSCAN + UMAP is a powerful combination)
Practical Decision Framework
For preprocessing before ML algorithms: Use PCA. It is fast, deterministic, preserves linear structure, and compresses data efficiently. Reduce to a number of components that explain 90-95% of variance.
For 2D/3D visualization of small datasets (under ~50K): Use t-SNE. It produces the most visually clear cluster separations. Tune perplexity between 5 and 50.
For 2D/3D visualization of large datasets: Use UMAP. Faster and preserves global structure better.
For preprocessing before clustering: Use UMAP with n_components=10-50. Running HDBSCAN or k-means on UMAP-reduced data often produces better clusters than running directly on high-dimensional data.
For interpretability: Use PCA. Principal components are linear combinations of original features and can be inspected to understand what patterns they capture.
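The framework above can be written down as a small lookup. This is only a sketch: `choose_method`, its goal names, and the 50,000-point cutoff (taken from the article's rough threshold) are hypothetical, and real choices depend on the data.

```python
def choose_method(goal, n_samples=0, needs_transform=False):
    """Suggest a method following the decision framework in this article.

    goal: 'preprocess_ml', 'visualize', 'preprocess_clustering', or 'interpret'.
    n_samples: dataset size, used only for visualization.
    needs_transform: True if new points must be embedded after fitting.
    """
    if goal in ("preprocess_ml", "interpret"):
        return "PCA"
    if goal == "preprocess_clustering":
        return "UMAP"
    if goal == "visualize":
        # t-SNE cannot embed new points and becomes slow beyond roughly 50K samples.
        if needs_transform or n_samples > 50_000:
            return "UMAP"
        return "t-SNE"
    raise ValueError(f"unknown goal: {goal!r}")

print(choose_method("visualize", n_samples=5_000))
print(choose_method("visualize", n_samples=2_000_000))
```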
scikit-learn Practical Examples
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
# PCA as preprocessing in a pipeline
pipeline = Pipeline([
    ('pca', PCA(n_components=50)),
    ('classifier', RandomForestClassifier(n_estimators=100)),
])
pipeline.fit(X_train, y_train)
Dimensionality reduction is not always necessary. If your dataset has 20 features and 100,000 examples, you are unlikely to benefit from it. As a rough rule of thumb, the curse of dimensionality starts to matter in practice above about 100 features, or when your feature count exceeds your sample count. In these situations, PCA as a preprocessing step is almost always worth trying.
Written by
Mahmudul Haque Qudrati
CEO & ML Engineer
CEO and ML Engineer at Pristren. Builds AI-powered software for teams and writes about machine learning, LLMs, developer tools, and practical AI applications.