Modern datasets often have hundreds or thousands of features. A dataset of customer behavior might have 500 numeric features. A dataset of text documents represented as word counts might have 50,000 features. An image dataset, even at modest resolution, has thousands of pixel features per image.
High-dimensional data creates problems: visualization becomes impossible, distance metrics become unreliable, algorithms slow down, and overfitting becomes more likely. Dimensionality reduction addresses these problems by finding a lower-dimensional representation that preserves the structure that matters.
The Curse of Dimensionality
Before the algorithms, the problem they solve: in high-dimensional spaces, our geometric intuitions break down completely.
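In two dimensions, most of a unit square's area lies away from its boundary: only about 19% of uniformly random points fall within 0.05 of an edge. In 100 dimensions, more than 99.99% of points in the unit hypercube lie within 0.05 of some face. Volume concentrates in shells, not cores.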
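The shell effect can be checked numerically. The sketch below assumes points drawn uniformly from the unit hypercube and counts a point as "near the edge" when any coordinate lies within a margin of 0.05 of a face. The margin, the sample size, and the function names are illustrative choices.

```python
import numpy as np

def fraction_near_edge(dim, margin=0.05, n_points=20_000, seed=0):
    """Monte Carlo estimate of the share of uniform points in [0, 1]^dim
    that lie within `margin` of at least one face of the hypercube."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, dim))
    near_face = (points < margin) | (points > 1 - margin)
    return near_face.any(axis=1).mean()

def exact_fraction_near_edge(dim, margin=0.05):
    """Closed form: 1 minus the volume of the inner cube of side 1 - 2 * margin."""
    return 1 - (1 - 2 * margin) ** dim

for dim in (2, 10, 100):
    estimate = fraction_near_edge(dim)
    exact = exact_fraction_near_edge(dim)
    print(f"dim={dim:4d}  estimated={estimate:.4f}  exact={exact:.4f}")
```

With this margin the closed form gives 0.19 in 2 dimensions and more than 0.9999 in 100 dimensions.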
Distance metrics suffer: in high dimensions, the maximum and minimum distances between random points become nearly equal. Everything is approximately equidistant from everything else. This kills algorithms that rely on distance-based similarity (k-nearest neighbors, k-means clustering, kernel SVMs) because "near" and "far" lose meaning.
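A quick way to see distances concentrate is to measure the relative contrast, (max - min) / min, over the distances from one query point to a cloud of random points. The sketch below assumes uniform random data and Euclidean distance; real datasets with strong structure may behave less extremely.

```python
import numpy as np

def relative_contrast(dim, n_points=500, seed=0):
    """(max - min) / min over Euclidean distances from one random query
    point to `n_points` uniform random points in [0, 1]^dim."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, dim))
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

for dim in (2, 10, 100, 1000):
    print(f"dim={dim:5d}  relative contrast={relative_contrast(dim):.2f}")
```

As the dimension grows the contrast shrinks toward zero, which is the sense in which "near" and "far" stop being informative.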
A related problem: with 1,000 features and 1,000 training examples, even a plain linear model has as many free parameters as data points and can fit the training set exactly, noise included. Overfitting is nearly guaranteed without heavy regularization.
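The overfitting risk is easy to reproduce. The sketch below fits unregularized least squares to targets that are pure noise, using as many features as samples. It assumes Gaussian random data and uses 200 x 200 instead of 1,000 x 1,000 only to keep it fast.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 200, 200

X_train = rng.normal(size=(n_samples, n_features))
y_train = rng.normal(size=n_samples)  # pure noise, unrelated to X_train

# Ordinary least squares with no regularization
weights, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
train_mse = np.mean((X_train @ weights - y_train) ** 2)

# Fresh data from the same (signal-free) distribution
X_test = rng.normal(size=(n_samples, n_features))
y_test = rng.normal(size=n_samples)
test_mse = np.mean((X_test @ weights - y_test) ** 2)

print(f"train MSE: {train_mse:.2e}")
print(f"test MSE:  {test_mse:.2e}")
```

The training error comes out close to zero even though there is nothing to learn, while the error on fresh data does not.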
Dimensionality reduction addresses all of these by finding a low-dimensional representation (typically 2, 3, 10, or 50 dimensions) that preserves the structure relevant to your task.
Principal Component Analysis (PCA): Linear Projection
PCA is the simplest and most widely used dimensionality reduction method. It finds the directions in feature space that explain the most variance in the data, then projects the data onto those directions.
Concretely: the first principal component (PC1) is the direction along which the data varies most. PC2 is the direction of second-highest variance that is orthogonal (perpendicular) to PC1. And so on.
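To make the definition concrete, here is a minimal NumPy sketch that computes principal components with a singular value decomposition of the centered data. The toy data and variable names are assumptions for illustration; this is one standard way to compute PCA, not a description of any particular library's internals.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 300 points in 3D with a different spread along each axis
X = rng.normal(size=(300, 3)) * np.array([5.0, 1.0, 0.2])

# PCA operates on centered data
X_centered = X - X.mean(axis=0)

# Rows of Vt are the principal directions, ordered by decreasing variance
U, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)

explained_variance = singular_values ** 2 / (len(X) - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

# Project onto PC1 and PC2
X_projected = X_centered @ Vt[:2].T

print("PC1 . PC2 =", Vt[0] @ Vt[1])  # orthogonal, so approximately 0
print("explained variance ratio:", explained_variance_ratio)
```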
from sklearn.decomposition import PCA
# X: array of shape (n_samples, n_features), ideally standardized first
pca = PCA(n_components=50)  # Reduce to 50 dimensions
X_reduced = pca.fit_transform(X)
# How much variance is explained?
print(pca.explained_variance_ratio_.cumsum())
# Illustrative output: [0.12, 0.23, 0.33, ..., 0.95]
# Here 50 components explain 95% of the variance
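The cumulative explained variance ratio tells you how much of the data's variance you retained. If 50 components explain 95% of the variance in a 500-feature dataset, you have cut the feature count by 90% while retaining 95% of the variance. Variance is a proxy for information, not a guarantee: a low-variance direction can still matter for your task.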
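Rather than guessing the number of components, scikit-learn's PCA accepts a fraction between 0 and 1 as n_components and keeps just enough components to reach that share of variance. The sketch below uses the bundled digits dataset as a stand-in for your own data, and standardizes first because PCA is sensitive to feature scale.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)  # 1,797 samples, 64 pixel features
X_scaled = StandardScaler().fit_transform(X)

# Keep the smallest number of components whose cumulative
# explained variance reaches 95%
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

print("components kept:", pca.n_components_)
print("variance retained:", pca.explained_variance_ratio_.sum())
```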
PCA use cases:
- Preprocessing before feeding data to another algorithm (reduces dimensions, speeds up training, reduces overfitting)
- Visualizing data in 2D or 3D (plot PC1 vs. PC2 to see structure)
- Noise reduction (low-variance components often capture noise; dropping them removes noise)
- Feature compression for storage or transmission
PCA limitations: PCA is linear, so it can only find linear structure. If your data lies on a curved manifold in high-dimensional space, PCA cannot unfold it. Two clusters that can only be separated by a non-linear boundary (for example, one ring inside another) will look mixed in a PCA projection.
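The linearity limitation shows up on a toy example of two concentric rings. The sketch below projects the rings onto PC1 and checks how well a single cut-off separates the classes, compared with a non-linear feature (distance from the origin). The dataset parameters and the helper best_threshold_accuracy are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA

# Two classes: an outer ring (label 0) and an inner ring (label 1)
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

pc1 = PCA(n_components=1).fit_transform(X).ravel()  # linear projection
radius = np.linalg.norm(X, axis=1)                  # non-linear feature

def best_threshold_accuracy(feature, labels):
    """Best accuracy achievable with one cut-off on a single feature,
    trying both orientations of the cut."""
    best = 0.0
    for threshold in np.unique(feature):
        predicted = (feature > threshold).astype(int)
        accuracy = np.mean(predicted == labels)
        best = max(best, accuracy, 1 - accuracy)
    return best

print("PC1 cut-off accuracy:   ", best_threshold_accuracy(pc1, y))
print("radius cut-off accuracy:", best_threshold_accuracy(radius, y))
```

A cut-off on the radius separates the rings almost perfectly, while no cut-off along PC1 does, because the inner ring projects into the middle of the outer one.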
When to use PCA:
- General-purpose preprocessing for ML pipelines
- When you need a fast, deterministic, reproducible reduction
- When linear relationships are sufficient for your downstream task
- When interpretability matters (principal components are linear combinations of original features)
t-SNE: Non-Linear Visualization
t-SNE (t-distributed Stochastic Neighbor Embedding) is the standard tool for visualizing high-dimensional data in 2D or 3D. It captures non-linear structure that PCA misses.
The algorithm works by: computing pairwise similarities between all points in high-dimensional space (nearby points get high similarity, far points get near-zero similarity), then placing points randomly in 2D space and iteratively adjusting their positions to match the high-dimensional similarity structure as closely as possible.
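The two similarity structures and the mismatch between them can be sketched in a few lines of NumPy. This is a simplified illustration, not the full algorithm: it uses one fixed Gaussian bandwidth (sigma) for all points, whereas t-SNE tunes a bandwidth per point from the perplexity, and it only evaluates the objective (a KL divergence) for a random starting layout without optimizing it. All function names are hypothetical.

```python
import numpy as np

def squared_distances(points):
    """Matrix of squared Euclidean distances between all pairs of rows."""
    diff = points[:, None, :] - points[None, :, :]
    return (diff ** 2).sum(axis=-1)

def high_dim_similarities(X, sigma=1.0):
    """Gaussian similarities normalized to sum to 1.
    Simplification: a single fixed sigma instead of per-point bandwidths."""
    affinities = np.exp(-squared_distances(X) / (2 * sigma ** 2))
    np.fill_diagonal(affinities, 0.0)
    return affinities / affinities.sum()

def low_dim_similarities(Y):
    """Student-t (one degree of freedom) similarities in the embedding."""
    affinities = 1.0 / (1.0 + squared_distances(Y))
    np.fill_diagonal(affinities, 0.0)
    return affinities / affinities.sum()

def kl_divergence(P, Q, eps=1e-12):
    """Mismatch between the two similarity structures; lower is better."""
    return np.sum(P * np.log((P + eps) / (Q + eps)))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))                  # 50 points in 20 dimensions
Y_init = rng.normal(scale=1e-2, size=(50, 2))  # random starting layout in 2D

P = high_dim_similarities(X, sigma=3.0)
Q = low_dim_similarities(Y_init)
print("KL divergence of the random layout:", kl_divergence(P, Q))
```

The iterative adjustment described above amounts to moving the 2D points so that this divergence decreases.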
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
# X: array of shape (n_samples, n_features); labels: one integer class label per sample
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_2d = tsne.fit_transform(X)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap='tab10')
plt.title('t-SNE visualization')
plt.show()
t-SNE excels at revealing cluster structure. If your data has natural groupings, t-SNE will make them visually obvious. This is why t-SNE plots are ubiquitous in ML papers for showing that learned representations capture semantically meaningful structure.
t-SNE limitations:
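- Slow for large datasets: the exact algorithm is quadratic in the number of points, and even the Barnes-Hut approximation (scikit-learn's default, roughly O(n log n)) can take minutes on 10,000 points and far longer on 100,000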
- Stochastic: different runs produce different layouts
- Perplexity hyperparameter (typically 5-50) strongly affects the output
- Does not preserve global structure well: clusters may be positioned arbitrarily relative to each other
- Not suitable for preprocessing ML pipelines (non-parametric, cannot transform new data without re-running on the full dataset)
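Two of these limitations can be observed directly in scikit-learn. The sketch below runs t-SNE twice with different seeds on a 300-sample subset of the digits dataset (an arbitrary choice to keep it quick) and checks whether the estimator exposes a transform method. It passes init="random" so that the starting layout depends on the seed.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)
X_small = X[:300]  # small subset so the example runs quickly

# Same data and perplexity, different random seeds
layout_a = TSNE(n_components=2, perplexity=30, init="random",
                random_state=0).fit_transform(X_small)
layout_b = TSNE(n_components=2, perplexity=30, init="random",
                random_state=1).fit_transform(X_small)

print("identical layouts:", np.allclose(layout_a, layout_b))

# TSNE offers fit and fit_transform only, so a fitted model
# cannot be applied to new points
print("has transform():", hasattr(TSNE, "transform"))
```

Fixing random_state makes a single run repeatable, but it does not make the layout any less dependent on the seed you happened to pick.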
When to use t-SNE:
- Visualizing clusters and structure in data for exploration
- Creating diagnostic plots in papers or presentations
- Datasets up to ~50,000 points with compute budget for minutes of runtime
UMAP: Faster and Better Global Structure
UMAP (Uniform Manifold Approximation and Projection) is a newer method that addresses t-SNE's main weaknesses: speed and global structure preservation.
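import umap  # provided by the umap-learn package
# X: array of shape (n_samples, n_features); X_new: new rows with the same features
reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
X_2d = reducer.fit_transform(X)
# A fitted UMAP model can embed new data without re-running on the full dataset
X_new_2d = reducer.transform(X_new)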
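UMAP is typically much faster than t-SNE (roughly 10-100x on large datasets, depending on the implementation), in part because it builds its neighbor graph with approximate nearest-neighbor search instead of computing all pairwise distances. It also tends to preserve global structure better: clusters that are far apart in high-dimensional space are more likely to end up far apart in the UMAP projection, though distances between clusters should still be read with caution.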
UMAP key hyperparameters:
- n_neighbors: controls local vs. global structure balance. Small (5-15): focuses on local structure, tighter clusters. Large (50-200): preserves more global structure.
- min_dist: controls how tightly points cluster in the projection. Small (0.0-0.1): tight clusters. Large (0.5-1.0): spread-out, more uniform distribution.
UMAP advantages over t-SNE:
- 10-100x faster, scales to millions of points
- Better global structure preservation
- A fitted model can transform new data without re-running on the full dataset
- Layouts tend to be more stable across runs (though the algorithm is still stochastic unless random_state is fixed)
- Also useful for preprocessing (reduce to 50 dimensions before clustering)
When to use UMAP:
- Large datasets where t-SNE is too slow
- When you need to transform new data after fitting (production use cases)
- When global structure matters (understanding relationships between clusters)
- As a preprocessing step before clustering (HDBSCAN + UMAP is a powerful combination)
Practical Decision Framework
For preprocessing before ML algorithms: Use PCA. It is fast, deterministic, preserves linear structure, and compresses data efficiently. Reduce to a number of components that explain 90-95% of variance.
For 2D/3D visualization of small datasets (under ~50K): Use t-SNE. It produces the most visually clear cluster separations. Tune perplexity between 5 and 50.
For 2D/3D visualization of large datasets: Use UMAP. Faster and preserves global structure better.
For preprocessing before clustering: Use UMAP with n_components=10-50. Running HDBSCAN or k-means on UMAP-reduced data often produces better clusters than running directly on high-dimensional data.
For interpretability: Use PCA. Principal components are linear combinations of original features and can be inspected to understand what patterns they capture.
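Inspecting components in practice means reading the weights in pca.components_, where each row holds one component's weight on every original feature. The sketch below uses the iris dataset as a stand-in and lists features by absolute weight. Which features dominate depends entirely on your data.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)

pca = PCA(n_components=2).fit(X_scaled)

# Each row of components_ is a unit vector of feature weights
for index, component in enumerate(pca.components_, start=1):
    print(f"PC{index}")
    # List features from largest to smallest absolute weight
    for feature_index in np.argsort(-np.abs(component)):
        name = iris.feature_names[feature_index]
        print(f"  {name:20s} {component[feature_index]:+.3f}")
```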
scikit-learn Practical Examples
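from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
# PCA as preprocessing in a pipeline
# X_train, y_train: your training features and labels
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # PCA is sensitive to feature scale
    ('pca', PCA(n_components=50)),
    ('classifier', RandomForestClassifier(n_estimators=100))
])
pipeline.fit(X_train, y_train)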
Dimensionality reduction is not always necessary. If your dataset has 20 features and 100,000 examples, you are unlikely to benefit from it. As a rough rule of thumb, the curse of dimensionality becomes practically relevant above about 100 features, or when your feature count approaches or exceeds your sample count. In these situations, PCA as a preprocessing step is usually worth trying.
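"Worth trying" is best settled by measurement. The sketch below compares cross-validated accuracy of a k-nearest-neighbors classifier with and without a PCA step on a synthetic dataset. The dataset shape, the choice of 20 components, and the classifier are all assumptions for illustration, and the comparison can go either way on real data.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 200 features, only 10 of them informative
X, y = make_classification(n_samples=300, n_features=200,
                           n_informative=10, random_state=0)

baseline = make_pipeline(StandardScaler(), KNeighborsClassifier())
with_pca = make_pipeline(StandardScaler(), PCA(n_components=20),
                         KNeighborsClassifier())

# PCA sits inside the pipeline, so it is refit on each training fold
baseline_scores = cross_val_score(baseline, X, y, cv=5)
pca_scores = cross_val_score(with_pca, X, y, cv=5)

print(f"without PCA: {baseline_scores.mean():.3f}")
print(f"with PCA:    {pca_scores.mean():.3f}")
```

Keeping PCA inside the pipeline matters: fitting it on the full dataset before cross-validation would leak information from the validation folds.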