Most machine learning problems assume you have labeled examples of what you are trying to detect. Anomaly detection is different: you often cannot label anomalies in advance because you do not know what form they will take. You cannot collect examples of "all possible fraud patterns" before fraud happens. You cannot label "all possible equipment failure modes" before equipment fails. Anomaly detection finds what is unusual without being told what unusual looks like.
This unsupervised nature makes anomaly detection both powerful and tricky to evaluate correctly. The evaluation problem alone -- why accuracy is useless and what to use instead -- is where most practitioners go wrong.
What Makes Something an Anomaly
An anomaly is a data point that is significantly different from the rest of the data. This can mean:
Point anomalies: A single instance that is unusual in isolation. A transaction for $50,000 when the average is $85. A server responding in 30 seconds when the average is 200ms.
Contextual anomalies: An instance that is unusual given its context but not unusual in isolation. A temperature of 80 degrees Fahrenheit is normal in summer but anomalous in January. A $200 transaction is normal on a weekday afternoon from the cardholder's home city but unusual at 3am from a foreign IP address.
Collective anomalies: A sequence of instances that is collectively unusual even though individual instances are not. A series of small transactions (each under $10) from a single card in rapid succession -- individually normal, collectively suspicious.
Different algorithms are better suited to different types of anomalies. Understanding which type you are detecting shapes your algorithm choice.
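To make the point versus contextual distinction concrete, here is a minimal sketch that scores the same 80-degree reading twice: once against all readings, and once against January readings only. The data is synthetic, and the z-score cutoff of 3 is an assumption chosen for illustration, not a recommended setting.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic daily temperatures in Fahrenheit (illustrative values, not real data)
january = rng.normal(loc=30.0, scale=5.0, size=500)
july = rng.normal(loc=80.0, scale=5.0, size=500)
all_readings = np.concatenate([january, july])


def z_score(value, reference):
    """How many standard deviations `value` lies from the mean of `reference`."""
    return (value - reference.mean()) / reference.std()


reading = 80.0  # observed in January

global_z = z_score(reading, all_readings)  # context ignored
january_z = z_score(reading, january)      # compared against January only

THRESHOLD = 3.0  # assumed cutoff for this sketch
is_point_anomaly = abs(global_z) > THRESHOLD
is_contextual_anomaly = abs(january_z) > THRESHOLD

print(f"global z = {global_z:.2f}, flagged: {is_point_anomaly}")
print(f"January z = {january_z:.2f}, flagged: {is_contextual_anomaly}")
```

Scored globally, the reading looks unremarkable because summer values pull the mean up; scored within its context, it stands out. A detector for contextual anomalies needs the context (here, the month) as an input or as a grouping key.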
Isolation Forest
Isolation Forest is one of the most widely used general-purpose anomaly detection algorithms. The intuition is elegant: anomalies are rare and different, so they are easier to isolate than normal points.
The algorithm builds many random decision trees. At each node, it randomly selects a feature and a random split value. It partitions the data into left and right branches and repeats recursively. Normal points (which are dense and cluster with similar points) require many splits to isolate. Anomalous points (which are sparse and different from everything else) are isolated in very few splits.
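The anomaly score for each point is derived from its average path length across all trees: the number of splits needed to isolate it, normalized by the path length expected for a sample of that size. Short average path = anomalous. Long average path = normal.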
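The path-length intuition can be checked with a few lines of NumPy. The sketch below is a simplified illustration, not scikit-learn's implementation: it skips subsampling and score normalization, and the helper name `isolation_depth` is hypothetical. It only counts how many random axis-aligned splits are needed before a point is alone.

```python
import numpy as np


def isolation_depth(point, data, rng, depth=0, max_depth=50):
    """Count random axis-aligned splits needed to separate `point` from `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    feature = rng.integers(data.shape[1])
    lo, hi = data[:, feature].min(), data[:, feature].max()
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)
    # Keep only the points on the same side of the split as `point`
    if point[feature] < split:
        same_side = data[data[:, feature] < split]
    else:
        same_side = data[data[:, feature] >= split]
    return isolation_depth(point, same_side, rng, depth + 1, max_depth)


rng = np.random.default_rng(42)
normal_points = rng.normal(0.0, 1.0, size=(256, 2))  # dense synthetic cluster
outlier = np.array([8.0, 8.0])                        # far from the cluster
data = np.vstack([normal_points, outlier])

# Use the point closest to the cluster center as a representative inlier
inlier = normal_points[np.argmin(np.linalg.norm(normal_points, axis=1))]

n_trees = 200
outlier_depth = np.mean([isolation_depth(outlier, data, rng) for _ in range(n_trees)])
inlier_depth = np.mean([isolation_depth(inlier, data, rng) for _ in range(n_trees)])

print(f"average depth, outlier: {outlier_depth:.1f}")
print(f"average depth, inlier:  {inlier_depth:.1f}")
```

The outlier is typically cut off within a handful of splits, while the central point needs many more, which is the signal Isolation Forest turns into a score.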
from sklearn.ensemble import IsolationForest
model = IsolationForest(contamination=0.01, random_state=42)
# contamination: expected fraction of anomalies in the data
model.fit(X_train)
scores = model.decision_function(X_test) # lower = more anomalous; negative = below the contamination threshold
predictions = model.predict(X_test) # -1 = anomaly, 1 = normal
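The contamination parameter sets the decision threshold from the expected fraction of anomalies. If you set it to 0.01, the threshold is placed so that the 1% most anomalous training points are flagged; roughly the same fraction of new data will be flagged if it resembles the training data. Setting this correctly requires domain knowledge.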
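A quick way to see what contamination does is to count how many training points get flagged, and to compare that with thresholding the raw scores yourself. This is a sketch on synthetic data; the 0.5% custom cutoff is an arbitrary assumption for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = rng.normal(size=(2000, 3))  # synthetic stand-in for real features

model = IsolationForest(contamination=0.01, random_state=42).fit(X_train)

# Fraction of training points flagged; close to the contamination value by construction
flagged_fraction = np.mean(model.predict(X_train) == -1)

# Alternative: ignore the built-in cutoff and threshold the raw scores directly.
# score_samples returns higher values for more normal points.
raw_scores = model.score_samples(X_train)
custom_cutoff = np.quantile(raw_scores, 0.005)  # assumed: flag the lowest 0.5%
custom_flags = raw_scores < custom_cutoff

print(f"flagged with contamination=0.01: {flagged_fraction:.3%}")
print(f"flagged with custom cutoff:      {custom_flags.mean():.3%}")
```

Working from `score_samples` keeps the ranking separate from the cutoff, which helps when the alert budget changes but the model does not.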
Isolation Forest is fast (linear time complexity), scales to large datasets, handles high-dimensional data reasonably well, and requires no labeled anomalies. It is a strong first choice for most anomaly detection tasks.
One-Class SVM
One-Class SVM learns a boundary around the normal data in a high-dimensional feature space (using the kernel trick). Points outside this boundary are flagged as anomalies.
It learns to answer: "does this point look like it came from the same distribution as the training data?" If not, it is an anomaly.
from sklearn.svm import OneClassSVM
model = OneClassSVM(kernel='rbf', gamma='scale', nu=0.01)
# nu: upper bound on the fraction of training points treated as outliers (and lower bound on the fraction of support vectors)
model.fit(X_train)
predictions = model.predict(X_test) # -1 = anomaly, 1 = normal
One-Class SVM is effective when anomalies genuinely lie in a different region of feature space from normal points. It does not scale as well as Isolation Forest to large datasets (training time grows at least quadratically with the number of samples). The kernel and hyperparameters require careful tuning.
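Part of that tuning burden comes from the RBF kernel being distance-based, so features on very different scales can dominate each other. One common arrangement, sketched here on synthetic data with an assumed nu of 0.05, is to standardize features in a pipeline before the SVM.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Two synthetic features on very different scales
X_train = np.column_stack([
    rng.normal(100.0, 20.0, size=500),
    rng.normal(0.5, 0.1, size=500),
])

# Scaling happens inside the pipeline, so new data is transformed the same way
model = make_pipeline(
    StandardScaler(),
    OneClassSVM(kernel="rbf", gamma="scale", nu=0.05),
)
model.fit(X_train)

typical = np.array([[100.0, 0.5]])   # at the center of the training data
far_away = np.array([[300.0, 1.5]])  # ten standard deviations out on both features

print("typical: ", model.predict(typical))   # 1 = normal
print("far away:", model.predict(far_away))  # -1 = anomaly
print("training points flagged:", np.mean(model.predict(X_train) == -1))
```

The fraction of training points flagged should land near nu, though solver tolerance means it is not exact.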
Autoencoder Reconstruction Error
Autoencoders are neural networks trained to compress their input into a low-dimensional representation (the bottleneck) and then reconstruct the original input from that compressed representation.
The key insight: if you train an autoencoder only on normal data, it learns to compress and reconstruct normal patterns efficiently. When you feed it an anomalous example, the autoencoder cannot reconstruct it well because it has never learned the patterns that produce that anomaly. The reconstruction error (difference between input and reconstructed output) is high for anomalies and low for normal examples.
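import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim, bottleneck_dim):
        super().__init__()
        # Encoder: compress the input down to the bottleneck
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Linear(64, bottleneck_dim)
        )
        # Decoder: rebuild the input from the bottleneck
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck_dim, 64),
            nn.ReLU(),
            nn.Linear(64, input_dim)
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Reconstruction error as anomaly score
# Assumes `model` is an Autoencoder already trained on normal data and
# `x` is a float tensor of shape (n_samples, input_dim)
model.eval()
with torch.no_grad():
    reconstruction = model(x)
error = torch.mean((x - reconstruction) ** 2, dim=1)  # one score per sample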
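The snippet above leaves out training. The following is a minimal end-to-end sketch under stated assumptions: the data is synthetic (20 features generated from 3 latent factors), training is full-batch for brevity, and the learning rate and epoch count are arbitrary. It repeats the class definition so that it runs on its own.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class Autoencoder(nn.Module):
    # Same architecture as above, repeated so this sketch is self-contained
    def __init__(self, input_dim, bottleneck_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 64), nn.ReLU(), nn.Linear(64, bottleneck_dim)
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck_dim, 64), nn.ReLU(), nn.Linear(64, input_dim)
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))


# Synthetic "normal" data: 20 features driven by 3 latent factors plus small noise
latent = torch.randn(2000, 3)
mixing = torch.randn(3, 20)
x_normal = latent @ mixing + 0.05 * torch.randn(2000, 20)

# Synthetic anomalies: unstructured noise that ignores the latent structure
x_anomalous = 3.0 * torch.randn(50, 20)

model = Autoencoder(input_dim=20, bottleneck_dim=3)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

# Train on normal data only
model.train()
for epoch in range(300):
    optimizer.zero_grad()
    loss = loss_fn(model(x_normal), x_normal)
    loss.backward()
    optimizer.step()


def anomaly_score(model, x):
    """Per-sample mean squared reconstruction error."""
    model.eval()
    with torch.no_grad():
        return torch.mean((x - model(x)) ** 2, dim=1)


normal_error = anomaly_score(model, x_normal).mean().item()
anomalous_error = anomaly_score(model, x_anomalous).mean().item()
print(f"mean error, normal:    {normal_error:.3f}")
print(f"mean error, anomalous: {anomalous_error:.3f}")
```

In a real setup you would train in mini-batches, hold out normal data for validation, and score data the model has not seen during training.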
Autoencoders work particularly well for complex, high-dimensional data where traditional algorithms struggle. They are commonly used for image anomaly detection (detecting manufacturing defects in product photos) and sequence anomaly detection (detecting unusual network traffic patterns).
The bottleneck size is a critical hyperparameter: too small and the autoencoder cannot reconstruct even normal data; too large and it can reconstruct anomalies too (because the bottleneck is not restrictive enough to force compression of only common patterns).
Why Accuracy is Useless: Use Precision-Recall
If 1% of your data is anomalous, a model that predicts "normal" for everything achieves 99% accuracy. This is the same imbalanced classification problem described in the metrics guide, and it is even more extreme for anomaly detection because anomaly rates are often 0.1% or lower.
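The arithmetic is easy to verify. This small sketch builds labels with a 1% anomaly rate (the counts are illustrative) and scores a "model" that always predicts normal.

```python
import numpy as np

n_total, n_anomalies = 10_000, 100  # 1% anomaly rate (illustrative counts)

y_true = np.zeros(n_total, dtype=int)
y_true[:n_anomalies] = 1  # 1 = anomaly, 0 = normal

y_pred = np.zeros(n_total, dtype=int)  # always predicts "normal"

accuracy = np.mean(y_pred == y_true)
true_positives = np.sum((y_pred == 1) & (y_true == 1))
recall = true_positives / np.sum(y_true == 1)

print(f"accuracy = {accuracy:.2%}")  # 99.00%
print(f"recall   = {recall:.2%}")    # 0.00%
```

The model scores 99% accuracy while catching none of the anomalies, which is why accuracy says nothing useful here.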
The right metrics for anomaly detection:
Precision: Of the points flagged as anomalies, what fraction are actually anomalous? High precision means low false alarm rate.
Recall: Of all actual anomalies, what fraction were flagged? High recall means few missed anomalies.
Precision-Recall AUC: The area under the precision-recall curve across all thresholds. It summarizes ranking quality in a single number and is a common headline metric for anomaly detection.
Average Precision (AP): A weighted mean of the precision at each threshold, where each weight is the increase in recall from the previous threshold. It summarizes the same curve as PR-AUC, but as a step-wise sum rather than a trapezoidal (interpolated) area, so the two values can differ slightly.
When you have labeled anomalies for evaluation (even if you cannot use them for training), use precision-recall metrics. When you have no labels at all, you must evaluate the system qualitatively: are the flagged anomalies actually interesting to human reviewers?
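When evaluation labels are available, these metrics can be computed with scikit-learn. The sketch below uses synthetic data with 10 labeled anomalies out of 1,000 points. One detail that is easy to get wrong: `decision_function` is higher for normal points, while the metric functions expect higher scores for the positive class, so the score is negated.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

rng = np.random.default_rng(0)

# Synthetic data: a dense normal cluster plus 10 scattered, far-away anomalies
X_normal = rng.normal(0.0, 1.0, size=(990, 2))
angles = rng.uniform(0.0, 2.0 * np.pi, size=10)
radii = rng.uniform(6.0, 9.0, size=10)
X_anom = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])

X = np.vstack([X_normal, X_anom])
y_true = np.concatenate([np.zeros(990, dtype=int), np.ones(10, dtype=int)])  # 1 = anomaly

# Labels are used only for evaluation, never for fitting
model = IsolationForest(random_state=42).fit(X)

# Negate so that higher = more anomalous, matching the positive class
anomaly_scores = -model.decision_function(X)

precision, recall, thresholds = precision_recall_curve(y_true, anomaly_scores)
pr_auc = auc(recall, precision)                       # trapezoidal area
ap = average_precision_score(y_true, anomaly_scores)  # step-wise sum

print(f"PR-AUC = {pr_auc:.3f}")
print(f"AP     = {ap:.3f}")
```

On real data the anomalies are rarely this well separated, so expect lower numbers; the value of the metric lies in comparing models and thresholds on the same labeled set.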
Use Case: Fraud Detection
Credit card fraud detection is the canonical anomaly detection application. Labels may exist historically, but new fraud patterns emerge constantly that no historical label covers. Anomaly detection identifies transactions that are unusual relative to a user's normal behavior, regardless of whether that specific fraud pattern was seen before.
Features for fraud detection: transaction amount relative to user's history, time of day, merchant category, geographic location relative to recent activity, frequency of transactions in a time window, velocity of spending.
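Two of these features, amount relative to the user's history and transaction frequency in a time window, can be sketched with the standard library alone. The `Transaction` schema, the feature names, and the 10-minute window are hypothetical choices for illustration; a production system would draw on a feature store or streaming pipeline.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass
class Transaction:  # hypothetical schema for illustration
    user_id: str
    timestamp: float  # seconds since epoch
    amount: float


def extract_features(transactions, window_seconds=600):
    """Yield (transaction, features); assumes transactions are sorted by timestamp."""
    totals = defaultdict(float)  # running sum of past amounts per user
    counts = defaultdict(int)    # number of past transactions per user
    recent = defaultdict(deque)  # recent timestamps per user

    for tx in transactions:
        window = recent[tx.user_id]
        # Drop timestamps that have fallen out of the sliding window
        while window and tx.timestamp - window[0] > window_seconds:
            window.popleft()

        n_past = counts[tx.user_id]
        past_mean = totals[tx.user_id] / n_past if n_past else None
        features = {
            # Ratio to the user's historical mean; 1.0 when there is no usable history
            "amount_ratio": tx.amount / past_mean if past_mean else 1.0,
            # Number of this user's earlier transactions inside the window
            "tx_in_window": len(window),
        }
        yield tx, features

        # Update state only after scoring, so a transaction never sees itself
        totals[tx.user_id] += tx.amount
        counts[tx.user_id] += 1
        window.append(tx.timestamp)


history = [
    Transaction("u1", 0.0, 80.0),
    Transaction("u1", 3600.0, 90.0),
    Transaction("u1", 7200.0, 85.0),
    Transaction("u1", 7260.0, 5000.0),  # large amount, one minute after the previous one
]

rows = [features for _, features in extract_features(history)]
for row in rows:
    print(row)
```

The resulting feature dictionaries can be stacked into a matrix and passed to any of the detectors above.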
Both Isolation Forest (for initial suspicious flagging) and autoencoders (for detecting novel fraud patterns) are commonly deployed in production fraud systems, often alongside supervised classifiers trained on historical labeled fraud.
Use Case: System Monitoring
Detecting unusual system behavior (CPU spikes, memory leaks, unusual request patterns, latency anomalies) is a strong fit for time series anomaly detection. Normal system behavior has regular patterns (daily traffic cycles, weekly patterns). Anomalies are deviations from these patterns.
For time series, reconstruction-error-based methods (LSTM autoencoders) or statistical methods (tracking a rolling mean and standard deviation, and flagging any new value that deviates from the rolling mean by more than 3 standard deviations in either direction) are common. The "seasonal decomposition" approach (separate trend, seasonality, and residual components; flag unusual residuals) works well for well-behaved time series.
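The rolling-statistics approach fits in a few lines of NumPy. This is a sketch on a synthetic latency series; the window of 30 points, the 3-sigma cutoff, and the function name `rolling_sigma_flags` are assumptions for illustration.

```python
import numpy as np


def rolling_sigma_flags(values, window=30, n_sigma=3.0):
    """Flag values more than n_sigma rolling standard deviations from the rolling mean.

    Statistics come from the `window` points before each value, so a spike
    does not inflate its own baseline.
    """
    values = np.asarray(values, dtype=float)
    flags = np.zeros(len(values), dtype=bool)
    for i in range(window, len(values)):
        history = values[i - window:i]
        mean, std = history.mean(), history.std()
        if std > 0:
            flags[i] = abs(values[i] - mean) > n_sigma * std
    return flags


rng = np.random.default_rng(0)
t = np.arange(500)

# Synthetic latency in ms: slow cycle plus noise, with one injected spike
latency_ms = 200 + 20 * np.sin(2 * np.pi * t / 100) + rng.normal(0, 5, size=500)
latency_ms[400] += 150

flags = rolling_sigma_flags(latency_ms)
print("flagged indices:", np.flatnonzero(flags))
```

The first `window` points are never flagged because there is no history yet. A fixed window also lags behind strong seasonality, which is where decomposition or learned models tend to do better.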
Use Case: Quality Control
Manufacturing defect detection in images is a prime autoencoder use case. Train on thousands of images of defect-free products. Flag products whose reconstruction error exceeds a threshold as potentially defective. This works even for defect types never seen before, which is the key advantage over supervised classification.
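The threshold itself has to come from somewhere. One common option, sketched here with synthetic numbers, is to take a high quantile of the reconstruction errors measured on held-out defect-free products. The 99th percentile (roughly 1% false alarms on good products) is an assumed policy, and the error values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for reconstruction errors on held-out defect-free images (synthetic)
validation_errors = rng.gamma(shape=2.0, scale=0.01, size=5000)

# Assumed policy: accept roughly 1% false alarms on defect-free products
threshold = np.quantile(validation_errors, 0.99)


def is_defective(errors, threshold=threshold):
    """Flag products whose reconstruction error exceeds the threshold."""
    return np.asarray(errors) > threshold


new_errors = np.array([0.015, 0.02, 0.35])  # hypothetical errors for three new products
print(f"threshold = {threshold:.4f}")
print(is_defective(new_errors))
```

Raising the quantile reduces false alarms at the cost of missing subtle defects, so the choice usually depends on how expensive each kind of mistake is on the production line.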
Anomaly detection is not a replacement for supervised classification when you have labels -- supervised methods will typically achieve better performance when you have sufficient labeled examples. But for rare events, novel patterns, and cases where you cannot anticipate all failure modes in advance, anomaly detection fills a gap that supervised learning cannot.
Keep Reading
- ML Model Evaluation Metrics Guide -- precision-recall tradeoffs explained in full detail
- Machine Learning Complete Guide for Software Developers -- the broader ML landscape to contextualize where anomaly detection fits
- Overfitting and Underfitting: How to Fix Them -- autoencoders overfit too, and the same remedies apply