A fraud detection model that flags zero transactions as fraudulent achieves 99.9% accuracy on a dataset where 99.9% of transactions are legitimate. A model that never fires is useless but technically very accurate. This is the fundamental problem with accuracy on imbalanced datasets -- and it illustrates why choosing the right evaluation metric is one of the most important decisions in an ML project.
The metric you choose is not a neutral technical decision. It defines what "good model" means. Optimize for the wrong metric and you will build a model that looks excellent in development and fails in production.
Accuracy: When It Works and When It Lies
Accuracy = correct predictions / total predictions.
Accuracy is a reasonable metric when: your classes are roughly balanced (the minority class is at least 20-30% of the data) and you care equally about all types of errors.
Accuracy misleads when classes are imbalanced. In the fraud detection example above, 0.1% of transactions are fraudulent. A model that predicts "not fraud" for every transaction is 99.9% accurate. A model that catches 80% of fraud with a 5% false positive rate is far more valuable, yet its accuracy is only about 95%.
Any problem with rare events -- fraud, medical disease, equipment failure, cybersecurity intrusions -- will expose accuracy as a useless metric. This covers a surprisingly large fraction of real-world ML applications.
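The arithmetic behind the fraud example can be checked directly. The following is a minimal sketch: the 0.1% fraud rate, 80% recall, and 5% false positive rate come from the text, while the population of 1,000,000 transactions and the `accuracy` helper are assumptions made for illustration.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical population size; the rates are taken from the example in the text.
n_total = 1_000_000
n_fraud = n_total // 1000          # 0.1% fraudulent
n_legit = n_total - n_fraud

# Model A: never flags anything, so every fraud is a false negative.
acc_never_fires = accuracy(tp=0, tn=n_legit, fp=0, fn=n_fraud)

# Model B: catches 80% of fraud, falsely flags 5% of legitimate transactions.
tp = n_fraud * 80 // 100
fn = n_fraud - tp
fp = n_legit * 5 // 100
tn = n_legit - fp
acc_useful = accuracy(tp, tn, fp, fn)

print(f"never fires:  accuracy={acc_never_fires:.4f}, fraud caught=0")
print(f"useful model: accuracy={acc_useful:.4f}, fraud caught={tp}")
```

Under these assumptions the model that never fires wins on accuracy while catching no fraud at all, which is the failure mode described above.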
The Confusion Matrix: A Better Starting Point
Before any aggregate metric, look at the confusion matrix. For a binary classifier, it shows four counts:
- True Positives (TP): predicted positive, actually positive (caught fraud)
- True Negatives (TN): predicted negative, actually negative (correctly cleared transaction)
- False Positives (FP): predicted positive, actually negative (false alarm -- real transaction flagged as fraud)
- False Negatives (FN): predicted negative, actually positive (missed fraud -- fraud that slipped through)
Every aggregate metric is a function of these four numbers. Understanding what each metric prioritizes helps you choose the right one.
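As a rough illustration of the four counts, the sketch below tallies them from a list of true labels and a list of predictions. The helper name `confusion_counts`, the 1/0 label encoding, and the toy data are all invented for this example.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels where 1 = positive, 0 = negative."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1      # caught fraud
        elif predicted == 0 and actual == 0:
            tn += 1      # correctly cleared
        elif predicted == 1 and actual == 0:
            fp += 1      # false alarm
        else:
            fn += 1      # missed fraud
    return tp, tn, fp, fn

# Toy data, invented for illustration: 1 = fraud, 0 = legitimate.
y_true = [1, 0, 0, 1, 0, 1, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")  # TP=3 TN=4 FP=2 FN=1
```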
Precision and Recall
Precision = TP / (TP + FP). Of all examples the model predicted positive, what fraction were actually positive?
High precision means: when the model says "fraud," it is usually right. Few of its alerts are false alarms.
Recall (also called sensitivity or true positive rate) = TP / (TP + FN). Of all actually positive examples, what fraction did the model catch?
High recall means: the model catches most positive cases. Low miss rate.
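Both definitions translate into one-line functions of the confusion matrix counts. This is a minimal sketch: the function names are illustrative, returning 0.0 when a denominator is zero is one possible convention rather than a standard, and the counts are hypothetical (a model that catches 800 of 1,000 frauds while raising 49,950 false alarms).

```python
def precision(tp, fp):
    """TP / (TP + FP). Returns 0.0 if the model made no positive predictions
    (an assumed convention for this sketch)."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """TP / (TP + FN). Returns 0.0 if there are no actual positives
    (an assumed convention for this sketch)."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical counts: 800 frauds caught, 200 missed, 49,950 false alarms.
tp, fp, fn = 800, 49_950, 200
p = precision(tp, fp)
r = recall(tp, fn)
print(f"precision={p:.3f} recall={r:.3f}")
```

With these counts recall is high while precision is very low: most fraud is caught, but most alerts are false alarms. The two metrics can diverge sharply on the same model.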
The tension between precision and recall is the most important trade-off in classification problems. You almost always face a choice between the two.
Fraud detection: You probably want high recall (catch most fraud, even at the cost of more false alarms that must be reviewed). Missing fraud is expensive and damages customer trust. Falsely flagging a legitimate transaction is annoying but recoverable.
Medical screening test: High recall is critical (do not miss cancer). You accept lower precision (some flagged patients will not have cancer, leading to follow-up tests). False negatives (missed cancers) are far more costly than false positives (unnecessary follow-up).
Email spam filter: High precision matters more (do not send legitimate email to spam). Users tolerate some spam in their inbox better than they tolerate important emails being filtered out.
There is no universal answer. The right precision/recall trade-off depends on the relative cost of false positives vs. false negatives in your specific problem.
You control this trade-off via the classification threshold. By default, most classifiers predict "positive" if the predicted probability exceeds 0.5. Lowering the threshold increases recall (you flag more things as positive), usually at the cost of precision. Raising it usually increases precision at the cost of recall.
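The effect of moving the threshold can be seen by sweeping it over a small set of scored examples. The sketch below uses invented labels and scores, and the helper `precision_recall_at` is a hypothetical name. It predicts positive when the score strictly exceeds the threshold, matching the wording above.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when predicting positive for scores > threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if s > threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s > threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s <= threshold and y == 1)
    prec = tp / (tp + fp) if (tp + fp) else 0.0
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    return prec, rec

# Invented labels and predicted probabilities (1 = positive class).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.45, 0.2, 0.6, 0.4, 0.3, 0.25, 0.1, 0.05]

# Walk the threshold downward: recall rises while precision falls.
for threshold in (0.8, 0.5, 0.3, 0.15):
    prec, rec = precision_recall_at(y_true, scores, threshold)
    print(f"threshold={threshold:.2f} precision={prec:.2f} recall={rec:.2f}")
```

On this toy data precision falls from 1.00 to 0.50 as recall climbs from 0.25 to 1.00. Real score distributions will not always be this tidy.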
F1 Score: Balancing Precision and Recall
F1 = 2 * (precision * recall) / (precision + recall)
F1 is the harmonic mean of precision and recall. It gives a single number that balances both. A model that has high precision but near-zero recall will have a low F1. A model that has high recall but near-zero precision will also have a low F1.
F1 is useful when you need a single number to compare models and you want to reward balance between precision and recall. It is commonly used in NLP benchmarks and information retrieval.
F-beta is a generalization: F-beta = (1 + beta^2) * (precision * recall) / (beta^2 * precision + recall). Setting beta > 1 weights recall higher (useful when missing positives is more costly), and beta < 1 weights precision higher. Beta = 2 treats recall as twice as important as precision; beta = 1 recovers F1.
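The F1 and F-beta formulas above fit in a single function. This is a sketch with an illustrative name (`f_beta`) and invented precision/recall pairs. Returning 0.0 when both inputs are zero is an assumed convention.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives F1.

    Returns 0.0 when precision and recall are both zero (assumed convention).
    """
    denom = beta**2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / denom

# Invented precision/recall pairs for illustration.
print(f"balanced (0.9, 0.9):        F1   = {f_beta(0.9, 0.9):.3f}")
print(f"lopsided (0.99, 0.01):      F1   = {f_beta(0.99, 0.01):.3f}")
print(f"recall-heavy (0.5, 0.8):    F2   = {f_beta(0.5, 0.8, beta=2):.3f}")
print(f"precision-heavy (0.5, 0.8): F0.5 = {f_beta(0.5, 0.8, beta=0.5):.3f}")
```

The lopsided pair shows why the harmonic mean is used: one near-zero component drags the score toward zero, where an arithmetic mean would report about 0.5.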
AUC-ROC: Threshold-Independent Evaluation
The ROC (Receiver Operating Characteristic) curve plots recall (true positive rate) against the false positive rate at every possible classification threshold. A random classifier produces a diagonal line (AUC = 0.5). A perfect classifier has AUC = 1.0.
AUC-ROC (the area under the ROC curve) measures how well a model distinguishes between classes regardless of the threshold you choose. It is threshold-independent, which is useful when you have not yet decided on a threshold or when thresholds will differ across deployment contexts.
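One way to see why no threshold is needed: AUC-ROC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting as half. The sketch below computes it that way on invented data. The function name `roc_auc` is illustrative, and the quadratic pairwise loop is meant for small examples, not production use.

```python
def roc_auc(y_true, scores):
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a win. O(n_pos * n_neg), so only for small examples.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Invented labels and scores (1 = positive class).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.45, 0.2, 0.6, 0.4, 0.3, 0.25, 0.1, 0.05]
print(f"AUC-ROC = {roc_auc(y_true, scores):.3f}")
```

Here 19 of the 24 positive-negative pairs are ordered correctly, so the result is 19/24.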
AUC-ROC is also insensitive to the class ratio: the true positive rate and the false positive rate are each computed within a single class, so you do not need to rebalance the dataset before evaluation.
However, AUC-ROC can be misleading for highly imbalanced datasets. When you have 99.9% negative examples, a model with high AUC might still have terrible precision at any useful operating point. In these cases, the Precision-Recall AUC (area under the precision-recall curve) is more informative. It focuses on the minority class performance without being inflated by the large number of true negatives.
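The gap between the two areas can be reproduced on a small synthetic dataset. This sketch assumes 1% positives (rather than 0.1%, to keep the example small) and a model that scores 50 "hard" negatives above every positive. The data, both function names, and the step-wise average precision used as the precision-recall area are choices made for this illustration.

```python
def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve.

    Averages the precision measured at the rank of each positive example.
    Assumes no positive shares a score with a negative.
    """
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    tp = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            tp += 1
            total += tp / rank
    return total / n_pos

# Synthetic data: 10 positives, 50 hard negatives scored above them,
# and 940 easy negatives scored far below.
y_true = [1] * 10 + [0] * 50 + [0] * 940
scores = [0.9] * 10 + [0.95] * 50 + [0.1] * 940

auc = roc_auc(y_true, scores)
ap = average_precision(y_true, scores)
print(f"AUC-ROC = {auc:.3f}, PR AUC (average precision) = {ap:.3f}")
```

The 940 easy negatives lift AUC-ROC to roughly 0.95. The precision-recall view ignores them and lands near 0.10, because every positive is outranked by 50 false alarms.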
MAE vs. RMSE for Regression
For regression problems (predicting continuous values), the common metrics are Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE).
MAE = average of |predicted - actual|. Easy to interpret: on average, predictions are off by this many units. More robust to outliers than RMSE -- a single prediction that is 1000 units off contributes in proportion to its size, not its square.
RMSE = square root of the average of (predicted - actual)^2. Because errors are squared before averaging, large errors are penalized much more than small ones. A single prediction that is 1000 units off has a massive impact on RMSE.
Choose MAE when: all errors are approximately equally costly, and you want an interpretable metric in the same units as the target.
Choose RMSE when: large errors are especially harmful and you want to penalize them more. In house price prediction, being off by $100,000 is much worse than being off by $10,000 -- not just 10x worse, but potentially catastrophically worse for the buyer or seller. RMSE captures this non-linear cost of large errors.
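A small numeric comparison makes the difference concrete. In this sketch the two prediction sets are invented so that they have the same MAE: one spreads the error evenly and the other concentrates it in a single large miss. The function names are illustrative.

```python
import math

def mae(actual, predicted):
    """Mean absolute error, in the same units as the target."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error; squaring makes large misses dominate."""
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Invented data: both prediction sets have a total absolute error of 50.
actual = [100, 100, 100, 100, 100]
small_errors = [110, 90, 110, 90, 110]    # every prediction off by 10
one_outlier = [100, 100, 100, 100, 150]   # four exact, one off by 50

for name, predicted in (("small errors", small_errors), ("one outlier", one_outlier)):
    print(f"{name}: MAE={mae(actual, predicted):.2f} RMSE={rmse(actual, predicted):.2f}")
```

MAE cannot tell the two cases apart, while RMSE more than doubles for the set containing the single large miss.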
When to Optimize Which Metric
The metric you report to stakeholders and the metric you optimize during development should be aligned. If you care about recall in production, reflect that in training (for example with class weights or a cost-sensitive loss, since recall itself is not differentiable) or optimize it during threshold selection.
Business context determines the right metric:
- Customer churn prediction: Recall (do not miss churning customers who could be retained with outreach)
- Credit scoring: Precision + regulatory requirements (cannot discriminate; must be explainable)
- Product recommendation: Precision@K (of the top K recommendations shown, how many were relevant?)
- Demand forecasting: MAE or MAPE (mean absolute percentage error, useful when you care about relative not absolute error)
- Medical diagnosis: Recall for screening, specificity for confirmatory testing
- Content moderation: Recall for serious violations, precision for borderline cases
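Two of the metrics in the list above, Precision@K and MAPE, have not been defined elsewhere in this article, so here is a minimal sketch of each. The function names, item IDs, and forecast numbers are invented. Dividing by K even when fewer than K items are recommended is one common convention, not the only one.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant.

    Divides by k even if fewer than k items were recommended (assumed convention).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k

def mape(actual, predicted):
    """Mean absolute percentage error, in percent. Undefined if any actual is 0."""
    return 100 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)

# Invented recommendation data: ranked list shown to the user, and ground truth.
recommended = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "f"}
print(f"Precision@3 = {precision_at_k(recommended, relevant, k=3):.3f}")

# Invented demand data: actual vs. forecast units sold.
actual = [100, 200, 400]
forecast = [110, 180, 400]
print(f"MAPE = {mape(actual, forecast):.2f}%")
```

MAPE breaks down when actual values are zero or close to zero, which is worth checking before adopting it for sparse demand series.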
Ask "what is the cost of a false positive relative to a false negative?" before choosing a metric. Quantify these costs if possible. The ratio of costs determines which metric to prioritize.
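Once the two costs are quantified, the cost ratio can drive threshold selection directly. The sketch below picks the candidate threshold with the lowest total error cost. The data, the candidate thresholds, the cost values, and the function names `total_cost` and `best_threshold` are all hypothetical.

```python
def total_cost(y_true, scores, threshold, cost_fp, cost_fn):
    """Total cost of errors when predicting positive for scores > threshold."""
    fp = sum(1 for y, s in zip(y_true, scores) if s > threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s <= threshold and y == 1)
    return fp * cost_fp + fn * cost_fn

def best_threshold(y_true, scores, candidates, cost_fp, cost_fn):
    """Candidate threshold with the lowest total cost (first one wins ties)."""
    return min(candidates, key=lambda t: total_cost(y_true, scores, t, cost_fp, cost_fn))

# Invented validation labels and scores (1 = positive class).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.45, 0.2, 0.6, 0.4, 0.3, 0.25, 0.1, 0.05]
candidates = [0.8, 0.5, 0.3, 0.15]

# Hypothetical costs: first a miss costs 10x a false alarm, then the reverse.
miss_is_costly = best_threshold(y_true, scores, candidates, cost_fp=1, cost_fn=10)
alarm_is_costly = best_threshold(y_true, scores, candidates, cost_fp=10, cost_fn=1)
print(f"FN costs 10x FP -> threshold {miss_is_costly}")
print(f"FP costs 10x FN -> threshold {alarm_is_costly}")
```

The same model ends up at opposite ends of the threshold range depending on which error is more expensive, which is why the cost ratio matters more than any default threshold.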