ML Model Evaluation Metrics: Why Accuracy Lies and What to Use Instead

Accuracy is misleading on imbalanced datasets. Here is when to use precision, recall, F1, AUC-ROC, MAE, RMSE, and how to choose the right metric for your problem.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 18, 2026

10 min read

// tags

#model-evaluation #precision #recall #auc-roc #f1-score #machine-learning-metrics

Precision and Recall

Precision = TP / (TP + FP), where TP is the number of true positives and FP the number of false positives. Of all examples the model predicted positive, what fraction were actually positive?

High precision means: when the model says "fraud," it is usually right. Low false alarm rate.

Recall (also called sensitivity or true positive rate) = TP / (TP + FN), where FN is the number of false negatives. Of all actually positive examples, what fraction did the model catch?

High recall means: the model catches most positive cases. Low miss rate.
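As a minimal sketch of the two formulas above, the snippet below counts TP, FP, and FN directly from label lists. The function name `precision_recall` and the toy labels are illustrative assumptions, not part of any library.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # misses
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data (made up): 4 real positives, the model flags 3 examples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the three flagged examples are real positives (precision 2/3), but only two of the four real positives were caught (recall 1/2).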

The tension between precision and recall is the central trade-off in classification problems. For a given model, improving one usually costs you some of the other.

Fraud detection: You probably want high recall (catch most fraud, even at the cost of more false alarms that must be reviewed). Missing fraud is expensive and damages customer trust. Falsely flagging a legitimate transaction is annoying but recoverable.

Medical screening test: High recall is critical (do not miss cancer). You accept lower precision (some flagged patients will not have cancer, leading to follow-up tests). False negatives (missed cancers) are far more costly than false positives (unnecessary follow-up).

Email spam filter: High precision matters more (do not send legitimate email to spam). Users tolerate some spam in their inbox better than they tolerate important emails being filtered out.

There is no universal answer. The right precision/recall trade-off depends on the relative cost of false positives vs. false negatives in your specific problem.

You control this trade-off via the classification threshold. By default, most classifiers predict "positive" if the predicted probability exceeds 0.5. Lowering the threshold increases recall (you flag more things as positive) at the cost of precision. Raising it increases precision at the cost of recall.
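The effect of moving the threshold can be sketched with a small sweep. The scores below are invented for illustration, and `precision_recall_at` is a hypothetical helper, not a library function.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when 'positive' means score > threshold."""
    y_pred = [1 if s > threshold else 0 for s in scores]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data (made up): predicted probabilities for 4 positives and 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.55, 0.35, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05]

results = {th: precision_recall_at(y_true, scores, th) for th in (0.3, 0.5, 0.65)}
for th, (p, r) in results.items():
    print(f"threshold={th:.2f}  precision={p:.2f}  recall={r:.2f}")
```

On this data, the low threshold catches every positive but admits two false alarms, while the high threshold produces no false alarms and misses half the positives.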

F1 Score: Balancing Precision and Recall

F1 = 2 * (precision * recall) / (precision + recall)

F1 is the harmonic mean of precision and recall. It gives a single number that balances both. A model that has high precision but near-zero recall will have a low F1. A model that has high recall but near-zero precision will also have a low F1.

F1 is useful when you need a single number to compare models and you want to reward balance between precision and recall. It is commonly used in NLP benchmarks and information retrieval.
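To see why the harmonic mean matters, the sketch below compares F1 with a plain arithmetic mean for a lopsided model. The precision and recall values are made-up inputs.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# A lopsided model: very precise, but it misses almost every positive.
lopsided_f1 = f1_score(0.95, 0.05)
lopsided_avg = (0.95 + 0.05) / 2

# A balanced model with the same arithmetic mean.
balanced_f1 = f1_score(0.5, 0.5)

print(f"lopsided: F1={lopsided_f1:.3f}, arithmetic mean={lopsided_avg:.3f}")
print(f"balanced: F1={balanced_f1:.3f}")
```

The arithmetic mean rates both models at 0.5, while F1 drops the lopsided one to roughly 0.1, which is the behavior described above.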

F-beta is a generalization: F-beta = (1 + beta^2) * (precision * recall) / (beta^2 * precision + recall). Setting beta > 1 weights recall higher (useful when missing positives is more costly), and setting beta < 1 weights precision higher. Beta = 2 treats recall as twice as important as precision, and beta = 1 reduces to F1.
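A short sketch of the F-beta formula, using made-up precision and recall values, shows how beta shifts the score toward whichever quantity it favors. The helper name `fbeta` is an assumption for this example.

```python
def fbeta(precision, recall, beta):
    """F-beta score; beta > 1 favors recall, beta < 1 favors precision."""
    denom = beta**2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / denom


# A model with better recall than precision (illustrative numbers).
p, r = 0.5, 0.8
f_half = fbeta(p, r, 0.5)  # leans toward precision
f_one = fbeta(p, r, 1.0)   # plain F1
f_two = fbeta(p, r, 2.0)   # leans toward recall

print(f"F0.5={f_half:.3f}  F1={f_one:.3f}  F2={f_two:.3f}")
```

Because recall is the stronger of the two numbers here, the score rises as beta grows.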

AUC-ROC: Threshold-Independent Evaluation

The ROC (Receiver Operating Characteristic) curve plots recall (true positive rate) against the false positive rate at every possible classification threshold. A random classifier produces a diagonal line (AUC = 0.5). A perfect classifier has AUC = 1.0.

AUC-ROC (the area under the ROC curve) measures how well a model distinguishes between classes regardless of the threshold you choose. It is threshold-independent, which is useful when you have not yet decided on a threshold or when thresholds will differ across deployment contexts.
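One way to build intuition for AUC-ROC is its ranking interpretation: it equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as half. The sketch below computes it that way on invented scores. It is a quadratic-time illustration, not how production libraries implement it.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy data (made up): no threshold is chosen anywhere in this computation.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.55, 0.35, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05]

auc = roc_auc(y_true, scores)
print(f"AUC-ROC = {auc:.3f}")
```

Of the 24 positive-negative pairs in this data, 21 are ordered correctly, giving an AUC of 0.875.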

AUC-ROC is also insensitive to the class ratio itself: the true positive rate and the false positive rate are each computed within a single class, so you do not need to balance the dataset before evaluation.

However, AUC-ROC can be misleading for highly imbalanced datasets. When you have 99.9% negative examples, a model with high AUC might still have terrible precision at any useful operating point. In these cases, the Precision-Recall AUC (area under the precision-recall curve) is more informative. It focuses on the minority class performance without being inflated by the large number of true negatives.
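The gap between the two views can be sketched on a synthetic dataset with 0.1% positives. The scores are constructed by hand so that every positive outranks about 95% of the negatives. Average precision is used here as one common summary of the precision-recall curve, and both helper functions are illustrative assumptions rather than library code.

```python
def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, y_true), reverse=True)  # highest score first
    tp = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            tp += 1
            total += tp / rank  # precision among the top `rank` items
    return total / tp if tp else 0.0


# Synthetic data: 9,990 negatives with scores 0..9989 and 10 positives
# whose scores all sit in the top ~5% of the negative score range.
neg_scores = [float(i) for i in range(9990)]
pos_scores = [9500.5 + 50 * k for k in range(10)]
y_true = [0] * len(neg_scores) + [1] * len(pos_scores)
scores = neg_scores + pos_scores

auc = roc_auc(y_true, scores)
ap = average_precision(y_true, scores)
print(f"AUC-ROC = {auc:.3f}, average precision = {ap:.3f}")
```

With these numbers, AUC-ROC comes out above 0.97 while average precision stays below 0.03, because even the top-scored positive is preceded by dozens of negatives.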

MAE vs. RMSE for Regression

For regression problems (predicting continuous values), the common metrics are Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE).

MAE = average of |predicted - actual|. Easy to interpret: on average, a prediction is off by this many units. More robust to outliers than RMSE: a single prediction that is 1000 units off contributes in proportion to its size, not its square.

RMSE = square root of the average of (predicted - actual)^2. Because errors are squared before averaging, large errors are penalized much more than small ones. A single prediction that is 1000 units off has a massive impact on RMSE.
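The difference is easiest to see by adding one outlier to an otherwise well-behaved set of predictions. The numbers below are made up for illustration.

```python
import math


def mae(actual, predicted):
    """Mean absolute error, in the same units as the target."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error; squaring makes large errors dominate."""
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual))


actual = [100, 200, 300, 400, 500]
steady = [110, 190, 310, 390, 510]        # every prediction is off by 10
with_outlier = [110, 190, 310, 390, 1500]  # last prediction is off by 1000

print(f"steady:       MAE={mae(actual, steady):.1f}  RMSE={rmse(actual, steady):.1f}")
print(f"with outlier: MAE={mae(actual, with_outlier):.1f}  RMSE={rmse(actual, with_outlier):.1f}")
```

When all errors are the same size, the two metrics agree. After the outlier is added, MAE rises to 208 while RMSE jumps to roughly 447.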

Choose MAE when: all errors are approximately equally costly, and you want an interpretable metric in the same units as the target.

Choose RMSE when: large errors are especially harmful and you want to penalize them more. In house price prediction, being off by $100,000 is much worse than being off by $10,000 -- not just 10x worse, but potentially catastrophically worse for the buyer or seller. RMSE captures this non-linear cost of large errors.

When to Optimize Which Metric

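The metric you report to stakeholders and the objective you optimize should point in the same direction. Metrics such as recall are not differentiable, so they usually cannot serve as the training loss directly. If you care about recall in production, push training toward it with class weights or a cost-sensitive loss, and tune the classification threshold for recall on a validation set.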

Business context determines the right metric:

Customer churn prediction: Recall (do not miss churning customers who could be retained with outreach)
Credit scoring: Precision + regulatory requirements (cannot discriminate; must be explainable)
Product recommendation: Precision@K (of the top K recommendations shown, how many were relevant?)
Demand forecasting: MAE or MAPE (mean absolute percentage error, useful when you care about relative not absolute error)
Medical diagnosis: Recall for screening, specificity for confirmatory testing
Content moderation: Recall for serious violations, precision for borderline cases
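Two metrics in the list above have not been defined so far, Precision@K and MAPE. A minimal sketch of each follows. The function names, item IDs, and demand figures are all hypothetical.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k


def mape(actual, forecast):
    """Mean absolute percentage error; undefined if any actual value is zero."""
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return 100 * sum(abs(f - a) / abs(a) for a, f in zip(actual, forecast)) / len(actual)


# Ranked recommendations (best first) and the items the user actually liked.
recommended = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "f"}
p_at_3 = precision_at_k(recommended, relevant, 3)

# Actual vs. forecast demand for three periods.
demand_error = mape([100, 200, 400], [110, 180, 400])

print(f"Precision@3 = {p_at_3:.2f}, MAPE = {demand_error:.2f}%")
```

Note that MAPE breaks down when actual values are zero or close to zero, which is worth checking before adopting it for sparse demand data.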

Ask "what is the cost of a false positive relative to a false negative?" before choosing a metric. Quantify these costs if possible. The ratio of costs determines which metric to prioritize.
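Once the costs are quantified, they can drive threshold selection directly. The sketch below picks the candidate threshold with the lowest total cost on a validation set. The cost values, candidate thresholds, and scores are invented for illustration.

```python
def total_cost(y_true, scores, threshold, cost_fp, cost_fn):
    """Total cost of errors when 'positive' means score > threshold."""
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s > threshold)
    fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s <= threshold)
    return fp * cost_fp + fn * cost_fn


def cheapest_threshold(y_true, scores, candidates, cost_fp, cost_fn):
    """Candidate threshold with the lowest total error cost."""
    return min(candidates, key=lambda th: total_cost(y_true, scores, th, cost_fp, cost_fn))


# Toy validation data (made up).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.55, 0.35, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05]
candidates = [0.3, 0.5, 0.65]

# A miss costs 10x a false alarm (e.g. fraud): the low threshold wins.
recall_first = cheapest_threshold(y_true, scores, candidates, cost_fp=1, cost_fn=10)

# A false alarm costs 10x a miss (e.g. spam filtering): the high threshold wins.
precision_first = cheapest_threshold(y_true, scores, candidates, cost_fp=10, cost_fn=1)

print(f"miss-averse threshold: {recall_first}, false-alarm-averse threshold: {precision_first}")
```

The same model and the same scores yield different operating points purely because the cost ratio changed.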

Keep Reading

Machine Learning Complete Guide for Software Developers -- the full ML pipeline including where evaluation fits
Overfitting and Underfitting: How to Fix Them -- good metrics help you diagnose overfitting early
Anomaly Detection Practical Guide -- a domain where precision-recall metrics are essential and accuracy is useless
