ML is not a silver bullet, and the teams that get the most out of it are the ones who understand what it actually does well. Classification, ranking, personalization, anomaly detection — these are the areas where ML reliably delivers value. Natural language automation, churn prediction, recommendation engines — also strong. "Make the product smarter" with no further specification — that is where ML projects fail and relationships between PMs and data scientists deteriorate.
This guide is for product managers and non-ML engineers who want to work with ML effectively without becoming ML engineers themselves.
What ML Actually Does Well
Machine learning excels at pattern recognition tasks where you have enough historical data and the patterns are stable over time. The canonical examples:
Classification — Is this email spam or not? Is this transaction fraudulent? Which category does this support ticket belong to? If you have labeled historical examples, ML will learn to classify new ones.
Ranking — Given a set of results, which one should appear first? Search ranking, feed ranking, recommendation ordering — these are all ranking problems. ML learns what users click on and ranks accordingly.
Personalization — Show different content to different users based on their history. Netflix's recommendation engine, Spotify's Discover Weekly, Amazon's "customers also bought" — all personalization at scale.
Anomaly detection — What does "normal" look like, and is this data point outside normal? Fraud detection, infrastructure monitoring, quality control in manufacturing.
Prediction — Given current signals, what will happen? Churn prediction, demand forecasting, estimated time of delivery.
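To make the anomaly detection item concrete, here is a minimal sketch of the simplest possible version: learn "normal" from history as a mean and standard deviation, then flag anything too far away. This is a statistical baseline rather than a trained model, and the function name, the threshold of 3 standard deviations, and the transaction amounts are all illustrative assumptions.

```python
from statistics import mean, stdev

def find_anomalies(history, new_points, z_threshold=3.0):
    """Flag new points that sit more than z_threshold standard
    deviations away from the historical mean."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # History has no variation: anything different is unusual.
        return [x for x in new_points if x != mu]
    return [x for x in new_points if abs(x - mu) / sigma > z_threshold]

# Hypothetical daily transaction amounts (mean 100, standard deviation 2).
history = [100, 102, 98, 101, 99, 103, 97, 100]
flagged = find_anomalies(history, [101, 250, 96])
print(flagged)  # [250]
```

Note that the sketch only works because the history is stable. If "normal" shifts every week, the same logic produces a stream of false alarms, which is the practical meaning of "patterns are stable over time."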
What ML Cannot Do
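ML cannot solve problems you do not have enough data for. If you are launching a brand new feature, you have no behavioral data and therefore no signal for a model. Start by collecting data, then revisit ML once you have enough of it. A few months of usage (roughly 3-6) is a reasonable rule of thumb, but the right amount depends on your traffic and on how rare the event you want to predict is.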
ML cannot compensate for a broken product. If users abandon your checkout flow because the UI is confusing, a recommendation model will not fix that. Fix the fundamental product problem first.
ML cannot define your success metric for you. If you cannot write down what success looks like in measurable terms before the model is trained, you will not be able to evaluate whether the model is working after deployment.
ML cannot handle truly novel situations. A model trained on pre-2020 data did not know COVID was coming. Models extrapolate from patterns they have seen; they do not reason about unprecedented events.
How to Write an ML Product Spec
A good ML spec answers five questions:
1. What decision is the model making?
"Predict whether a user will churn in the next 30 days" is a decision. "Understand user behavior" is not. Make it specific. What action does the model's output trigger?
2. What data exists to train it?
List the signals you have access to. For churn prediction: last login date, feature usage frequency, support ticket count, billing history, plan tier. If you do not know what data exists, this is the first thing to find out — before scoping the project.
3. What does success look like?
Define the metric before training starts. For a fraud detection model: catch 90% of fraudulent transactions while flagging fewer than 1% of legitimate ones. For a churn model: identify 70% of churners at least 14 days before they cancel. Make it specific and measurable.
4. What is the cost of a wrong prediction?
False positives and false negatives have different costs in different domains. In fraud detection, a false positive (blocking a legitimate transaction) frustrates a customer. A false negative (missing a fraudulent transaction) costs money. The relative cost of each error type shapes how the model is tuned. PMs need to make this call explicitly.
5. How will the model's output integrate into the product?
Will the model's prediction appear as a UI element? Trigger an automated action? Feed into another system? Defining this early prevents engineering surprises late.
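Question 4 is the one most often left implicit, so here is a minimal sketch of how the relative cost of each error type can drive the decision threshold. The scores, labels, candidate thresholds, and cost figures are made up for illustration; in practice the costs are the numbers the PM has to supply.

```python
def total_error_cost(scores, labels, threshold, cost_fp, cost_fn):
    """Cost of all mistakes if every score >= threshold is flagged."""
    cost = 0.0
    for score, is_positive in zip(scores, labels):
        flagged = score >= threshold
        if flagged and not is_positive:
            cost += cost_fp   # false positive: blocked a legitimate case
        elif not flagged and is_positive:
            cost += cost_fn   # false negative: missed a real case
    return cost

def best_threshold(scores, labels, cost_fp, cost_fn, candidates):
    """Pick the candidate threshold with the lowest total error cost."""
    return min(
        candidates,
        key=lambda t: total_error_cost(scores, labels, t, cost_fp, cost_fn),
    )

# Hypothetical model scores and true labels (True = fraud).
scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [True, True, False, True, False, False]
candidates = [0.2, 0.5, 0.7, 0.9]

# Missed fraud is expensive, so a low threshold wins.
fraud_heavy = best_threshold(scores, labels, cost_fp=5, cost_fn=200,
                             candidates=candidates)
# Blocking customers is expensive, so a higher threshold wins.
friction_heavy = best_threshold(scores, labels, cost_fp=200, cost_fn=5,
                                candidates=candidates)
print(fraud_heavy, friction_heavy)  # 0.2 0.7
```

The model and its scores are identical in both cases. Only the business costs changed, and the operating point moved with them.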
Evaluating Model Readiness for Production
When a data scientist says "the model is ready," here is what to verify:
Precision and recall on held-out data. The model should have been evaluated on data it was never trained on. Ask to see the test set metrics, not just training metrics. If only training metrics are available, the model has not been properly evaluated.
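For readers who want to see what those two numbers are, here is a minimal sketch that computes precision and recall from labels and predictions. The label lists are invented; the point is the gap between training metrics and held-out metrics.

```python
def precision_recall(y_true, y_pred):
    """Precision: of everything flagged, how much was right.
    Recall: of everything that should be flagged, how much was caught."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical training data: the model has effectively memorized it.
train_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
train_pred = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Hypothetical held-out data the model never saw during training.
test_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
test_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

train_precision, train_recall = precision_recall(train_true, train_pred)
test_precision, test_recall = precision_recall(test_true, test_pred)
print(train_precision, train_recall)  # 1.0 1.0
print(test_precision, test_recall)    # 0.75 0.75
```

The held-out numbers are the ones that predict production behavior. A large gap between the two rows is a sign of overfitting.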
Comparison to a baseline. What does the model beat? The baseline could be a simple heuristic (always predict the majority class), the previous model version, or a rule-based system. A model that achieves 92% accuracy on a problem where predicting the majority class achieves 91% adds almost nothing. On imbalanced problems like this, accuracy hides the gap, so ask for precision and recall against the baseline as well.
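The 92% versus 91% case is easy to reproduce. The sketch below builds a majority-class baseline and compares it with a hypothetical model on an invented dataset with 91 negatives and 9 positives.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def majority_baseline(y_train, n):
    """Predict the most common training label for every example."""
    majority = Counter(y_train).most_common(1)[0][0]
    return [majority] * n

# Hypothetical imbalanced data: 91 non-churners, 9 churners.
y_train = [0] * 91 + [1] * 9
y_test = [0] * 91 + [1] * 9

# Hypothetical model: right on every non-churner, catches 1 of 9 churners.
model_pred = [0] * 91 + [1] + [0] * 8
baseline_pred = majority_baseline(y_train, len(y_test))

baseline_acc = accuracy(y_test, baseline_pred)
model_acc = accuracy(y_test, model_pred)
churners_caught = sum(1 for t, p in zip(y_test, model_pred) if t and p)
print(baseline_acc, model_acc, churners_caught)  # 0.91 0.92 1
```

The model "wins" on accuracy while catching one churner out of nine, which is why the comparison should be made on the metric from the spec, not on accuracy alone.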
Calibration. If the model outputs a probability (this user has a 78% chance of churning), is that probability actually meaningful? A calibrated model means that of all users assigned 80% churn probability, roughly 80% actually churned. Ask for a calibration plot.
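A calibration plot is a chart of the table this sketch produces: predictions are grouped into probability bins, and each bin's average predicted probability is compared with the rate that was actually observed. The bin count and the example data are assumptions for illustration.

```python
def calibration_table(probs, outcomes, n_bins=5):
    """Return one row per non-empty bin:
    (bin_low, bin_high, count, mean_predicted, observed_rate)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    rows = []
    for i, items in enumerate(bins):
        if not items:
            continue
        mean_pred = sum(p for p, _ in items) / len(items)
        observed = sum(y for _, y in items) / len(items)
        rows.append((i / n_bins, (i + 1) / n_bins, len(items),
                     mean_pred, observed))
    return rows

# Hypothetical predictions: ten users scored 0.1, ten users scored 0.8.
probs = [0.1] * 10 + [0.8] * 10
# Observed outcomes (1 = churned): 1 of the first ten, 8 of the second ten.
outcomes = [1] + [0] * 9 + [1] * 8 + [0] * 2

for low, high, count, mean_pred, observed in calibration_table(probs, outcomes):
    print(f"{low:.1f}-{high:.1f}  n={count}  "
          f"predicted={mean_pred:.2f}  observed={observed:.2f}")
```

In this invented example the predicted and observed columns match, so the probabilities can be taken at face value. When they diverge, a score of 0.8 is only a ranking signal, not an 80% chance.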
Latency. How long does inference take? For real-time serving (recommendations as a user loads a page), latency matters. If the model takes 500ms to run and your p99 page-load budget is 300ms, the model alone blows the budget before anything else on the page has rendered.
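Latency questions are usually answered with percentiles rather than averages. The sketch below times a stand-in prediction function and reports p50 and p99 against a budget. `fake_predict` is a hypothetical placeholder for the real model call, and the 300ms budget is taken from the example above.

```python
import math
import time

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value that at least
    pct percent of the observations are less than or equal to."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def measure_latencies_ms(predict, inputs):
    """Time each call to predict and return latencies in milliseconds."""
    latencies = []
    for x in inputs:
        start = time.perf_counter()
        predict(x)
        latencies.append((time.perf_counter() - start) * 1000)
    return latencies

def fake_predict(x):
    # Hypothetical stand-in; replace with the real model call.
    return x * 2

latencies = measure_latencies_ms(fake_predict, range(1000))
p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)
budget_ms = 300
print(f"p50={p50:.4f}ms p99={p99:.4f}ms within_budget={p99 <= budget_ms}")
```

Measurements taken in a notebook tend to be optimistic. The numbers that matter come from the serving environment, including feature lookups and network hops.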
Feature availability. Before launch, confirm that every feature the model was trained on will be available at inference time. A common failure mode: the model was trained using a feature that requires a database lookup that has not been built into the serving infrastructure. The model works in the notebook and fails in production.
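This check can be as simple as comparing two lists before launch. A minimal sketch, assuming you can get the list of features the model was trained on and the list the serving system can actually provide; all feature names here are hypothetical.

```python
def missing_at_serving(training_features, serving_features):
    """Features the model expects that the serving path cannot supply."""
    return sorted(set(training_features) - set(serving_features))

# Hypothetical feature lists for a churn model.
training_features = [
    "days_since_last_login",
    "support_ticket_count",
    "plan_tier",
    "lifetime_billing_total",  # computed offline from the data warehouse
]
serving_features = [
    "days_since_last_login",
    "support_ticket_count",
    "plan_tier",
]

missing = missing_at_serving(training_features, serving_features)
if missing:
    print("Not ready to launch, missing at inference time:", missing)
```

Running it on these lists reports `lifetime_billing_total` as missing, which is the kind of gap that otherwise surfaces as a production incident.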
Measuring ML Feature Success in Production
The ground truth lag problem is real. For churn prediction, you do not know if your prediction was right until 30 days later when the user either churns or does not. Build your measurement plan around this lag.
What to measure:
The model's direct metric — precision, recall, AUC on live data. Set up a pipeline that compares predictions to outcomes after the appropriate lag period.
The business metric the model was supposed to move — churn rate, revenue per user, click-through rate. If the model metric looks good but the business metric does not move, the model is not solving the right problem.
The counterfactual — what would have happened without the model? A/B test model-on vs model-off, or compare regions where the model is deployed vs regions where it is not.
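The lag has a direct consequence for the measurement pipeline: a prediction can only be scored once its outcome window has closed. A minimal sketch, assuming a 30-day window and hypothetical user IDs and dates; it also assumes `churned_user_ids` contains only users who churned within their own window, which a real pipeline would need to check.

```python
from datetime import date, timedelta

LAG = timedelta(days=30)

def matured_pairs(predictions, churned_user_ids, today):
    """Return (predicted_churn, actually_churned) pairs, but only for
    predictions old enough that the outcome is known."""
    pairs = []
    for user_id, predicted_on, predicted_churn in predictions:
        if today - predicted_on >= LAG:
            pairs.append((predicted_churn, user_id in churned_user_ids))
    return pairs

# Hypothetical prediction log: (user_id, prediction date, predicted churn).
predictions = [
    ("u1", date(2024, 1, 1), True),
    ("u2", date(2024, 1, 1), False),
    ("u3", date(2024, 1, 1), True),
    ("u4", date(2024, 1, 25), True),  # too recent to score yet
]
churned_user_ids = {"u1", "u2"}

pairs = matured_pairs(predictions, churned_user_ids, today=date(2024, 2, 5))
tp = sum(1 for pred, actual in pairs if pred and actual)
fp = sum(1 for pred, actual in pairs if pred and not actual)
fn = sum(1 for pred, actual in pairs if not pred and actual)
live_precision = tp / (tp + fp)
live_recall = tp / (tp + fn)
print(len(pairs), live_precision, live_recall)  # 3 0.5 0.5
```

Scoring `u4` early would count it as a false positive even though the user still has 19 days left to churn, which is how dashboards end up understating precision in the first weeks after launch.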
What PMs Get Wrong With Data Scientists
The most common failure: treating an ML project like a software project. Software has deterministic requirements — "when the user clicks this button, display this dialog." ML has probabilistic outcomes — "this model should catch 80% of fraud." The evaluation criteria are fundamentally different and the feedback loop is much longer.
Second failure: changing the target metric mid-project. If a data scientist has spent six weeks optimizing for precision and you decide recall is more important, you have invalidated six weeks of work. Define the metric before work starts and change it only if the original metric was genuinely wrong.
Third failure: under-investing in data labeling. Good labels are the foundation of supervised learning. If your labels are noisy (annotators disagreed, instructions were ambiguous, edge cases were inconsistently handled), the model ceiling is low regardless of architecture choices. Budget time and money for data quality.
Fourth failure: expecting a model to keep working once it ships. Models degrade as the data they see in production drifts away from the data they were trained on. Plan for ongoing monitoring and periodic retraining. ML is not a ship-once feature; it is an ongoing system.
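One common way to monitor drift is the population stability index (PSI), which compares how a feature was distributed at training time with how it is distributed now. The sketch below is one possible implementation, not the only one. The bin edges and data are invented, and the 0.25 alert level is a widely used rule of thumb rather than a standard.

```python
import math

def psi(expected, actual, edges, eps=1e-6):
    """Population stability index between two samples of one feature.
    edges are the bin boundaries; eps avoids log(0) for empty bins."""
    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            idx = sum(v >= e for e in edges)  # index of the bin v falls in
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Hypothetical feature: logins per week, binned as <3, 3-5, and 6+.
edges = [3, 6]
training_sample = [1, 2, 3, 4, 5, 6, 7, 8]
stable_sample = [2, 1, 4, 3, 5, 8, 7, 6]    # same mix, different order
shifted_sample = [6, 7, 8, 9, 6, 7, 8, 9]   # everyone is now a heavy user

psi_stable = psi(training_sample, stable_sample, edges)
psi_shifted = psi(training_sample, shifted_sample, edges)
print(psi_stable, psi_shifted > 0.25)  # 0.0 True
```

A check like this can run on a schedule for each important input feature, with a retraining review triggered when the index crosses the agreed level.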
The teams that succeed with ML treat it as a discipline that requires its own planning, evaluation criteria, and ongoing maintenance — not as a feature you spec in a ticket and ship on a Friday.