The most common reason ML projects fail is not a bad model — it is bad data. A strong model trained on weak data will underperform a simple model trained on clean, representative data. Data collection and curation is the highest-leverage work in most ML projects, and it is consistently under-resourced.
What Makes Data "Good"
Good training data has four properties:
- Representative: the data distribution should match the distribution the model will encounter in production. If you train a churn prediction model on users from your enterprise tier but deploy it to SMB users, the model will generalize poorly. Distribution shift between training and production is one of the most common sources of post-launch model degradation.
- Labeled accurately: for supervised learning, labels must be correct. With 10% label noise, even a perfect model agrees with the labels only about 90% of the time, so measured accuracy is capped and the true pattern is harder to learn. Clean labels from a small dataset often outperform noisy labels from a large one.
- Sufficient: enough examples per class for the model to learn the pattern, not just memorize examples. More on quantity below.
- Diverse: covers the edge cases, not just the common case. A spam filter trained only on obvious spam will miss subtle spam. Include hard negatives and edge cases deliberately.
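One rough way to check the "representative" property is to compare the mix of a categorical feature in the training set against recent production traffic. The sketch below uses total variation distance as the comparison; that metric, the tier names, and the counts are assumptions chosen for illustration, not something the checklist above prescribes.

```python
from collections import Counter


def category_shares(values):
    """Return the fraction of examples that fall in each category."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def total_variation_distance(train_values, prod_values):
    """Half the sum of absolute differences between category shares.

    0.0 means identical mixes; 1.0 means the two sets share no categories.
    """
    train = category_shares(train_values)
    prod = category_shares(prod_values)
    categories = set(train) | set(prod)
    return 0.5 * sum(abs(train.get(c, 0.0) - prod.get(c, 0.0)) for c in categories)


# Hypothetical account tiers: training data is mostly enterprise,
# production traffic is mostly SMB.
train_tiers = ["enterprise"] * 90 + ["smb"] * 10
prod_tiers = ["enterprise"] * 20 + ["smb"] * 80

shift = total_variation_distance(train_tiers, prod_tiers)
print(f"tier mix shift: {shift:.2f}")
```

A value near zero suggests the training mix resembles production for that feature; a large value is a prompt to collect more data from the underrepresented segment. It says nothing about features you did not check.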
Data Quantity Requirements
There are no universal rules, but these are reliable starting points:
- Simple binary classification with clean tabular features: 1,000 to 10,000 examples per class. Gradient boosted trees work well here.
- Image classification by fine-tuning a pretrained model: 100 to 1,000 examples per class. Transfer learning dramatically reduces data requirements.
- Training an image classifier from scratch: 10,000+ examples per class. With far fewer, the model is likely to overfit.
- NLP classification by fine-tuning a pretrained model (BERT, etc.): 500 to 5,000 examples per class for typical tasks.
- LLM fine-tuning for instruction following: 100 to 1,000 high-quality examples can be sufficient with LoRA. Quality matters more than quantity here.
- Object detection: 1,000 to 10,000 labeled images, depending on the number of classes and how visually complex they are.
These are rough starting points. Always track validation loss as the training set grows: if it is still decreasing when you add data, get more data. If it has plateaued, more data of the same kind is unlikely to help, and the model has probably hit its ceiling with the current architecture and features.
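The "still decreasing or plateaued" check can be sketched as a comparison of validation loss at the two largest training set sizes. The function names, the 2% threshold, and the loss values below are assumptions for illustration; in practice you would average several runs per size before trusting the comparison.

```python
def relative_gain(points):
    """Relative drop in validation loss between the two largest training sizes.

    `points` is a list of (num_training_examples, validation_loss) pairs
    and must contain at least two entries.
    """
    ordered = sorted(points)
    (_, previous_loss), (_, latest_loss) = ordered[-2], ordered[-1]
    return (previous_loss - latest_loss) / previous_loss


def more_data_likely_helps(points, min_gain=0.02):
    """True if the last increase in data still cut validation loss noticeably.

    min_gain is an assumed threshold (2%), not a standard value.
    """
    return relative_gain(points) >= min_gain


# Hypothetical learning curves: (training set size, validation loss).
still_improving = [(1_000, 0.62), (2_000, 0.51), (4_000, 0.43)]
plateaued = [(1_000, 0.62), (2_000, 0.45), (4_000, 0.448)]

print(more_data_likely_helps(still_improving))
print(more_data_likely_helps(plateaued))
```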
Labeling Strategy 1: Human Annotation
Human annotators produce the highest-quality labels when the task is well-specified and annotators are given clear guidelines with examples. This is the gold standard but also the most expensive.
Annotation tools: Label Studio (open source, excellent for images, text, audio, and video), CVAT (open source, strong for video and image annotation), Prodigy (paid, fast annotation with active learning built in).
Guidelines matter enormously. For sentiment analysis: is a review that says "the product works but the packaging was damaged" positive, negative, or neutral? Annotators will disagree without a clear rule. Write guidelines that cover edge cases explicitly and include worked examples.
Inter-annotator agreement: have multiple annotators label the same examples, then compute Cohen's kappa (for two annotators) or Fleiss' kappa (for three or more). Kappa measures agreement beyond what chance alone would produce. A kappa above 0.8 indicates strong agreement and clean labels. Below 0.6, your task definition is probably ambiguous and your labels will be noisy. Fix the task definition before scaling annotation.
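For two annotators, Cohen's kappa is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate and p_e is the agreement expected by chance from each annotator's class frequencies. The following is a minimal sketch of that formula; the function name and the example labels are made up for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items where both annotators match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability both pick the same class if each
    # annotator labeled at random according to their own class frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    if expected == 1.0:
        # Kappa is undefined when both annotators used one identical class;
        # returning 1.0 here is an assumed convention for this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical sentiment labels from two annotators on the same ten reviews.
annotator_1 = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]
annotator_2 = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "pos", "pos", "neg"]

print(round(cohens_kappa(annotator_1, annotator_2), 2))
```

In this example the annotators agree on nine of ten items, but chance agreement is 0.5, so kappa is lower than the raw 90% agreement rate suggests.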
Labeling Strategy 2: Programmatic Labeling
Programmatic labeling (the Snorkel approach) uses labeling functions — heuristics, rules, or weak classifiers — to assign noisy labels to large unlabeled datasets. Each labeling function has different accuracy and coverage. A generative model combines them, learning to weight each function and resolve disagreements.
The output is a dataset with probabilistic labels at the example level. This dataset is noisier than human-annotated data but can be orders of magnitude larger and cheaper to create.
Programmatic labeling works well when you have domain expertise that can be encoded in rules, when you have large amounts of unlabeled data, and when human annotation is prohibitively expensive. It works poorly when the labeling functions are too noisy or when coverage is low (most examples are not labeled by any function).
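To make labeling functions and coverage concrete, here is a small sketch. It combines votes with an unweighted average, which is a deliberate simplification: Snorkel learns a weight for each function instead. The rules, names, and example texts are all hypothetical.

```python
SPAM, HAM, ABSTAIN = 1, 0, -1


# Hypothetical labeling functions for a spam task. Each returns a label,
# or ABSTAIN when its rule does not apply to the text.
def lf_contains_free(text):
    return SPAM if "free" in text.lower() else ABSTAIN


def lf_many_exclamations(text):
    return SPAM if text.count("!") >= 3 else ABSTAIN


def lf_mentions_meeting(text):
    return HAM if "meeting" in text.lower() else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_free, lf_many_exclamations, lf_mentions_meeting]


def weak_label(text):
    """Return an estimated P(spam) from an unweighted vote, or None if no function fired."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return None
    return sum(v == SPAM for v in votes) / len(votes)


def coverage(texts):
    """Fraction of examples labeled by at least one function."""
    return sum(weak_label(t) is not None for t in texts) / len(texts)


texts = [
    "FREE prize!!! Click now",
    "Agenda for the meeting tomorrow",
    "Free slides from the meeting",
    "See you at lunch",
]
for t in texts:
    print(weak_label(t), "|", t)
print("coverage:", coverage(texts))
```

The third text shows a disagreement between functions, and the fourth shows a coverage gap: no function fires, so the example gets no label at all.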
Labeling Strategy 3: Self-Supervised Learning
Self-supervised learning avoids explicit labels entirely by using the structure of the data as a supervision signal.
In language: predict the next word (GPT-style) or predict masked words (BERT-style). The label is the actual word in the text, so no human annotation is needed. This is how language models learn general representations from massive unlabeled corpora.
In images: contrastive learning (SimCLR, MoCo) creates labels by treating different augmentations of the same image as positive pairs and different images as negative pairs.
Self-supervised pretraining followed by supervised fine-tuning on a small labeled dataset is the dominant paradigm in NLP and increasingly in vision. You get the benefits of large-scale training without large-scale labeled data.
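The language case can be illustrated by deriving (input, label) pairs from raw text with no annotator involved. This sketch splits on whitespace and masks a single word for simplicity; real models work on subword tokens and mask a fraction of positions per sequence. The function names are illustrative.

```python
def next_word_pairs(tokens):
    """GPT-style: each prefix is an input, the word that follows is its label."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def masked_word_example(tokens, mask_index, mask_token="[MASK]"):
    """BERT-style: hide one word; the hidden word is the label."""
    masked = list(tokens)
    label = masked[mask_index]
    masked[mask_index] = mask_token
    return masked, label


tokens = "the cat sat on the mat".split()

# Next-word prediction: the text itself supplies every label.
for context, target in next_word_pairs(tokens)[:3]:
    print(context, "->", target)

# Masked-word prediction: hide the word at index 2 and predict it.
masked, label = masked_word_example(tokens, mask_index=2)
print(masked, "->", label)
```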
Labeling Strategy 4: User Feedback Loops
Production systems can generate implicit labels from user behavior. A user clicking on a recommendation is an implicit positive label. A user skipping a recommendation and scrolling further is a weak negative signal. A user explicitly marking something as "not interested" is a strong negative label.
Implicit labels are noisy (a user might not click a good recommendation because they were interrupted) but free and available at scale. They are the foundation of recommendation systems at companies like Netflix, Spotify, and YouTube.
There are two key design decisions: which user actions count as a label, and how to handle position bias (users are more likely to click items shown at the top regardless of quality). Inverse propensity scoring can correct for position bias in logged feedback data by weighting each click by the inverse of the probability that the user examined that position.
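A minimal sketch of inverse propensity scoring for position bias, assuming a simple position-based model in which a click requires the user to examine the slot first. The propensity values and impression logs below are invented for illustration; in a real system the propensities have to be estimated, for example from randomized placements.

```python
def naive_click_rate(impressions):
    """Plain clicks / impressions, ignoring where the item was shown."""
    return sum(clicked for _, clicked in impressions) / len(impressions)


def ips_click_rate(impressions, propensity_by_position):
    """Position-debiased click rate for one item from logged impressions.

    impressions: list of (position, clicked) pairs, clicked is 0 or 1.
    propensity_by_position: assumed probability that a user examines
    the slot at that position.
    """
    weighted_clicks = sum(
        clicked / propensity_by_position[position]
        for position, clicked in impressions
    )
    return weighted_clicks / len(impressions)


# Assumed examination probabilities per slot (hypothetical values).
propensity = {1: 1.0, 2: 0.5, 3: 0.25}

# Item A was always shown in slot 1, item B always in slot 3.
item_a = [(1, 1)] * 4 + [(1, 0)] * 6
item_b = [(3, 1)] * 2 + [(3, 0)] * 8

print("A naive:", naive_click_rate(item_a), "IPS:", ips_click_rate(item_a, propensity))
print("B naive:", naive_click_rate(item_b), "IPS:", ips_click_rate(item_b, propensity))
```

Item B looks worse on raw click rate, but once each click is weighted by how rarely slot 3 is examined, it scores higher than item A. The estimate is only as good as the propensities, and small propensities make it high-variance.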
Data Augmentation
Augmentation generates additional training examples by applying transformations that preserve the label:
- Images: horizontal flips, rotations (small angles), brightness/contrast adjustments, random crops, Gaussian noise. For medical imaging, be careful: a horizontal flip of a chest X-ray mirrors the anatomy (the heart appears on the opposite side) and may invalidate the label.
- Text: back-translation (translate to another language and back), synonym substitution, random deletion of words, random insertion of words. More aggressive augmentations like paraphrasing with a language model are effective but slow.
- Audio: time stretching, pitch shifting, adding background noise, room impulse response convolution.
- Tabular: SMOTE (Synthetic Minority Oversampling Technique) generates synthetic examples for underrepresented classes by interpolating between existing examples. Useful for imbalanced datasets.
Augmentation is especially effective when data is scarce. It teaches the model invariances that improve generalization.
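The interpolation idea behind SMOTE can be sketched in a few lines: pick a minority-class row, pick one of its nearest minority neighbors, and create a new row somewhere on the line between them. This is a simplified illustration with assumed names and data, not a replacement for a maintained implementation.

```python
import math
import random


def nearest_neighbors(point, others, k):
    """Return the k points in `others` closest to `point` (Euclidean distance)."""
    return sorted(others, key=lambda other: math.dist(point, other))[:k]


def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic rows by interpolating toward a near neighbor."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbor = rng.choice(nearest_neighbors(base, others, k))
        t = rng.random()  # position along the segment from base to neighbor
        synthetic.append(tuple(b + t * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Hypothetical minority-class rows with two numeric features.
minority = [(1.0, 2.0), (1.5, 2.5), (2.0, 2.0), (1.2, 1.8)]
new_rows = smote_like_samples(minority, n_new=5)
for row in new_rows:
    print(row)
```

Because every synthetic row lies between two real minority rows, the new examples stay inside the region the minority class already occupies. Interpolation only makes sense for numeric features; categorical columns need different handling.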
The Data Flywheel
The data flywheel is the compounding mechanism by which production ML systems improve themselves. It works like this:
- Deploy a model to production.
- The model serves predictions to users.
- User interactions with those predictions become training signals (clicks, corrections, ratings, downstream outcomes).
- Collect and label those interactions.
- Retrain the model on the expanded dataset.
- The improved model serves better predictions, attracting more users and generating more interactions.
The flywheel creates a defensible competitive moat: the more users you have, the more data you collect, the better your model gets, which attracts more users. This is why ML incumbents in consumer products (Google Search, YouTube, Spotify) are so difficult to displace — their flywheels have been spinning for years.
Designing for the flywheel from the start means thinking about what feedback signals you will collect, how you will store and label them, and how frequently you will retrain.
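As a sketch of what "designing for the flywheel" might look like in code, the example below defines a feedback event record and converts logged events into weighted training rows. The field names, the action-to-label mapping, and the weights are all assumptions for illustration; the actions mirror the click, skip, and "not interested" signals discussed under user feedback loops.

```python
from dataclasses import dataclass

# Assumed mapping from user action to (label, weight).
# The weights are illustrative, not tuned values.
ACTION_TO_LABEL = {
    "click": (1, 1.0),           # implicit positive
    "skip": (0, 0.3),            # weak negative
    "not_interested": (0, 1.0),  # strong explicit negative
}


@dataclass(frozen=True)
class FeedbackEvent:
    user_id: str
    item_id: str
    action: str
    position: int       # slot where the item was shown, kept for debiasing later
    model_version: str  # which model produced the prediction


def to_training_rows(events):
    """Turn logged events into (user_id, item_id, label, weight) rows."""
    rows = []
    for event in events:
        if event.action not in ACTION_TO_LABEL:
            continue  # unknown actions carry no label in this sketch
        label, weight = ACTION_TO_LABEL[event.action]
        rows.append((event.user_id, event.item_id, label, weight))
    return rows


events = [
    FeedbackEvent("u1", "i1", "click", 1, "v3"),
    FeedbackEvent("u1", "i2", "skip", 2, "v3"),
    FeedbackEvent("u2", "i1", "not_interested", 1, "v3"),
    FeedbackEvent("u2", "i9", "hover", 4, "v3"),
]
rows = to_training_rows(events)
for row in rows:
    print(row)
```

Logging the position and model version with every event costs little up front and keeps later options open, such as correcting for position bias or excluding feedback generated by a faulty model release.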
Data Versioning With DVC
As your dataset evolves (new examples added, labels corrected, augmentations changed), you need to version it the same way you version code. DVC (Data Version Control) does this.
DVC stores large data files in remote storage (S3, GCS, Azure Blob) and tracks them in a lightweight metadata file that lives in git. Running dvc add data/ starts tracking the dataset and writes a small data.dvc file that you commit to git, dvc push uploads the data to remote storage, and dvc pull downloads it. You can git checkout any commit and then run dvc pull to restore the exact dataset state at that point in time.
This is essential for reproducibility: given a model checkpoint, you should be able to reconstruct the exact dataset it was trained on and, with the same code, environment, and random seeds, retrain to the same result.
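The underlying idea, a small pointer file in git that identifies a large data file by its content hash, can be shown with a toy example. This is not DVC's implementation or file format; the JSON layout, the function names, and the choice of SHA-256 are assumptions made for the sketch.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_sha256(path):
    """Hash a file's bytes in chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer(data_path, pointer_path):
    """Write a tiny JSON pointer (hash and size) that is cheap to commit to git."""
    pointer = {
        "path": Path(data_path).name,
        "sha256": file_sha256(data_path),
        "size": Path(data_path).stat().st_size,
    }
    Path(pointer_path).write_text(json.dumps(pointer, indent=2))
    return pointer


def matches_pointer(data_path, pointer_path):
    """True if the data file is byte-identical to the version the pointer records."""
    pointer = json.loads(Path(pointer_path).read_text())
    return file_sha256(data_path) == pointer["sha256"]


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "train.csv"
    pointer = Path(tmp) / "train.csv.pointer.json"
    data.write_text("id,label\n1,spam\n2,ham\n")
    write_pointer(data, pointer)
    unchanged = matches_pointer(data, pointer)

    data.write_text("id,label\n1,spam\n2,spam\n")  # a label was corrected
    changed = not matches_pointer(data, pointer)
    print(unchanged, changed)
```

Because the pointer changes whenever the data changes, each git commit pins one exact dataset version, which is what makes checking out an old commit and restoring its data meaningful.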