Labeled data is the bottleneck of supervised machine learning. Model architectures have become commoditized — you can download a state-of-the-art architecture in one line. But labeled data specific to your problem remains expensive and time-consuming to produce. This guide covers the practical methods for labeling data faster and with higher quality.
Choosing a Labeling Tool
The tool you choose affects annotator speed, quality control capabilities, and integration with your training pipeline.
Label Studio (open source) is the most versatile option. It supports text classification, named entity recognition, image classification, object detection, bounding boxes, polygons, audio transcription, video annotation, and more. It runs locally or as a SaaS. The annotation interface is clean and responsive. For most teams, Label Studio is the right starting point — free, full-featured, and actively maintained.
Prodigy is a paid annotation tool from the creators of spaCy. Its primary advantage is tight integration with active learning: it serves examples the model is most uncertain about, which reduces labeling cost dramatically. The annotation speed is very high because of a streamlined interface and keyboard shortcuts. Worth the cost for teams doing NLP annotation at scale.
CVAT (Computer Vision Annotation Tool) is open source and specialized for image and video annotation. It has excellent support for video frame-by-frame annotation, interpolation of bounding boxes between keyframes, and polygon annotation with fine control. The right choice for computer vision tasks, especially video.
Scale AI is an enterprise annotation platform with a workforce of human annotators. You send them data, they return labeled examples. Fast and accurate but expensive — appropriate for companies with annotation budgets in the six figures. Not appropriate for early-stage teams.
Labelbox and Roboflow are managed annotation platforms with labeling workforces, quality management tools, and model-in-the-loop features. Roboflow is particularly strong for object detection datasets with augmentation, versioning, and export to popular training formats.
Writing Annotation Guidelines
Annotation quality is mostly a function of guideline quality, not annotator quality. The same set of annotators will produce wildly different results depending on whether the guidelines are clear or ambiguous.
Good annotation guidelines contain:
The label schema — exactly what labels exist and what they mean. For sentiment classification: Positive means the author expresses satisfaction, approval, or happiness toward the subject. Negative means dissatisfaction, disapproval, or unhappiness. Neutral means the text makes no sentiment evaluation.
Edge case rules — what to do when the example does not fit neatly into a category. For sentiment: if an example has both positive and negative sentiment about different aspects of the product, label it Mixed (if the schema includes Mixed) or use the sentiment toward the primary subject.
Worked examples — 10-20 annotated examples that demonstrate the rules in action. Include examples for each label and for common edge cases. Annotators learn from examples faster than from abstract rules.
A conflict resolution procedure — what annotators should do when they are unsure. Should they skip and flag? Should they annotate their best guess and add a note? Define this explicitly.
Iterate on guidelines before scaling. Annotate 50 examples with 2-3 annotators, compute agreement, review disagreements together, update guidelines, and repeat. Do not scale to 5,000 examples on guidelines that have never been tested for ambiguity.
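The "review disagreements together" step is easier with a list of exactly which pilot examples the annotators split on. The following is a minimal sketch, assuming each annotator's labels are stored as a dictionary from example ID to label; the function name and data layout are illustrative, not part of any labeling tool.

```python
def find_disagreements(labels_by_annotator):
    """Return (example_id, votes) for every shared example with conflicting labels.

    labels_by_annotator: dict mapping annotator name -> {example_id: label}.
    Only examples labeled by every annotator are compared.
    """
    shared_ids = set.intersection(
        *(set(labels) for labels in labels_by_annotator.values())
    )
    conflicts = []
    for example_id in sorted(shared_ids):
        votes = {
            annotator: labels[example_id]
            for annotator, labels in labels_by_annotator.items()
        }
        if len(set(votes.values())) > 1:
            conflicts.append((example_id, votes))
    return conflicts


# Made-up pilot labels for three examples.
pilot = {
    "ann": {1: "Positive", 2: "Negative", 3: "Neutral"},
    "bob": {1: "Positive", 2: "Positive", 3: "Neutral"},
}
conflicts = find_disagreements(pilot)
for example_id, votes in conflicts:
    print(example_id, votes)
```

Each conflict is a candidate for a new edge case rule or worked example in the guidelines.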
Quality Control: Inter-Annotator Agreement
Inter-annotator agreement (IAA) measures how consistently different annotators label the same examples. It is the primary diagnostic for guideline clarity and annotator reliability.
Cohen's kappa is the standard metric for two annotators on a classification task. It corrects for chance agreement: the agreement you would expect if each annotator assigned labels at random according to their own label frequencies. A kappa of 1.0 is perfect agreement, 0 is chance-level agreement, and negative values mean agreement worse than chance.
Interpretation, as a rule of thumb: kappa above 0.8 is strong agreement and indicates clean labels. Kappa of 0.6-0.8 is moderate agreement, acceptable but worth improving. Kappa below 0.6 indicates significant disagreement: your guidelines are ambiguous or your label schema is ill-defined. Fix the guidelines before labeling more data. These cutoffs are conventions rather than hard limits, and kappa tends to run lower when one label dominates the data.
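For reference, Cohen's kappa is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each annotator's label frequencies. The sketch below computes it in plain Python on made-up labels. In practice, scikit-learn's cohen_kappa_score computes the same quantity.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same examples in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of examples")
    n = len(labels_a)

    # Observed agreement: fraction of examples with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of the two annotators' label frequencies, summed over labels.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((freq_a[label] / n) * (freq_b[label] / n) for label in set(freq_a) | set(freq_b))

    if p_e == 1.0:  # both annotators used a single identical label throughout
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Made-up labels: the annotators agree on 8 of 10 examples.
ann = ["pos"] * 5 + ["neg"] * 5
bob = ["pos"] * 4 + ["neg"] + ["neg"] * 4 + ["pos"]
kappa = cohen_kappa(ann, bob)
print(round(kappa, 3))
```

Here raw agreement is 80%, but because each annotator uses both labels half the time, chance agreement is already 50%, so kappa lands well below the raw figure.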
Fleiss' kappa extends Cohen's kappa to three or more annotators.
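A minimal sketch of Fleiss' kappa, assuming every example was labeled by the same number of annotators and that the input is simply a list of label lists (one inner list per example). The function name and data layout are illustrative. The statsmodels package also provides a fleiss_kappa function in statsmodels.stats.inter_rater, which may be preferable in production.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa. ratings: list of per-example label lists, same length for each example."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("Every example needs the same number (>= 2) of ratings")

    label_totals = Counter()
    agreement_sum = 0.0
    for labels in ratings:
        counts = Counter(labels)
        label_totals.update(counts)
        # Per-example agreement: share of annotator pairs that chose the same label.
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )

    p_bar = agreement_sum / n_items
    # Chance agreement from the overall label proportions.
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in label_totals.values())

    if p_e == 1.0:  # only one label was ever used
        return 1.0
    return (p_bar - p_e) / (1 - p_e)


# Made-up ratings: three annotators, four examples.
ratings = [
    ["pos", "pos", "pos"],
    ["neg", "neg", "neg"],
    ["pos", "pos", "neg"],
    ["neg", "neg", "pos"],
]
print(round(fleiss_kappa(ratings), 3))
```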
Gold standard examples are a complementary quality control mechanism. Insert examples with known correct labels (labeled by a domain expert) into the annotation batch without annotators knowing which examples they are. Annotators who consistently get gold standard examples wrong are unreliable and should be removed from the project.
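Scoring annotators against the hidden gold examples can be as simple as the sketch below. It assumes annotations arrive as (annotator, example_id, label) tuples and that gold labels are kept in a separate dictionary. The 0.8 review threshold is a hypothetical value, not a recommendation from any tool.

```python
from collections import defaultdict


def gold_accuracy(annotations, gold):
    """Return {annotator: accuracy} computed only on examples that have a gold label."""
    correct = defaultdict(int)
    seen = defaultdict(int)
    for annotator, example_id, label in annotations:
        if example_id in gold:
            seen[annotator] += 1
            correct[annotator] += int(label == gold[example_id])
    return {annotator: correct[annotator] / seen[annotator] for annotator in seen}


# Made-up data: g1 and g2 are gold examples, x9 is an ordinary example.
gold = {"g1": "pos", "g2": "neg"}
annotations = [
    ("ann", "g1", "pos"),
    ("ann", "g2", "neg"),
    ("bob", "g1", "neg"),
    ("bob", "g2", "neg"),
    ("bob", "x9", "pos"),
]
scores = gold_accuracy(annotations, gold)

REVIEW_THRESHOLD = 0.8  # hypothetical cutoff; tune to the task's difficulty
flagged = sorted(a for a, acc in scores.items() if acc < REVIEW_THRESHOLD)
print(scores, flagged)
```

With only a handful of gold examples per annotator the accuracy estimate is noisy, so it is worth accumulating a reasonable number of gold hits before acting on a low score.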
For high-stakes tasks (medical, legal), require all examples to be annotated by at least two annotators and resolve disagreements explicitly. For lower-stakes tasks, use a random sample (5-10%) for IAA measurement and rely on single-annotator labels for the rest.
Active Learning: Labeling Smarter
Active learning is the practice of selectively labeling examples that provide the most information to the model. Instead of labeling randomly, you label the examples the current model is most uncertain about. This can reduce labeling cost substantially, by as much as 60-80% compared to random sampling in favorable cases, though the savings vary with the task, the model, and the data.
The basic active learning loop:
- Label a small seed dataset (100-500 examples) randomly.
- Train a model on the labeled data.
- Run the model on the unlabeled pool and score each example by uncertainty (lowest prediction confidence, highest entropy, or smallest margin between the top two predicted classes).
- Send the most uncertain examples to annotators.
- Add newly labeled examples to the training set and retrain.
- Repeat from the uncertainty-scoring step.
Prodigy implements this loop out of the box for many task types. For custom implementations, scikit-learn's predict_proba, or a softmax over a neural model's logits, provides the class probabilities from which the uncertainty scores are computed.
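The three uncertainty scores used in the scoring step can be sketched in a few lines. This assumes you already have a list of class-probability vectors for the unlabeled pool (for example, from predict_proba). The function names are illustrative, and each score is oriented so that higher means more uncertain.

```python
import math


def least_confidence(probs):
    """1 minus the top class probability."""
    return 1.0 - max(probs)


def margin_uncertainty(probs):
    """1 minus the gap between the two most likely classes (small gap = uncertain)."""
    top1, top2 = sorted(probs, reverse=True)[:2]
    return 1.0 - (top1 - top2)


def entropy(probs):
    """Shannon entropy of the predicted distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def most_uncertain(pool_probs, k, score=entropy):
    """Indices of the k pool examples with the highest uncertainty score."""
    ranked = sorted(range(len(pool_probs)), key=lambda i: score(pool_probs[i]), reverse=True)
    return ranked[:k]


# Made-up predictions for a pool of three examples (binary task).
pool_probs = [[0.90, 0.10], [0.55, 0.45], [0.70, 0.30]]
to_label = most_uncertain(pool_probs, k=2)
print(to_label)
```

For binary tasks the three scores rank examples identically. They can diverge with three or more classes, where entropy looks at the whole distribution and margin only at the top two.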
Active learning is most effective in the early stages of a project when the labeled dataset is small and any labeled example provides substantial new information. As the labeled dataset grows, the marginal value of active learning decreases.
One caution: purely uncertainty-based sampling can create annotation bias. The model queries examples near its decision boundaries, which may not be representative of the full data distribution. Periodically include random examples in the annotation batch to maintain representativeness.
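One way to act on this caution is to reserve a fixed share of every batch for random picks. The sketch below assumes you have one uncertainty score per pool example. The 20% random share and the function name are hypothetical choices for illustration.

```python
import random


def select_batch(uncertainty, batch_size, random_fraction=0.2, seed=0):
    """Pick pool indices: mostly the most uncertain, plus a random slice of the rest."""
    rng = random.Random(seed)
    n_random = int(batch_size * random_fraction)
    n_uncertain = batch_size - n_random

    # Rank the pool from most to least uncertain.
    ranked = sorted(range(len(uncertainty)), key=lambda i: uncertainty[i], reverse=True)
    chosen = ranked[:n_uncertain]

    # Fill the remainder with uniformly random examples not already chosen.
    remaining = ranked[n_uncertain:]
    chosen += rng.sample(remaining, min(n_random, len(remaining)))
    return chosen


# Made-up scores: example i has uncertainty i, so the top picks are easy to check.
scores = list(range(100))
batch = select_batch(scores, batch_size=10)
print(batch)
```

Tagging the random picks separately also gives you an unbiased sample for estimating model accuracy on the true data distribution.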
Programmatic Labeling With Snorkel
Programmatic labeling (as implemented in Snorkel) uses labeling functions to assign noisy labels to large unlabeled datasets without human annotation.
A labeling function is a Python function that takes a data example and returns a label (or ABSTAIN if it cannot confidently label the example):
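from snorkel.labeling import labeling_function

# Label values; ABSTAIN (-1) means the function declines to vote.
POSITIVE = 1
NEGATIVE = 0
ABSTAIN = -1

@labeling_function()
def lf_contains_excellent(x):
    # Keyword heuristic: "excellent" suggests a positive review.
    return POSITIVE if "excellent" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_contains_terrible(x):
    # Keyword heuristic: "terrible" suggests a negative review.
    return NEGATIVE if "terrible" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_short_review(x):
    # Illustrative heuristic (an assumption): treat reviews under five words
    # as NEGATIVE. Check this against your own data before relying on it.
    return NEGATIVE if len(x.text.split()) < 5 else ABSTAIN

@labeling_function()
def lf_external_model(x):
    # pretrained_sentiment_model is a placeholder for any scorer that
    # returns the probability of positive sentiment in [0, 1].
    score = pretrained_sentiment_model(x.text)
    return POSITIVE if score > 0.7 else (NEGATIVE if score < 0.3 else ABSTAIN)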
Labeling functions can be based on keyword heuristics, regular expressions, external models, knowledge bases, or any other source of weak supervision.
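Snorkel's label model takes the outputs of all labeling functions (the label matrix, with one row per example and one column per function), estimates each function's accuracy and how the functions correlate, and outputs a probabilistic label for each example: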
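from snorkel.labeling import PandasLFApplier
from snorkel.labeling.model import LabelModel
# Build the label matrix; df_train is assumed to be a pandas DataFrame with a "text" column.
lfs = [lf_contains_excellent, lf_contains_terrible, lf_short_review, lf_external_model]
L_train = PandasLFApplier(lfs=lfs).apply(df=df_train)
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train=L_train, n_epochs=500, lr=0.001)
probs = label_model.predict_proba(L=L_train)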
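The probabilistic labels can then be used to train a downstream model. The approach works because labeling functions that are individually noisy can, when combined with weights that reflect their estimated accuracy, produce training labels that approach the quality of hand labels at a fraction of the cost. How close they get depends on the quality and coverage of the functions, so validate on a small hand-labeled set.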
Programmatic labeling works best when you have domain expertise that can be encoded in heuristics, large amounts of unlabeled data, and expensive or slow human annotation. It works poorly when labeling functions have very low coverage (most examples get ABSTAIN from all functions).
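Coverage is easy to check before training the label model. The sketch below assumes the label matrix is a plain list of rows (one row per example, one column per labeling function) with -1 for ABSTAIN, matching the constants above. The function names are illustrative. Snorkel's LFAnalysis utility reports similar per-function statistics.

```python
ABSTAIN = -1


def overall_coverage(label_matrix):
    """Fraction of examples that received at least one non-abstain vote."""
    covered = sum(any(vote != ABSTAIN for vote in row) for row in label_matrix)
    return covered / len(label_matrix)


def per_function_coverage(label_matrix):
    """Fraction of examples each labeling function voted on, in column order."""
    n_examples = len(label_matrix)
    n_functions = len(label_matrix[0])
    return [
        sum(row[j] != ABSTAIN for row in label_matrix) / n_examples
        for j in range(n_functions)
    ]


# Made-up label matrix: 4 examples, 3 labeling functions.
L = [
    [1, -1, -1],
    [-1, -1, -1],
    [-1, 0, -1],
    [1, 0, -1],
]
print(overall_coverage(L), per_function_coverage(L))
```

A function with zero coverage, like the third column here, contributes nothing and is usually a sign of a bug or a heuristic that never fires on this data.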
The Annotation Pipeline in Practice
A practical annotation workflow:
- Define the label schema and write initial guidelines (1-2 days).
- Annotate 50 pilot examples with 2-3 annotators, compute IAA, revise guidelines (1 day).
- Use active learning or programmatic labeling to prioritize which examples to annotate (ongoing).
- Annotate in batches of 200-500 examples, with 10% overlap between annotators for ongoing IAA monitoring.
- Review model performance after each batch and update guidelines if the model is failing on systematic edge cases.
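The batch step with overlap can be sketched as follows: a random 10% of each batch goes to every annotator for IAA monitoring, and the remainder is split round-robin. The function name, the round-robin split, and the fixed seed are assumptions made for illustration.

```python
import random


def assign_with_overlap(example_ids, annotators, overlap=0.1, seed=0):
    """Return ({annotator: [example_ids]}, shared_ids) for one annotation batch."""
    rng = random.Random(seed)
    ids = list(example_ids)
    rng.shuffle(ids)

    # The shared slice is labeled by everyone and used to compute agreement.
    n_shared = round(len(ids) * overlap)
    shared, rest = ids[:n_shared], ids[n_shared:]

    assignment = {annotator: list(shared) for annotator in annotators}
    # The rest is split round-robin so each example is labeled once.
    for i, example_id in enumerate(rest):
        assignment[annotators[i % len(annotators)]].append(example_id)
    return assignment, shared


# A batch of 200 examples split between two annotators.
assignment, shared = assign_with_overlap(range(200), ["ann", "bob"])
print(len(shared), {a: len(ids) for a, ids in assignment.items()})
```

Shuffling each annotator's list before serving it keeps the shared examples from appearing as an obvious block at the start of the queue.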
The annotation pipeline is never "done" — it evolves with the model and the data distribution.