Classification prompts fail when categories are ambiguous, when the model returns verbose labels instead of controlled vocabulary, or when "none of the above" cases are not handled. Fixing these three problems produces classification that is actually usable in automated pipelines.
Define Each Category Explicitly With Examples
The most common classification prompt looks like this: "Classify this customer message as positive, negative, or neutral." This works roughly 70% of the time. The remaining 30% are edge cases that the model handles inconsistently.
The reliable version defines each category explicitly:
Classify the following customer message into exactly one of these categories:
POSITIVE: The customer expresses satisfaction, gratitude, or a compliment about the product or service. Examples: "This is great!", "Really impressed with the support team."
NEGATIVE: The customer expresses dissatisfaction, frustration, or a complaint. Examples: "This is broken", "I've been waiting 3 days with no response."
NEUTRAL: The customer asks a question, makes a factual statement, or the sentiment is genuinely ambiguous. Examples: "When does my subscription renew?", "I ordered item #12345."
Customer message:
[message here]
Respond with only the category label: POSITIVE, NEGATIVE, or NEUTRAL.
The examples in each definition do more work than the definitions themselves. They anchor the model's judgment to your specific use case rather than its general understanding of "positive" and "negative."
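Keeping the definitions and examples in data rather than in a hand-edited string makes them easier to revise as new edge cases turn up. The following is a minimal sketch of that idea; the `Category` class and `build_prompt` function are hypothetical names chosen for illustration, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class Category:
    label: str            # controlled-vocabulary label, e.g. "POSITIVE"
    definition: str       # one-sentence definition of the category
    examples: list[str]   # short example messages that anchor the definition


def build_prompt(categories: list[Category], message: str) -> str:
    """Render definitions, examples, and the message into one classification prompt."""
    lines = ["Classify the following customer message into exactly one of these categories:"]
    for cat in categories:
        quoted = ", ".join(f'"{ex}"' for ex in cat.examples)
        lines.append(f"{cat.label}: {cat.definition} Examples: {quoted}")
    labels = ", ".join(cat.label for cat in categories)
    lines += [
        "Customer message:",
        message,
        f"Respond with only the category label: {labels}.",
    ]
    return "\n".join(lines)
```

Because the final instruction is generated from the same list as the definitions, the allowed labels cannot drift out of sync with the categories the prompt describes.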
Use JSON Output to Prevent Verbose Labels
Without output constraints, the model might return "I would classify this as NEGATIVE because the customer seems frustrated." That is useless in a pipeline expecting a single label.
Two approaches prevent this:
Option 1 - Explicit instruction: "Respond with only the category label. No explanation."
Option 2 - JSON output: Force structured output that is trivially parseable:
Classify the following message. Respond in JSON only:
{"label": "POSITIVE" | "NEGATIVE" | "NEUTRAL"}
Message: [message]
JSON output has an additional benefit: you can extend the schema with confidence and reasoning fields, and code that already reads the label field keeps working:
{
"label": "NEGATIVE",
"confidence": 0.91,
"reasoning": "Customer explicitly states the product is broken and they are frustrated."
}
The reasoning field is useful during development: it shows why the model chose a label, so you can spot prompt failures without re-deriving each classification by hand.
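On the pipeline side, the reply still needs to be validated before it is trusted. The sketch below assumes the three sentiment labels used earlier and a reply that is a single JSON object; `parse_classification` and `ALLOWED_LABELS` are illustrative names, and only Python's standard `json` module is used.

```python
import json

# Assumed controlled vocabulary, taken from the sentiment example above.
ALLOWED_LABELS = {"POSITIVE", "NEGATIVE", "NEUTRAL"}


def parse_classification(raw: str) -> dict:
    """Parse the model's JSON reply and reject labels outside the vocabulary."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on prose replies
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    label = data.get("label")
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    # Optional fields come back as None when the model omits them.
    return {
        "label": label,
        "confidence": data.get("confidence"),
        "reasoning": data.get("reasoning"),
    }
```

A verbose reply such as "I would classify this as NEGATIVE because..." fails at `json.loads`, so it surfaces as an error instead of flowing into downstream data.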
Handle "None of the Above" Cases
Every real-world classification task has inputs that do not fit any defined category. If you do not handle this, the model will force the closest match, which silently corrupts your data.
Add an explicit "OTHER" or "UNKNOWN" category:
Categories:
- BILLING: Questions or issues about invoices, payments, or subscriptions
- TECHNICAL: Bug reports, errors, or feature questions
- GENERAL: Greetings, compliments, or messages with no clear support need
- OTHER: Does not fit any of the above categories
If the message is ambiguous between two categories, choose the most likely one. If it clearly fits none, use OTHER.
The "most likely one" instruction handles borderline cases without defaulting everything ambiguous to OTHER, which would defeat the purpose.
Multi-Label vs Single-Label Classification
Some inputs belong to multiple categories. A customer message might be both a billing complaint and a technical issue. If your task requires multi-label output, make that explicit:
A message may belong to one or more categories. Return all applicable labels as a JSON array.
{"labels": ["BILLING", "TECHNICAL"]}
If no category applies, return: {"labels": ["OTHER"]}
If your task requires exactly one label, say so explicitly: "Return exactly one category. If multiple seem to apply, choose the one that best describes the primary intent of the message." Without this instruction, models often return multi-label output for single-label tasks.
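For the multi-label case, the returned array can be checked in the same spirit. This sketch assumes the four support categories listed earlier. It also treats OTHER as exclusive (never combined with another label), which is an assumption of this example and not something the prompt above enforces. `parse_labels` is a hypothetical helper name.

```python
import json

# Assumed category set, taken from the support-routing example above.
CATEGORIES = ("BILLING", "TECHNICAL", "GENERAL", "OTHER")


def parse_labels(raw: str) -> list[str]:
    """Validate a {"labels": [...]} reply and return the labels without duplicates."""
    data = json.loads(raw)
    labels = data.get("labels") if isinstance(data, dict) else None
    if not isinstance(labels, list) or not labels:
        raise ValueError("expected a non-empty 'labels' array")
    unknown = [label for label in labels if label not in CATEGORIES]
    if unknown:
        raise ValueError(f"unknown labels: {unknown!r}")
    unique = list(dict.fromkeys(labels))  # drop duplicates, keep first-seen order
    # Assumption of this sketch: OTHER means "nothing applies", so it must stand alone.
    if "OTHER" in unique and len(unique) > 1:
        raise ValueError("OTHER cannot be combined with other labels")
    return unique
```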
Confidence Alongside the Label
Confidence scores let you build a human review queue for low-confidence classifications rather than trusting the model on everything:
Classify the message and provide a confidence score from 0.0 to 1.0.
{"label": "NEGATIVE", "confidence": 0.85}
Confidence guide:
- 0.9-1.0: Clear, unambiguous match
- 0.7-0.89: Good match with minor ambiguity
- 0.5-0.69: Uncertain, could fit multiple categories
- Below 0.5: Use "OTHER" instead
Route any output with confidence below 0.7 to human review. This creates a practical hybrid system where the model handles high-confidence cases automatically and humans handle the edge cases.
Note: LLM-generated confidence scores are not calibrated probabilities. They are useful relative signals (0.6 really is less confident than 0.9) but do not treat them as literal probabilities.
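The routing rule can be a few lines of code. The sketch below assumes the result is a dict with a numeric `confidence` field, as in the JSON example above; the `route` function and the `"auto"` / `"human_review"` queue names are hypothetical. A missing or out-of-range score is sent to review, not trusted.

```python
REVIEW_THRESHOLD = 0.7  # cutoff from the text; tune it against your own labeled data


def route(result: dict, threshold: float = REVIEW_THRESHOLD) -> str:
    """Return "auto" for confident results and "human_review" for everything else."""
    confidence = result.get("confidence")
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or not 0.0 <= confidence <= 1.0:
        # Missing or malformed scores are treated as low confidence.
        return "human_review"
    return "auto" if confidence >= threshold else "human_review"
```

For example, `route({"label": "NEGATIVE", "confidence": 0.85})` returns `"auto"`, while the same label with confidence 0.6 goes to `"human_review"`.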
Zero-Shot vs Few-Shot vs Fine-Tuning
The choice between these approaches depends on the complexity of your categories and how much labeled data you have.
Zero-shot (no examples, just definitions) works well when:
- Categories are intuitive and clearly separable
- You have fewer than 10 categories
- The task aligns closely with common LLM training data (sentiment, topic classification)
Few-shot (3-10 examples per category) works better when:
- Categories are domain-specific or non-intuitive
- The model makes consistent errors on a specific category type
- You have a small number of labeled examples
Structure few-shot examples to show edge cases, not just clear-cut examples:
Examples:
Message: "The app keeps crashing on iOS 17"
Label: TECHNICAL
Message: "I was charged twice this month"
Label: BILLING
Message: "Why was I charged twice AND now the app crashes when I try to fix it?"
Label: BILLING (primary intent is the billing issue; technical is secondary)
Message: "Hello, I have a question"
Label: GENERAL (no specific issue stated yet)
The third and fourth examples handle the cases that zero-shot gets wrong.
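If the examples live in a list, adding a new edge case after each observed failure is a one-line change. This is a minimal sketch under that assumption; `build_few_shot_prompt` is an illustrative name, and each example is a (message, label, optional note) tuple in the format shown above.

```python
from typing import Optional

Example = tuple[str, str, Optional[str]]  # (message, label, optional explanatory note)


def build_few_shot_prompt(examples: list[Example], message: str) -> str:
    """Render labeled examples followed by the unlabeled message to classify."""
    lines = ["Examples:"]
    for text, label, note in examples:
        lines.append(f'Message: "{text}"')
        # The parenthetical note explains edge-case decisions to the model.
        lines.append(f"Label: {label} ({note})" if note else f"Label: {label}")
    lines.append(f'Message: "{message}"')
    lines.append("Label:")
    return "\n".join(lines)
```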
Fine-tuning (training on hundreds to thousands of labeled examples) is appropriate when:
- You have 500+ labeled examples
- The task is highly domain-specific and prompt engineering has plateaued
- You need consistent, fast classification at scale with lower per-call costs
- Latency matters and you want a smaller, faster model
For most teams, start with zero-shot, move to few-shot when you identify failure modes, and fine-tune only if you have the labeled data and the performance gap justifies it.
Consistency Across Runs
A practical problem with LLM classification is that the same input can produce different labels across runs, especially at higher temperatures. For classification tasks, set temperature to 0 to maximize determinism:
temperature: 0
Even at temperature 0, some variation can occur due to model updates or floating point differences in distributed inference. For critical pipelines, log both the input and output so you can audit regressions when the model changes.
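One way to make such audits possible is to write one JSON line per classification and compare logs across model versions. The sketch below is an assumption about what such a log might contain; `audit_record`, `changed_labels`, and the field names are hypothetical, and the model call itself is left out.

```python
import hashlib
import json
from datetime import datetime, timezone


def audit_record(message: str, label: str, model: str, temperature: float = 0.0) -> str:
    """Build one JSON line capturing what was classified, by which model, and how."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "temperature": temperature,
        # The hash gives a stable key for matching the same input across runs.
        "input_sha256": hashlib.sha256(message.encode("utf-8")).hexdigest(),
        "input": message,
        "label": label,
    }
    return json.dumps(record, ensure_ascii=False)


def changed_labels(old_lines: list[str], new_lines: list[str]) -> list[str]:
    """Return the input hashes whose label differs between two audit logs."""
    old = {r["input_sha256"]: r["label"] for r in map(json.loads, old_lines)}
    new = {r["input_sha256"]: r["label"] for r in map(json.loads, new_lines)}
    return sorted(h for h in old.keys() & new.keys() if old[h] != new[h])
```

Re-running a fixed set of inputs after a model change and passing both logs to `changed_labels` gives a short list of regressions to inspect.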
Handling Long Inputs
For long inputs (emails, documents), classification accuracy can drop because the model may focus on the wrong section. Two techniques help:
Pre-extraction: First extract the relevant part, then classify. "Extract the main complaint from this email, then classify the complaint type."
Focused window: Instruct the model to focus on a specific part. "Classify based on the first paragraph only, which contains the customer's primary issue."
For very long documents, consider classifying section by section and then aggregating, rather than classifying the whole document at once.
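Section-by-section classification can be sketched as a split, a per-section call, and a vote. The version below makes several assumptions: sections are separated by blank lines, `classify_section` is a hypothetical callable that wraps your model call and returns one label, and the aggregate is a simple majority vote. Other aggregation rules (for example, weighting the first section more heavily) may suit your data better.

```python
from collections import Counter
from typing import Callable


def split_sections(document: str) -> list[str]:
    """Split on blank lines and drop empty sections."""
    return [part.strip() for part in document.split("\n\n") if part.strip()]


def classify_document(document: str, classify_section: Callable[[str], str]) -> str:
    """Classify each section separately, then return the most common label."""
    sections = split_sections(document)
    if not sections:
        raise ValueError("document has no non-empty sections")
    votes = Counter(classify_section(section) for section in sections)
    label, _count = votes.most_common(1)[0]
    return label
```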