Few-shot prompting works because LLMs are exceptional pattern-matchers. Give the model 3-5 examples of the task and it will apply the pattern to new inputs, often far more accurately than with zero-shot prompting. But the wrong examples actively hurt performance. Example selection is a skill, and most practitioners get it wrong by choosing examples that are convenient rather than informative.
The Original Evidence: Brown et al. 2020
The foundational paper on few-shot prompting is "Language Models are Few-Shot Learners" (Brown et al., 2020), which introduced GPT-3 and showed that few-shot prompting narrows the gap with fine-tuned models on many tasks. Lessons from that paper and the follow-up work it prompted that bear on example selection:
- Performance generally improves with more examples up to 8-16, then plateaus or degrades
- Example quality matters more than quantity — 3 good examples outperform 10 mediocre ones
- The position of examples in the context matters (recency bias — later examples have more influence)
- Random selection of examples from a labeled pool performs surprisingly well as a baseline
The 3-5 example range is a practical sweet spot: enough examples to establish a clear pattern without consuming so many tokens that you push your actual input toward the context limit.
The Five Properties of Good Examples
1. Diverse: Examples should cover the full input space, not cluster around a single type. If you are classifying customer support messages and all your examples are billing complaints, the model will be biased toward billing when it sees ambiguous messages. Include at least one example from each major category in your taxonomy.
Weak (all similar):
- "Broken login button" -> TECHNICAL
- "App crashes on startup" -> TECHNICAL
- "Can't access my account" -> TECHNICAL
Strong (diverse):
- "App crashes on startup" -> TECHNICAL
- "I was charged twice" -> BILLING
- "How do I export my data?" -> FEATURE_REQUEST
- "Thanks, this is great!" -> GENERAL
2. Representative: Examples should reflect the actual distribution of inputs you will see in production. If 60% of your production inputs are questions and 40% are complaints, your examples should roughly reflect that ratio. Unrepresentative examples teach the model the wrong prior.
3. Unambiguous: Each example should have an obvious correct answer that any competent human would agree on. If you have to deliberate over what the right label is, the example is too ambiguous to include. Ambiguous examples teach the model inconsistent patterns.
Test for ambiguity: show your examples to a colleague and ask them to label them without seeing your labels. If they disagree with your label, the example is ambiguous — remove it.
4. Relevant: For dynamic example selection, examples should be similar to the specific input at hand. A customer complaint about the mobile app is better served by examples from the mobile app category than by generic classification examples. Relevance increases accuracy on the current input even if it reduces diversity.
The tension between diversity and relevance is real. Resolve it by distinguishing between static and dynamic example sets:
- Static examples (same for all inputs): maximize diversity
- Dynamic examples (selected per input): maximize relevance
5. Correct: This should be obvious, but mislabeled examples in your few-shot set actively harm performance. The model learns from the examples — wrong examples teach wrong patterns. Audit your example set for label errors before using it.
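Some of these properties can be checked mechanically before an example set is used. The following is a minimal sketch, not a complete audit: the function name audit_examples, the "input" and "label" dict keys, and the second_labels argument (a colleague's blind labels, as in the ambiguity test above) are all assumptions made for illustration. It covers diversity, out-of-taxonomy labels, and annotator disagreement; it does not check representativeness.

```python
from collections import Counter


def audit_examples(examples, categories, second_labels=None):
    """Return a list of problems found in a static few-shot example set.

    examples: list of {"input": str, "label": str} dicts (assumed shape).
    categories: every label in the taxonomy.
    second_labels: optional labels assigned blind by a second person,
        in the same order as examples.
    """
    problems = []
    counts = Counter(ex["label"] for ex in examples)

    # Diverse: every category should appear at least once.
    for category in categories:
        if counts[category] == 0:
            problems.append(f"no example for category {category}")

    # Correct: a label outside the taxonomy is certainly an error.
    for ex in examples:
        if ex["label"] not in categories:
            problems.append(f"unknown label {ex['label']} on {ex['input']!r}")

    # Unambiguous: flag examples where the second annotator disagrees.
    if second_labels is not None:
        if len(second_labels) != len(examples):
            raise ValueError("second_labels must match examples in length")
        for ex, other in zip(examples, second_labels):
            if other != ex["label"]:
                problems.append(
                    f"annotators disagree on {ex['input']!r}: "
                    f"{ex['label']} vs {other}"
                )
    return problems


categories = ["TECHNICAL", "BILLING", "FEATURE_REQUEST", "GENERAL"]
weak_set = [
    {"input": "Broken login button", "label": "TECHNICAL"},
    {"input": "App crashes on startup", "label": "TECHNICAL"},
    {"input": "Can't access my account", "label": "TECHNICAL"},
]
for problem in audit_examples(weak_set, categories):
    print(problem)
```

Run on the weak set from property 1, this reports the three categories that have no example. An empty list means only that these particular checks passed, not that the labels are right.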
Dynamic Example Selection
Static examples (the same examples for every input) are easy to implement but suboptimal for diverse input distributions. Dynamic selection picks the most relevant examples for each specific input.
The standard approach: embed your example pool using a text embedding model, embed the current input, and retrieve the k most similar examples by cosine similarity:
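from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

# example_pool is assumed to be a list of dicts with an "input" key, e.g.
# {"input": "App crashes on startup", "label": "TECHNICAL"}
# Pre-compute example embeddings once, not per query
example_texts = [ex["input"] for ex in example_pool]
example_embeddings = model.encode(example_texts)

def select_examples(query: str, k: int = 4) -> list:
    """Return the k pool examples most similar to query, most similar first."""
    query_embedding = model.encode([query])[0]
    # Cosine similarity between the query and every example in the pool
    similarities = np.dot(example_embeddings, query_embedding) / (
        np.linalg.norm(example_embeddings, axis=1) * np.linalg.norm(query_embedding)
    )
    # argsort is ascending: take the last k indices, then reverse them
    top_k_indices = np.argsort(similarities)[-k:][::-1]
    return [example_pool[i] for i in top_k_indices]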
Dynamic selection typically outperforms static selection, especially for tasks with high input variability. The trade-off is latency and infrastructure (you need an embedding model and an example store).
Negative Examples: Showing What NOT to Do
Most practitioners use only positive examples (input -> correct output). Negative examples — showing the model what the wrong answer looks like and why it is wrong — are underused and often effective.
Negative examples are particularly useful when:
- The model consistently makes one specific type of error
- Two categories are easily confused
- The correct output requires not doing something obvious
Example of what NOT to do:
Message: "Can you help me export my data to CSV?"
Wrong label: TECHNICAL (this looks technical but is a feature question, not a bug)
Correct label: FEATURE_REQUEST
The distinction: TECHNICAL is for bug reports and errors. Questions about how to use existing or requested features are FEATURE_REQUEST even if they sound technical.
The explanation of why the wrong label is wrong is what makes this useful. Without the explanation, the negative example just adds noise.
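One way to keep the explanation from being dropped is to build negative examples from a template that requires it. This is a small sketch, assuming a hypothetical helper render_negative_example that mirrors the layout of the example above; the name and signature are illustrative only.

```python
def render_negative_example(message, wrong_label, correct_label, reason):
    """Format a negative example in the Message / Wrong label / Correct label layout.

    Raises ValueError when no reason is given, because a wrong label
    without an explanation only adds noise to the prompt.
    """
    if not reason.strip():
        raise ValueError("a negative example needs an explanation")
    return (
        f'Message: "{message}"\n'
        f"Wrong label: {wrong_label} ({reason})\n"
        f"Correct label: {correct_label}"
    )


print(
    render_negative_example(
        "Can you help me export my data to CSV?",
        "TECHNICAL",
        "FEATURE_REQUEST",
        "this looks technical but is a feature question, not a bug",
    )
)
```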
Ordering Effects: Most Relevant Last
Studies of example ordering have repeatedly found a recency bias in how LLMs use examples: those later in the context influence the output more than earlier ones. This has two practical implications:
Order by relevance ascending: Put the most relevant example last. The last example seen before the actual input has the strongest influence.
Do not end on a confusing example: If one of your examples is borderline or unusual, put it in the middle, not at the end.
For dynamic selection, sort retrieved examples by similarity ascending and put the most similar one last:
selected = select_examples(query, k=4)  # returned most similar first
selected_sorted = selected[::-1]  # reverse: least similar first
# Most similar is now last, closest to the actual input
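If the retrieval step keeps similarity scores next to the examples, the ordering can be made explicit instead of depending on the order the retriever happens to return. The sketch below is self-contained and uses hand-written toy vectors in place of real embeddings; the names cosine and select_most_similar_last are illustrative, not part of any library.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_most_similar_last(query_vec, pool, k=4):
    """Pick the k examples closest to query_vec, ordered least to most similar.

    pool: list of (example, vector) pairs; the vectors stand in for embeddings.
    """
    scored = [(cosine(query_vec, vec), i) for i, (_, vec) in enumerate(pool)]
    top = sorted(scored, reverse=True)[:k]  # k best, most similar first
    top.sort()  # ascending, so the most similar example ends up last
    return [pool[i][0] for _, i in top]


toy_pool = [
    ("A", [1.0, 0.0]),   # same direction as the query
    ("B", [0.0, 1.0]),   # orthogonal to the query
    ("C", [1.0, 1.0]),   # in between
    ("D", [-1.0, 0.0]),  # opposite direction
]
print(select_most_similar_last([1.0, 0.0], toy_pool, k=3))
```

With the toy query [1.0, 0.0], example A is the closest match, so it is placed last, immediately before the real input.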
The Formatting Contract
Examples establish a formatting contract: whatever format the examples use, the model will use. If your examples have labels on their own line, the model will put labels on their own line. If they have labels inline, the model will do the same.
This means your examples must use exactly the format you want in the output. Inconsistent formatting across examples produces inconsistent output formatting.
# Consistent formatting (good):
Input: "App crashes on startup"
Label: TECHNICAL
Input: "I was charged twice"
Label: BILLING
# Inconsistent formatting (bad):
Input: "App crashes on startup" -> TECHNICAL
Input: "I was charged twice"
Output: BILLING
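A simple way to keep the contract consistent is to render every example, and the final input, through one template rather than writing prompts by hand. The following is a sketch under assumptions: render_example and build_prompt are hypothetical names, and the Input/Label layout is the one used in the consistent example above.

```python
def render_example(text, label):
    """Render one labeled example in the Input/Label layout."""
    return f'Input: "{text}"\nLabel: {label}'


def build_prompt(instruction, examples, query):
    """Assemble a few-shot prompt where every block shares one format.

    examples: list of {"input": str, "label": str} dicts (assumed shape).
    The query is rendered with the same template and stops right after
    "Label:" so that the model's completion is only the label.
    """
    blocks = [render_example(ex["input"], ex["label"]) for ex in examples]
    blocks.append(f'Input: "{query}"\nLabel:')
    return instruction + "\n\n" + "\n\n".join(blocks)


examples = [
    {"input": "App crashes on startup", "label": "TECHNICAL"},
    {"input": "I was charged twice", "label": "BILLING"},
]
print(build_prompt("Classify the support message.", examples, "Where is my refund?"))
```

Because the template lives in one function, changing the format later changes every example at once and cannot leave a stray inline label behind.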
When Few-Shot Outperforms Zero-Shot (and When It Does Not)
Few-shot significantly outperforms zero-shot when:
- The task is not a common pattern in the model's training data (specialized domains, unusual formats)
- The output format is non-standard
- The model's zero-shot baseline has consistent failure modes on specific input types
Few-shot provides minimal improvement when:
- The task is well-represented in training data (common sentiment analysis, straightforward classification)
- The model's zero-shot baseline is already very high (above 90%)
- The examples do not cover the failure modes (the wrong examples do not help)
Measure the baseline before adding examples. If zero-shot already achieves 92% accuracy on your test set, few-shot improvements will be marginal. Invest the effort elsewhere.
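The baseline comparison can be as small as the sketch below. It assumes two hypothetical callables, one wrapping a zero-shot prompt and one wrapping a few-shot prompt, each taking a message and returning a label; stubs stand in for the real model calls here, so the numbers printed are illustrative and not measurements.

```python
def accuracy(classify, test_set):
    """Fraction of (text, label) pairs that classify labels correctly."""
    if not test_set:
        raise ValueError("test_set is empty")
    correct = sum(1 for text, label in test_set if classify(text) == label)
    return correct / len(test_set)


def compare(zero_shot, few_shot, test_set):
    """Score both prompting strategies on the same labeled test set."""
    base = accuracy(zero_shot, test_set)
    shot = accuracy(few_shot, test_set)
    return {"zero_shot": base, "few_shot": shot, "gain": shot - base}


test_set = [
    ("App crashes on startup", "TECHNICAL"),
    ("I was charged twice", "BILLING"),
    ("How do I export my data?", "FEATURE_REQUEST"),
    ("Thanks, this is great!", "GENERAL"),
]
gold = dict(test_set)


def zero_shot_stub(text):
    # Stand-in for a real zero-shot model call: mislabels one message.
    if text == "How do I export my data?":
        return "TECHNICAL"
    return gold[text]


def few_shot_stub(text):
    # Stand-in for a real few-shot model call: labels everything correctly.
    return gold[text]


print(compare(zero_shot_stub, few_shot_stub, test_set))
```

In practice the test set should be much larger than four messages and kept separate from the example pool, so that the few-shot prompt is not scored on its own examples.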