Few-Shot Prompting: When It Works, When It Fails, With Real Examples

Few-shot prompting uses 3-5 examples to show the model the pattern you want. When it outperforms fine-tuning, when it fails, and how format sensitivity affects output quality.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 11, 2026

9 min read

// tags

#few-shot-prompting #prompt-engineering #llm #fine-tuning #gpt-3


Few-shot prompting means giving a language model a small number of examples of the input-output pattern you want before presenting your actual input. The model uses those examples to infer what you want and applies the same pattern to your request. It works because LLMs are trained to predict what comes next, which makes them good at continuing patterns: if you show three examples of input X producing output Y, the model infers that, for this conversation, X should produce Y-style output. No weights change; the pattern lives entirely in the prompt.

The key finding from Brown et al.'s GPT-3 paper (Brown et al., "Language Models are Few-Shot Learners," NeurIPS 2020) is that, for a sufficiently large model, few-shot prompting is competitive with fine-tuned models on a number of benchmarks, though not all of them. It does this with no gradient updates, and the behavior can be changed instantly by changing the examples.

Zero-Shot vs. One-Shot vs. Few-Shot: The Actual Output Difference

Let me show the difference concretely, using the same task across all three.

Task: Extract action items from a meeting transcript segment.

Zero-shot prompt:

Extract action items from this meeting transcript.

"John: We need to update the pricing page by Friday. Sarah, can you handle that? Also, we should follow up with the Acme client about their contract renewal. I'll send them an email this week."

Zero-shot output: "The action items are: update the pricing page by Friday (Sarah), follow up with Acme client about contract renewal (John, this week)."

Acceptable, but the format is unconstrained: nothing in the prompt fixes the structure, so it can vary from run to run and is not reliably machine-parseable.

One-shot prompt (one example provided):

Extract action items from meeting transcripts and return them as a list in this format:
- [ ] Owner: [name] | Task: [description] | Due: [deadline or "not specified"]

Example:
Input: "Mike, please send the onboarding docs to new users by end of day tomorrow. I'll schedule the review meeting for next week."
Output:
- [ ] Owner: Mike | Task: Send onboarding docs to new users | Due: End of day tomorrow
- [ ] Owner: [speaker] | Task: Schedule review meeting | Due: Next week

Now extract from:
"John: We need to update the pricing page by Friday. Sarah, can you handle that? Also, we should follow up with the Acme client about their contract renewal. I'll send them an email this week."

One-shot output:

- [ ] Owner: Sarah | Task: Update pricing page | Due: Friday
- [ ] Owner: John | Task: Follow up with Acme client re: contract renewal | Due: This week

The format is now consistent and parseable. The single example defined the output structure without requiring a long prose description.
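To make "parseable" concrete, here is a minimal sketch of reading that checklist format back into records. The regular expression and the function name `parse_action_items` are illustrative assumptions, not part of any library; the sample text is the one-shot output shown above.

```python
import re
from typing import Dict, List

# Assumed pattern for one line of the checklist format requested in the prompt.
ITEM_RE = re.compile(
    r"^- \[ \] Owner: (?P<owner>.+?) \| Task: (?P<task>.+?) \| Due: (?P<due>.+)$"
)

def parse_action_items(model_output: str) -> List[Dict[str, str]]:
    """Return one dict per well-formed action-item line; ignore anything else."""
    items = []
    for line in model_output.splitlines():
        match = ITEM_RE.match(line.strip())
        if match:
            items.append(match.groupdict())
    return items

one_shot_output = """\
- [ ] Owner: Sarah | Task: Update pricing page | Due: Friday
- [ ] Owner: John | Task: Follow up with Acme client re: contract renewal | Due: This week
"""

items = parse_action_items(one_shot_output)
for item in items:
    print(item["owner"], "|", item["task"], "|", item["due"])
```

Lines that do not match are skipped rather than raising, which is one possible design choice; a stricter pipeline might treat any unmatched line as a failed generation and retry.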

Few-shot prompt (three examples provided):

Adding two more examples of varied meeting transcript styles reinforces the pattern. The model is now more likely to handle edge cases such as multiple owners on a single task, tasks with no specified owner, and relative vs. absolute deadlines, provided the examples actually show those cases.

The output quality difference between zero-shot and one-shot here is larger than the difference between one-shot and few-shot. The first example does the heavy lifting.

How Many Examples Is "Few"?

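Brown et al. compared zero-shot, one-shot, and few-shot settings. Their few-shot runs typically used between 10 and 100 examples, as many as fit in GPT-3's 2,048-token context window, and accuracy generally rose with the number of examples, though not on every task. With current instruction-tuned models, a practical rule of thumb is that 3 to 5 examples capture most of the gain on typical tasks, with diminishing returns beyond that. For some tasks, 8 to 10 examples help. Beyond 10, performance often plateaus or can slightly degrade, and every extra example consumes context window space that could be used for the actual task.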

The practical guidance: start with 3 examples. Test with 1, 3, and 5. Pick the minimum that achieves your quality target. More is not always better, and more always costs more tokens.
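The "pick the minimum" step can be sketched as a small selection loop. This assumes you supply an `evaluate(k)` function that builds a prompt with `k` examples, runs it over a held-out set, and returns a score; the scores below are made up purely to exercise the loop and are not measurements.

```python
from typing import Callable, Optional, Sequence

def pick_min_shots(
    candidate_counts: Sequence[int],
    evaluate: Callable[[int], float],
    target: float,
) -> Optional[int]:
    """Return the smallest example count whose score meets the target, else None."""
    for k in sorted(candidate_counts):
        if evaluate(k) >= target:
            return k
    return None

# Hypothetical placeholder scores, NOT real results. A real `evaluate` would
# call your model on a held-out set and compute accuracy for each k.
made_up_scores = {1: 0.81, 3: 0.92, 5: 0.93}

best_k = pick_min_shots([1, 3, 5], made_up_scores.__getitem__, target=0.90)
print(best_k)
```

Returning `None` when no count reaches the target makes the "few-shot is not enough here" case explicit instead of silently picking the largest prompt.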

One important nuance from the research: the quality of examples matters more than the quantity. Three diverse, representative examples beat six redundant ones. Each example should show a distinct case or edge condition.
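One rough way to avoid redundant examples is to pick them greedily by dissimilarity. The sketch below uses word-overlap (Jaccard) similarity as a crude stand-in for "diverse"; that metric, the function names, and the sample pool are all assumptions for illustration, and an embedding-based similarity would be a natural substitute.

```python
from typing import List, Set

def tokens(text: str) -> Set[str]:
    """Lowercased whitespace tokens; deliberately simple."""
    return set(text.lower().split())

def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def pick_diverse(pool: List[str], k: int) -> List[str]:
    """Greedily pick up to k inputs, each one least similar to those already chosen."""
    if not pool or k <= 0:
        return []
    chosen = [pool[0]]
    while len(chosen) < k:
        remaining = [p for p in pool if p not in chosen]
        if not remaining:
            break
        # Score each candidate by its similarity to the closest chosen example,
        # then take the candidate with the lowest such score.
        best = min(
            remaining,
            key=lambda p: max(jaccard(tokens(p), tokens(c)) for c in chosen),
        )
        chosen.append(best)
    return chosen

# Hypothetical candidate inputs; the first two are near-duplicates.
pool = [
    "Sarah will update the pricing page by Friday",
    "Sarah will update the landing page by Friday",
    "Nobody has picked up the contract renewal yet",
]
selected = pick_diverse(pool, 2)
print(selected)
```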

When Few-Shot Beats Fine-Tuning

Fine-tuning trains the model on hundreds or thousands of examples, permanently shifting model weights. Few-shot provides examples in the prompt, temporarily showing the pattern without changing anything.

Few-shot outperforms fine-tuning when:

The task changes frequently. If you need the model to classify emails one week and summarize contracts the next, few-shot lets you switch patterns instantly. Fine-tuning requires a new training run for each new task.

You have limited examples. Fine-tuning typically requires hundreds to thousands of labeled examples to be effective. If you have only 10 to 20 examples, few-shot is usually the only practical option.

Evaluation speed matters. A few-shot prompt change can be tested in minutes. A fine-tuning run takes hours to days.

The task is already within the model's capabilities. If the model can perform the task with good examples in the prompt, fine-tuning adds cost and complexity without meaningful quality gain.

When Fine-Tuning Beats Few-Shot

Domain-specific knowledge. If the model needs to know internal terminology, proprietary processes, or specialized knowledge not in its training data, fine-tuning on that content makes the knowledge part of the model weights rather than requiring it to be re-injected in every prompt.

Style consistency at scale. If you need every response to follow a specific brand voice and tone, fine-tuning on representative examples produces more consistent adherence than few-shot examples, especially as conversations grow longer and the few-shot examples move further back in the context window.

Cost at very high volume. A few-shot prompt with 5 examples might add 200 to 400 tokens per request. At an assumed input price of $5 per million tokens (the figure used here for GPT-4o; check current pricing), 400 extra tokens per call at 1 million calls per day is 400 million tokens, or $2,000 per day in example tokens alone. A fine-tuned model that needs no few-shot examples removes that overhead, although fine-tuned models may be billed at a different per-token rate, so compare the totals.
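The arithmetic behind that figure, as a small helper you can rerun with your own numbers. The $5 per million input tokens is the assumption stated in the text, not a quoted current price, and `daily_example_cost` is a name made up for this sketch.

```python
def daily_example_cost(
    extra_tokens_per_call: int,
    calls_per_day: int,
    usd_per_million_input_tokens: float,
) -> float:
    """Daily spend attributable to the few-shot examples alone."""
    total_tokens = extra_tokens_per_call * calls_per_day
    return total_tokens / 1_000_000 * usd_per_million_input_tokens

# 400 extra tokens per call, 1 million calls per day, assumed $5 per million tokens.
high = daily_example_cost(400, 1_000_000, 5.0)
# Lower end of the 200-to-400-token range, same assumptions.
low = daily_example_cost(200, 1_000_000, 5.0)
print(f"${low:,.0f} to ${high:,.0f} per day")
```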

Very specialized tasks. For highly specialized domains (legal document parsing in specific jurisdictions, medical coding, regulatory compliance in specific industries), fine-tuning on domain-expert-labeled examples can produce accuracy that few-shot prompting cannot match regardless of example count.

Format Sensitivity: How Much Example Format Matters

This is underappreciated. The format of your few-shot examples matters significantly, and inconsistency across examples degrades output quality.

Test: consistent format vs. inconsistent format

Inconsistent examples:

Example 1:
Q: What is 15% of 80?
A: 12

Example 2:
Question: 20% of 150?
Answer: The answer is 30.

Example 3:
Input: Calculate 25 percent of 200.
Output: 25% of 200 = 50

The model sees three different input labels (Q/Question/Input) and three different output formats (bare number, sentence, equation). This inconsistency introduces noise.

Consistent examples:

Q: What is 15% of 80?
A: 12

Q: What is 20% of 150?
A: 30

Q: What is 25% of 200?
A: 50

The consistent format signals clearly: Q: triggers a bare number output. The model applies this pattern reliably.

Rules for example format:

Use identical prefixes for input and output in every example ("Q:" / "A:", "Input:" / "Output:", etc.)
Use identical output format across all examples (all JSON, all prose, all tables)
If your actual task has output variance, show that variance across examples, not formatting variance
Order examples from simple to complex where possible: the model uses the first examples to establish the basic pattern, the later ones to handle edge cases
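A minimal sketch of these rules in code: generating every example from one template guarantees identical prefixes and output format. Sorting by input length is only a rough proxy for "simple to complex", and the final query is a hypothetical input added for illustration.

```python
from typing import List, Tuple

def build_few_shot_prompt(
    examples: List[Tuple[str, str]],
    query: str,
    in_label: str = "Q:",
    out_label: str = "A:",
) -> str:
    """Render all examples with the same labels, then the open-ended real input."""
    # Shorter inputs first, as a crude stand-in for "simple to complex".
    ordered = sorted(examples, key=lambda pair: len(pair[0]))
    blocks = [f"{in_label} {q}\n{out_label} {a}" for q, a in ordered]
    # The real input reuses the same labels and leaves the answer blank.
    blocks.append(f"{in_label} {query}\n{out_label}")
    return "\n\n".join(blocks)

examples = [
    ("What is 15% of 80?", "12"),
    ("What is 20% of 150?", "30"),
    ("What is 25% of 200?", "50"),
]

# Hypothetical query, not from the examples above.
prompt = build_few_shot_prompt(examples, "What is 40% of 60?")
print(prompt)
```

Because the labels are parameters rather than hand-typed strings, a mix like "Q:" in one example and "Question:" in the next cannot creep in.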

Few-Shot with Complex Outputs

For tasks requiring structured outputs (JSON extraction, classified labels with justifications, multi-field outputs), few-shot examples are the most reliable way to define the expected structure.

Example: classifying support tickets

Classify each support ticket with: category, priority (1-5), and requires_human (true/false).

Ticket: "My payment keeps failing, I've tried 4 times and my card works fine elsewhere."
{"category": "billing", "priority": 5, "requires_human": true}

Ticket: "How do I export my data as CSV?"
{"category": "feature_question", "priority": 2, "requires_human": false}

Ticket: "I think there's a bug — the dashboard shows different totals than the report."
{"category": "bug_report", "priority": 3, "requires_human": false}

Ticket: "I need to cancel my subscription immediately and get a refund for this month."

The three examples define: the exact JSON keys and their types, the priority scale calibration (billing failure = 5, feature question = 2), and the logic for requires_human (billing issues yes, factual questions no). The model extracts these rules from the examples without them being stated explicitly.
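Since the model infers the schema rather than being forced into it, it is worth checking each response before using it. The validator below is a sketch under the assumption that the model returns exactly one JSON object with the three keys shown in the examples; the function name and error messages are illustrative.

```python
import json
from typing import Any, Dict

EXPECTED_KEYS = {"category", "priority", "requires_human"}

def validate_ticket_label(raw: str) -> Dict[str, Any]:
    """Parse a model response and check it against the schema the examples imply."""
    data = json.loads(raw)  # raises a ValueError subclass on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if set(data) != EXPECTED_KEYS:
        raise ValueError(f"unexpected keys: {sorted(data)}")
    if not isinstance(data["category"], str):
        raise ValueError("category must be a string")
    priority = data["priority"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(priority, bool) or not isinstance(priority, int):
        raise ValueError("priority must be an integer")
    if not 1 <= priority <= 5:
        raise ValueError("priority must be between 1 and 5")
    if not isinstance(data["requires_human"], bool):
        raise ValueError("requires_human must be true or false")
    return data

label = validate_ticket_label(
    '{"category": "billing", "priority": 5, "requires_human": true}'
)
print(label["category"], label["priority"], label["requires_human"])
```

A failed validation is a useful signal in itself: if it happens often, the examples are probably not pinning down the format tightly enough.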

Common Mistakes With Few-Shot Prompting

Using only positive examples. If all your examples are "happy path" inputs with clean outputs, the model may struggle with edge cases, messy inputs, or error conditions. Include at least one example that shows how to handle an imperfect input.

Choosing unrepresentative examples. Examples that are easier than your actual inputs teach the model the wrong difficulty calibration. Use examples that match the typical complexity of your real inputs.

Separating the examples from the actual input. Models tend to follow examples more reliably when they come immediately before the actual input. Putting a long task description between your examples and the input reduces the examples' influence, so state the instructions first, then the examples, then the input.

Not labeling the actual input clearly. After your few-shot examples, mark the actual input explicitly: "Now classify this:" or repeat the input label ("Q:" etc.) consistently.
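The last two points can be sketched as a fixed prompt layout: instruction first, examples next, and the real input last under the same label the examples use. The function name is hypothetical; the instruction and tickets are taken from the support-ticket example earlier in the article.

```python
import json
from typing import Any, Dict, List, Tuple

Example = Tuple[str, Dict[str, Any]]

def assemble_ticket_prompt(
    instruction: str, examples: List[Example], new_ticket: str
) -> str:
    """Lay out the prompt as: instruction, labeled examples, labeled real input."""
    parts = [instruction]
    for ticket, label in examples:
        parts.append(f'Ticket: "{ticket}"\n{json.dumps(label)}')
    # The real input reuses the "Ticket:" label and directly follows the last example.
    parts.append(f'Ticket: "{new_ticket}"')
    return "\n\n".join(parts)

instruction = (
    "Classify each support ticket with: category, priority (1-5), "
    "and requires_human (true/false)."
)
examples = [
    (
        "How do I export my data as CSV?",
        {"category": "feature_question", "priority": 2, "requires_human": False},
    ),
]
prompt = assemble_ticket_prompt(
    instruction,
    examples,
    "I need to cancel my subscription immediately and get a refund for this month.",
)
print(prompt)
```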

Keep Reading

Prompt Engineering Complete Guide 2026 — Few-shot prompting in context of every other technique, including when to chain it with CoT
Chain of Thought Prompting: 8 Patterns With Real Before-and-After Examples — CoT and few-shot combine powerfully; few-shot CoT is one of the eight patterns covered
How Large Language Models Work: A Complete Guide Without the Math Overload — Why few-shot works at a mechanistic level, explained without requiring an ML background

