Self-consistency prompting runs the same prompt multiple times with temperature above zero, collects the answers, and returns the most common one. The underlying idea is that different reasoning paths may make different errors, but multiple independent paths that reach the same answer are more likely to be correct than any single path. Wang et al. introduced this in "Self-Consistency Improves Chain of Thought Reasoning in Language Models" (ICLR 2023) and reported headline accuracy improvements over standard chain-of-thought prompting of roughly 4 to 18 percentage points on arithmetic and commonsense reasoning benchmarks (for example, +17.9 on GSM8K and +3.9 on ARC-challenge).
The method is simple to implement and requires no model changes or fine-tuning. It works with any model that supports temperature sampling. The tradeoff is cost: 5 samples cost 5x the tokens of a single call.
How It Works
- Write a chain-of-thought prompt for your question
- Set temperature to 0.5 to 0.7 (high enough for variation, low enough for coherent reasoning)
- Generate 3 to 10 responses
- Extract the final answer from each response
- Return the most common answer (majority vote)
For math problems and logical reasoning, "the most common answer" is unambiguous. For factual questions with short answers, it works the same way. For qualitative tasks, you look for consensus on the key points rather than an exact match.
Simple implementation in Python:
    import re
    from collections import Counter

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment


    def extract_final_answer(response: str) -> str:
        # Simplified: treat the last number in the response as the final answer.
        # A classification task would look for the label instead.
        numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
        return numbers[-1] if numbers else response.strip()


    def self_consistent_answer(question: str, n_samples: int = 5) -> str:
        prompt = f"{question}\n\nLet's think step by step."
        responses = []
        for _ in range(n_samples):
            response = client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": prompt}],
                temperature=0.6,
            )
            # content can be None, so fall back to an empty string
            responses.append(response.choices[0].message.content or "")

        # Extract the final answer from each sampled reasoning path
        answers = [extract_final_answer(r) for r in responses]
        # Return the majority-vote answer
        return Counter(answers).most_common(1)[0][0]
The Research Behind It
Wang et al.'s ICLR 2023 paper tested self-consistency across multiple LLMs (UL2, GPT-3, LaMDA, PaLM) and multiple task types. Key findings:
On GSM8K (grade school math word problems), self-consistency with 40 sampled reasoning paths raised PaLM 540B's accuracy from 56.5% (single greedy CoT) to 74.4%, a gain of 17.9 percentage points. Other arithmetic benchmarks improved as well, including SVAMP (+11.0) and AQuA (+12.2).
On ARC (commonsense reasoning), improvements ranged from 3 to 8 percentage points depending on model size.
The paper showed that returns diminish as samples are added. Most of the gain comes from the first 5 or so samples, and going beyond 10 to 20 provides minimal additional improvement. For practical applications, 5 samples capture roughly 80% of the maximum benefit at 5x (not 40x) the cost.
The mechanism: different reasoning paths make different errors. A model might misread "3 dozen" as 30 in one run and correctly parse it as 36 in four other runs. Because four of the five runs agree on 36, majority vote returns the correct interpretation and discards the outlier.
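A toy model makes the diminishing returns concrete. The sketch below is not from the paper. It assumes each sample is correct independently with the same probability and that all wrong samples agree on one wrong answer, which is the worst case for voting. The function name `majority_correct_probability` is made up for this illustration.

```python
from math import comb


def majority_correct_probability(p_correct: float, n_samples: int) -> float:
    """Probability that a strict majority of n samples is correct.

    Assumptions (illustrative only): samples are independent, each is
    correct with probability p_correct, and every wrong sample gives the
    same wrong answer.
    """
    if n_samples < 1 or n_samples % 2 == 0:
        raise ValueError("use an odd, positive n_samples so there are no ties")
    needed = n_samples // 2 + 1  # smallest strict majority
    return sum(
        comb(n_samples, k) * p_correct**k * (1 - p_correct) ** (n_samples - k)
        for k in range(needed, n_samples + 1)
    )


if __name__ == "__main__":
    # Print how the vote's accuracy changes as samples are added.
    for n in (1, 5, 11, 21, 41):
        print(n, round(majority_correct_probability(0.6, n), 3))
```

With a per-sample accuracy of 0.6, five samples lift the voted accuracy to about 0.68 under these assumptions. Real samples from one model are correlated, so actual gains will differ. The same formula also shows that when per-sample accuracy is below 0.5 in this two-answer setting, voting makes the result worse rather than better.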
When Self-Consistency Helps Most
Self-consistency produces the largest gains on tasks with definitive correct answers and multiple viable reasoning paths. This means:
Arithmetic and math word problems. Different runs may perform operations in different orders or make different arithmetic errors. Majority vote filters most individual mistakes.
Logical reasoning. Multi-step deductions where intermediate errors compound. Self-consistency catches cases where one run takes a wrong branch early.
Factual questions with short answers. "What year did X happen?" or "Who wrote Y?" — the model may confabulate in 1 or 2 out of 5 runs, but if 4 runs agree on the correct answer, majority vote surfaces it.
Classification with nuanced cases. For ambiguous inputs where the model is uncertain, self-consistency reveals that uncertainty (answers split 3-2 or even 2-2-1) so you can flag them for human review.
When Self-Consistency Does Not Help
Creative writing and open-ended generation. There is no "majority vote" for the best version of a poem. All samples are valid; picking the most common is arbitrary.
Tasks where the model is consistently wrong. If the model misunderstands the question in all 5 runs, majority vote returns the consistent wrong answer with high confidence. Self-consistency amplifies consensus, not correctness.
Tasks where all runs already agree. If the model has high confidence and all 5 samples give the same answer as a single temperature-0 call, you paid 5x the tokens for no improvement. Check whether temperature sampling actually produces variation before committing to self-consistency for a task.
Time-sensitive applications. Five samples run in parallel at 500 ms each finish in about the same wall time as one sample, but five serial samples take 2.5 seconds. If latency matters, use parallel calls or accept single-sample answers.
Practical Implementation Notes
Run samples in parallel, not serial. Most API providers support concurrent requests. 5 parallel calls take about the same wall time as 1 call. Only cost, not latency, increases.
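One way to run the samples concurrently is a thread pool, sketched below. `sample_fn` is a hypothetical wrapper around a single model call (for example, one chat completion request at temperature 0.6) that returns the response text. The demo replaces it with a stub that sleeps, so the timing effect is visible without an API key.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def sample_in_parallel(
    sample_fn: Callable[[str], str], prompt: str, n_samples: int = 5
) -> list[str]:
    """Call sample_fn(prompt) n_samples times concurrently.

    sample_fn is an assumed wrapper around one model call; it is not part
    of any library.
    """
    with ThreadPoolExecutor(max_workers=n_samples) as pool:
        futures = [pool.submit(sample_fn, prompt) for _ in range(n_samples)]
        # result() re-raises any exception raised inside a worker thread
        return [future.result() for future in futures]


if __name__ == "__main__":
    def fake_model(prompt: str) -> str:
        time.sleep(0.2)  # stand-in for network latency
        return "The answer is 36."

    start = time.perf_counter()
    responses = sample_in_parallel(fake_model, "How many eggs are in 3 dozen?")
    elapsed = time.perf_counter() - start
    print(f"{len(responses)} responses in {elapsed:.2f}s")
```

Threads suit this workload because each call spends nearly all of its time waiting on the network. Provider rate limits still apply, so a large `n_samples` may need a smaller `max_workers` or retry handling.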
Extract answers consistently. The hardest part of self-consistency is answer extraction. For math, you might look for the last number followed by a period. For classification, you look for the label word. For longer answers, you need a semantic similarity approach. A mismatch in extraction (one run says "34 apples" and another says "34") breaks the vote if you do string matching rather than numeric comparison.
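For numeric answers, the "34 apples" versus "34" mismatch can be avoided by voting on parsed numbers rather than raw strings. The following is a minimal sketch, assuming the final answer is the last number in each response. The helper names `extract_numeric_answer` and `majority_numeric_answer` are illustrative, not from any library.

```python
import re
from collections import Counter
from decimal import Decimal
from typing import Optional

# Digits with optional thousands separators and an optional decimal part
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_numeric_answer(response: str) -> Optional[Decimal]:
    """Return the last number in the response as a Decimal, or None."""
    matches = _NUMBER.findall(response)
    if not matches:
        return None
    return Decimal(matches[-1].replace(",", ""))


def majority_numeric_answer(responses: list[str]) -> Optional[Decimal]:
    """Vote on parsed numbers; responses without a number are skipped."""
    answers = [
        answer
        for answer in map(extract_numeric_answer, responses)
        if answer is not None
    ]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    samples = [
        "So she has 34 apples.",
        "Answer: 34",
        "The total is 34.0",
        "I get 30.",
    ]
    print(majority_numeric_answer(samples))
```

`Decimal` values that are numerically equal compare and hash as equal, so "34" and "34.0" land in the same vote bucket. Taking the last number is a heuristic: it breaks when a response ends with a restated quantity from the question, so asking the model for a fixed final line such as "Answer: <number>" makes extraction more reliable.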
Use self-consistency selectively. Run your prompt once at temperature 0 first. If the answer is high-confidence and correct on inspection, stop there. Use self-consistency when single-sample results are inconsistent across runs or when the stakes of an error are high.
When to flag for human review. If at least 4 out of 5 samples agree, proceed automatically. If the samples split 3-2 or 2-2-1, flag for review. A close split indicates genuine model uncertainty that majority vote cannot reliably resolve.
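The review rule can be expressed as a small helper that returns the winning answer together with its vote share. This is a sketch: `VoteResult`, `vote_with_review_flag`, and the `min_agreement` threshold are assumptions made for illustration. The default of 0.8 sends 3-2 and 2-2-1 splits to review, and lowering it to 0.6 accepts a 3-of-5 majority.

```python
from collections import Counter
from typing import NamedTuple


class VoteResult(NamedTuple):
    answer: str
    agreement: float  # share of samples that gave the winning answer
    needs_review: bool


def vote_with_review_flag(
    answers: list[str], min_agreement: float = 0.8
) -> VoteResult:
    """Majority vote that also reports how strong the majority was.

    min_agreement is an assumed, tunable threshold, not a published value.
    """
    if not answers:
        raise ValueError("answers must not be empty")
    winner, count = Counter(answers).most_common(1)[0]
    agreement = count / len(answers)
    return VoteResult(winner, agreement, agreement < min_agreement)


if __name__ == "__main__":
    print(vote_with_review_flag(["36", "36", "36", "36", "30"]))
    print(vote_with_review_flag(["36", "36", "30", "30", "12"]))
```

The inputs are assumed to be already-normalized answers, so that equivalent answers written differently are not counted as disagreement.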
Cost Calculation
At GPT-4o pricing ($2.50 per million input tokens, $10 per million output tokens as of early 2026), a typical reasoning question might use 500 input tokens and 300 output tokens.
Single sample cost: roughly $0.00425 per call. 5 samples: roughly $0.021 per question. 10 samples: roughly $0.043 per question.
At 10,000 questions per day, 5-sample self-consistency costs roughly $212.50/day vs. $42.50/day for single samples. For a financial calculator or medical diagnostic tool where the cost of an error is high, this is an easy tradeoff. For a chatbot answering general questions, it is not.
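The arithmetic above is easy to re-run for other token counts or prices. The sketch below uses the prices and token counts quoted in this section as defaults and assumes every sample is billed as a separate full call (input and output tokens both repeated). The function name `cost_per_question` is illustrative.

```python
def cost_per_question(
    input_tokens: int,
    output_tokens: int,
    n_samples: int = 1,
    input_price_per_million: float = 2.50,
    output_price_per_million: float = 10.00,
) -> float:
    """Dollar cost of answering one question with n_samples separate calls.

    Default prices are the figures quoted in the text; check current
    pricing before relying on them.
    """
    per_call = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    ) / 1_000_000
    return per_call * n_samples


if __name__ == "__main__":
    questions_per_day = 10_000
    for n in (1, 5, 10):
        per_question = cost_per_question(500, 300, n_samples=n)
        per_day = per_question * questions_per_day
        print(f"{n} samples: ${per_question:.5f} per question, ${per_day:.2f} per day")
```

Output tokens dominate here ($0.003 of the $0.00425 per call), so limiting the length of the reasoning trace reduces cost more than shortening the prompt does.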