The Arithmetic Problem
In 2022, large language models were already impressive at many tasks, but multi-step math was a persistent weakness. A model could explain calculus yet fail a simple word problem that required several steps. The issue was not missing knowledge but the reasoning process: standard prompting asked for the answer directly, giving the model no opportunity to work through intermediate steps.
The Chain-of-Thought Discovery
The Wei et al. paper (arXiv:2201.11903) demonstrated a simple fix: include worked examples that show the reasoning process, not just the final answer. Instead of:
Q: Roger has 5 tennis balls. He buys 2 more cans of 3 balls each. How many does he have? A: 11
Use:
Q: Roger has 5 tennis balls. He buys 2 more cans of 3 balls each. How many does he have? A: Roger starts with 5 balls. 2 cans × 3 balls = 6 balls. 5 + 6 = 11. The answer is 11.
Seeing worked examples leads the model to write out its own intermediate steps before committing to an answer, and accuracy on multi-step problems improved dramatically.
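The prompt format above can be assembled programmatically. The following is a minimal sketch, not the paper's code: the `Exemplar` class, the `build_cot_prompt` function, and the baker question are hypothetical names and data introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exemplar:
    """One worked example: a question, its written-out reasoning, and the answer."""
    question: str
    reasoning: str
    answer: str


def build_cot_prompt(exemplars: list[Exemplar], question: str) -> str:
    """Format worked examples, then the new question with its answer left open."""
    blocks = [
        f"Q: {ex.question}\nA: {ex.reasoning} The answer is {ex.answer}."
        for ex in exemplars
    ]
    # The trailing "A:" invites the model to continue with its own reasoning.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


roger = Exemplar(
    question="Roger has 5 tennis balls. He buys 2 more cans of 3 balls each. How many does he have?",
    reasoning="Roger starts with 5 balls. 2 cans × 3 balls = 6 balls. 5 + 6 = 11.",
    answer="11",
)
# Illustrative new question (not from the paper).
prompt = build_cot_prompt([roger], "A baker has 3 trays of 12 rolls and sells 20. How many are left?")
print(prompt)
```

Ending every exemplar with the same closing phrase ("The answer is ...") also gives downstream code a predictable place to look for the final answer.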
Zero-Shot CoT: "Think Step by Step"
Kojima et al. (arXiv:2205.11916) showed that you do not even need few-shot examples. Simply appending "Let's think step by step" to the prompt elicits chain-of-thought reasoning from sufficiently large models. A plausible explanation is that LLMs have seen step-by-step problem-solving text in training and reproduce that style when cued.
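Zero-shot CoT is usually run as two passes: one to elicit the reasoning, and a second that feeds the reasoning back and asks only for the final answer. The sketch below assumes a hypothetical `complete` callable that maps a prompt string to a model continuation; `zero_shot_cot` and `fake_complete` are illustrative names, and the answer cue follows the style of wording used for arithmetic tasks rather than a fixed requirement.

```python
from typing import Callable

REASONING_CUE = "Let's think step by step."
ANSWER_CUE = "Therefore, the answer (arabic numerals) is"


def zero_shot_cot(question: str, complete: Callable[[str], str]) -> tuple[str, str]:
    """Return (reasoning, answer) using a reasoning pass and an answer-extraction pass."""
    # Pass 1: the cue makes the model continue with step-by-step reasoning.
    reasoning_prompt = f"Q: {question}\nA: {REASONING_CUE}"
    reasoning = complete(reasoning_prompt).strip()
    # Pass 2: append the reasoning and ask for just the final answer.
    answer_prompt = f"{reasoning_prompt} {reasoning}\n{ANSWER_CUE}"
    answer = complete(answer_prompt).strip()
    return reasoning, answer


def fake_complete(prompt: str) -> str:
    """Canned stand-in for a real model call so the sketch runs offline."""
    if prompt.endswith(ANSWER_CUE):
        return " 11."
    return " Roger starts with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11."


reasoning, answer = zero_shot_cot(
    "Roger has 5 tennis balls. He buys 2 more cans of 3 balls each. How many does he have?",
    fake_complete,
)
print(reasoning)
print(answer)
```

Passing the model in as a callable keeps the prompting logic independent of any particular API client; in practice `complete` would wrap a real model call.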
Emergence at Scale
The critical finding was that CoT behaves like an emergent ability: in the paper's experiments it helped mainly in models of roughly 100B parameters or more. Smaller models often did no better, and sometimes worse, with CoT prompting because they generated plausible-sounding but incorrect reasoning chains. This became one of the most widely cited examples of a capability appearing only at scale.
Benchmark Improvements
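On GSM8K (grade school math word problems), the 17.9% figure for standard prompting belongs to PaLM 540B rather than GPT-3 175B. With 8-shot chain-of-thought prompting, the same PaLM model solved roughly 57 to 58% of problems (the exact figure varies slightly between versions of the paper). GPT-3 175B showed a comparable jump from a lower baseline. On AQuA-RAT (algebraic word problems) and SVAMP (math word problems with linguistic variations), CoT also improved results, though the size of the gain varied by benchmark and model.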
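Percentages like these come from comparing the final number in each model output with the reference answer. The following is a rough sketch of such a scorer, not the official evaluation script; it assumes reference solutions end with a `#### <number>` line, as GSM8K's do, and the function names and sample data are made up for illustration.

```python
import re
from typing import Optional

# A number with optional sign, thousands separators, and decimal part.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def gold_answer(solution: str) -> str:
    """Take the number after the '####' marker in a GSM8K-style reference solution."""
    return solution.split("####")[-1].strip().replace(",", "")


def predicted_answer(completion: str) -> Optional[str]:
    """Treat the last number in the model output as its final answer."""
    numbers = NUMBER.findall(completion)
    return numbers[-1].replace(",", "") if numbers else None


def accuracy(completions: list[str], solutions: list[str]) -> float:
    """Fraction of completions whose final number equals the reference answer."""
    correct = 0
    for completion, solution in zip(completions, solutions):
        pred = predicted_answer(completion)
        if pred is not None and float(pred) == float(gold_answer(solution)):
            correct += 1
    return correct / len(solutions)


# Made-up sample data: two correct chains and one wrong one.
solutions = [
    "5 + 6 = 11\n#### 11",
    "3 * 12 = 36\n36 - 20 = 16\n#### 16",
    "4 * 300 = 1,200\n#### 1,200",
]
completions = [
    "Roger starts with 5 balls. 2 cans × 3 balls = 6 balls. 5 + 6 = 11. The answer is 11.",
    "3 trays of 12 is 36. 36 - 20 = 16. The answer is 16.",
    "The answer is 1,300.",
]
score = accuracy(completions, solutions)
print(f"{score:.1%}")
```

Taking the last number is a heuristic: it works when the chain ends with its answer, and it miscounts chains that keep talking after stating it.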
Self-Consistency Decoding
A follow-up paper by Wang et al. (arXiv:2203.11171) showed that sampling many reasoning chains (40 in the paper's experiments) and taking the majority vote on the final answer improves accuracy further. Different chains may take different paths but often reach the same correct answer, while incorrect chains are more likely to disagree with each other. A minimal version using the OpenAI chat API:
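import re
from collections import Counter

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def chain_of_thought(problem: str, n_samples: int = 10) -> str:
    # Zero-shot CoT cue plus a fixed closing phrase so the answer can be parsed.
    prompt = f"{problem}\nLet's think step by step, then end with 'The answer is <number>.'"
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        n=n_samples,       # sample several independent reasoning chains
        temperature=0.7,   # nonzero temperature so the chains differ
    )
    # Extract the final number from each chain, skipping chains without one.
    answers = []
    for choice in response.choices:
        text = choice.message.content or ""
        found = re.findall(r"answer is\s*\$?(-?[\d,]*\.?\d+)", text, re.IGNORECASE)
        if found:
            answers.append(found[-1].replace(",", ""))
    if not answers:
        raise ValueError("no parsable answer in any sampled chain")
    # Majority vote over the parsed answers (self-consistency).
    return Counter(answers).most_common(1)[0][0]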
Modern Implementation
CoT is now standard in production prompts for reasoning tasks. OpenAI's o1/o3 models internalize chain-of-thought as part of their training rather than relying on prompting. Anthropic builds extended thinking into Claude using similar principles.