Chain-of-Thought Prompting: The Google Paper That Made LLMs Better at Math

Wei et al. 2022 showed that prompting LLMs to show their reasoning steps - chain-of-thought - dramatically improves performance on arithmetic and logical reasoning tasks.

Mahmudul Haque Qudrati

CEO & ML Engineer

March 16, 2026

9 min read

// tags

#chain-of-thought #prompting #reasoning #few-shot #google


Zero-Shot CoT: "Think Step by Step"

Kojima et al. showed that you do not even need few-shot examples. Simply appending "Let's think step by step" to the prompt elicits chain-of-thought reasoning from sufficiently large models. A plausible explanation is that LLMs have seen worked problem-solving text during training and reproduce that style when cued.
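As a rough sketch of how this looks in code: Kojima et al. describe a two-stage procedure, first eliciting the reasoning with the trigger phrase and then asking for the final answer in a second call. The `complete` callable below is a hypothetical stand-in for whatever LLM client you use, and the helper names and the `fake_llm` stub are illustrative assumptions rather than anything from the paper.

```python
from typing import Callable, Tuple

REASONING_TRIGGER = "Let's think step by step."
# Answer-extraction cue in the style Kojima et al. use for numeric tasks
ANSWER_TRIGGER = "Therefore, the answer (arabic numerals) is"


def zero_shot_cot(question: str, complete: Callable[[str], str]) -> Tuple[str, str]:
    """Return (reasoning, answer) using two-stage zero-shot CoT.

    `complete` is a hypothetical function that maps a prompt to model text.
    """
    # Stage 1: elicit the reasoning chain
    reasoning_prompt = f"Q: {question}\nA: {REASONING_TRIGGER}"
    reasoning = complete(reasoning_prompt).strip()

    # Stage 2: feed the reasoning back and ask only for the final answer
    answer_prompt = f"{reasoning_prompt} {reasoning}\n{ANSWER_TRIGGER}"
    answer = complete(answer_prompt).strip()
    return reasoning, answer


def fake_llm(prompt: str) -> str:
    """Stub model so the sketch runs offline; swap in a real client call."""
    if prompt.endswith(ANSWER_TRIGGER):
        return " 8."
    return "There are 16 balls. Half are golf balls, so 8 golf balls."


reasoning, answer = zero_shot_cot(
    "There are 16 balls and half are golf balls. How many golf balls are there?",
    fake_llm,
)
```

Splitting the call in two keeps the free-form reasoning separate from the short answer, which makes the answer easier to parse than searching the reasoning text for a number.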

Emergence at Scale

The critical finding was that CoT is an emergent ability of model scale - in Wei et al.'s experiments it only helped models of roughly 100B parameters or more. Smaller models often did worse with CoT than with standard prompting, because they produced fluent but illogical reasoning chains. This was one of the early, widely cited examples of a capability emerging at scale.

Benchmark Improvements

On GSM8K (grade school math word problems), Wei et al. report that PaLM 540B with standard prompting solved 17.9% of problems. With 8-shot chain-of-thought prompting, the same model solved about 57%. On AQuA (algebraic word problems) and SVAMP (math word problems with varied structure), chain-of-thought prompting also improved accuracy for the largest models.
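To make the few-shot setup concrete, here is a minimal sketch of how such a prompt can be assembled and how the final answer can be parsed for scoring. The single exemplar is the tennis-ball example shown in Wei et al.; the helper names `build_cot_prompt` and `extract_answer` are hypothetical, and the paper's full set of eight exemplars is not reproduced here.

```python
import re
from typing import List, Optional, Tuple

# (question, worked reasoning ending in "The answer is X.") pairs.
# Only one exemplar is listed; the 8-shot setting would list eight.
EXEMPLARS: List[Tuple[str, str]] = [
    (
        "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does he have now?",
        "Roger started with 5 balls. 2 cans of 3 tennis balls each is "
        "6 tennis balls. 5 + 6 = 11. The answer is 11.",
    ),
]


def build_cot_prompt(question: str, exemplars: List[Tuple[str, str]] = EXEMPLARS) -> str:
    """Prepend worked exemplars so the model imitates the reasoning format."""
    shots = [f"Q: {q}\nA: {a}" for q, a in exemplars]
    shots.append(f"Q: {question}\nA:")  # left open for the model to complete
    return "\n\n".join(shots)


def extract_answer(completion: str) -> Optional[str]:
    """Return the number after the last 'The answer is', or None if absent."""
    matches = re.findall(r"The answer is\s*\$?(-?[\d,]*\.?\d+)", completion)
    return matches[-1].replace(",", "") if matches else None


prompt = build_cot_prompt("A baker has 3 trays of 12 rolls. How many rolls is that?")
```

Benchmark accuracy is then the fraction of problems where the extracted answer matches the reference answer, which is why every exemplar ends with the same "The answer is" marker.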

Self-Consistency Decoding

A follow-up paper by Wang et al. (arXiv:2203.11171) showed that sampling multiple reasoning chains (40 in their experiments) and taking a majority vote on the final answer improves accuracy further. Different chains may take different paths to the same correct answer, while incorrect chains tend to disagree with each other.

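from collections import Counter

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def chain_of_thought(problem: str, n_samples: int = 10) -> str:
    """Zero-shot CoT with self-consistency: sample chains, then majority-vote."""
    # Cue step-by-step reasoning and request a final line that can be parsed
    prompt = (
        f"{problem}\n"
        "Let's think step by step.\n"
        'End with "The answer is <answer>."'
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        n=n_samples,      # sample several independent reasoning chains
        temperature=0.7,  # non-zero so the chains differ
    )
    # Keep only chains that contain the answer marker
    answers = []
    for choice in response.choices:
        text = choice.message.content or ""
        if "answer is" in text:
            answers.append(text.rsplit("answer is", 1)[-1].strip().rstrip("."))
    if not answers:
        raise ValueError("no chain produced a parseable answer")
    # Majority vote over the extracted final answers
    return Counter(answers).most_common(1)[0][0]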
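The voting step can be tried offline without any API call. The sketch below assumes the final answers have already been extracted as strings; `normalize` and `majority_vote` are hypothetical helper names, and the sample answers are made up for illustration rather than taken from a model run.

```python
from collections import Counter
from typing import List, Tuple


def normalize(answer: str) -> str:
    """Canonicalize numeric answers so '18', '18.0' and '$18' count as one vote."""
    cleaned = answer.strip().rstrip(".").replace("$", "").replace(",", "")
    try:
        value = float(cleaned)
    except ValueError:
        return cleaned.lower()  # non-numeric answers are compared as text
    return str(int(value)) if value.is_integer() else str(value)


def majority_vote(answers: List[str]) -> Tuple[str, float]:
    """Return the most common normalized answer and its share of the votes."""
    if not answers:
        raise ValueError("need at least one answer to vote on")
    counts = Counter(normalize(a) for a in answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)


# Illustrative (made-up) final answers from five sampled chains
sampled = ["18", "$18", "26", "18.0", "14"]
winner, agreement = majority_vote(sampled)
print(winner, agreement)  # prints: 18 0.6
```

Normalizing before counting matters because chains often format the same answer differently. The agreement share can also serve as a rough confidence signal, though that use is a suggestion here and not a claim from the paper.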

Modern Implementation

CoT is now standard in production prompts for reasoning tasks. OpenAI's o1/o3 models internalize chain-of-thought as part of their training rather than relying on prompting. Anthropic builds extended thinking into Claude using similar principles.
