Tree of Thoughts: Let LLMs Explore Multiple Reasoning Paths

Tree of Thoughts treats problem-solving as a search problem: the LLM generates multiple reasoning branches, self-evaluates each, and backtracks when stuck - dramatically improving results on hard reasoning tasks.

Mahmudul Haque Qudrati

CEO & ML Engineer

April 14, 2026

9 min read

The Game of 24 Benchmark

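The most compelling result is on Game of 24: use four numbers, each exactly once, with arithmetic operations to reach 24. Example: "4 5 6 10" → 4 × 5 + 10 - 6 = 24. A tempting attempt such as (10 - 4) × (6 - 5) × 4 also evaluates to 24 but uses 4 twice, so it is invalid. Finding a valid expression requires systematic exploration.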
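To make the rules concrete, here is a minimal sketch of a checker for candidate answers. The function names are hypothetical and not from the paper; it assumes answers are written as plain arithmetic expressions over integer literals, and it uses exact fractions so that division does not introduce rounding errors.

```python
import ast
from collections import Counter
from fractions import Fraction

# Supported binary operators (anything else is rejected).
_OPS = {
    ast.Add: lambda a, b: a + b,
    ast.Sub: lambda a, b: a - b,
    ast.Mult: lambda a, b: a * b,
    ast.Div: lambda a, b: a / b,
}


def _evaluate(node, used):
    """Evaluate an arithmetic AST exactly, recording every number literal in `used`."""
    if (isinstance(node, ast.Constant) and isinstance(node.value, int)
            and not isinstance(node.value, bool)):
        used.append(node.value)
        return Fraction(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        return _OPS[type(node.op)](left, right)
    raise ValueError("only +, -, *, / on integer literals are allowed")


def is_valid_solution(expression, numbers, target=24):
    """True if `expression` uses each of `numbers` exactly once and equals `target`."""
    used = []
    try:
        value = _evaluate(ast.parse(expression, mode="eval").body, used)
    except (ValueError, SyntaxError, ZeroDivisionError):
        return False
    return Counter(used) == Counter(numbers) and value == target


print(is_valid_solution("4 * 5 + 10 - 6", [4, 5, 6, 10]))          # → True
print(is_valid_solution("(10 - 4) * (6 - 5) * 4", [4, 5, 6, 10]))  # → False (4 is used twice)
```

A checker like this is cheap, which is part of why Game of 24 works well as a benchmark: final answers can be verified mechanically even though finding them takes search.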

GPT-4 with standard CoT solved 4% of Game of 24 problems. GPT-4 with ToT (BFS with breadth b=5 over 3 thought steps) solved 74% - an 18.5x improvement. The game requires exploring multiple arithmetic paths and abandoning ones that cannot reach 24, which BFS over thought trees handles naturally.
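The shape of that search tree can be illustrated without an LLM. The sketch below is an exhaustive depth-first search, not the ToT algorithm itself: each step combines two of the remaining numbers, so four numbers always take three steps, and a branch that ends on anything other than 24 is abandoned. ToT keeps the same tree but lets the model propose and score a handful of steps instead of enumerating all of them. The function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations


def next_states(state):
    """Yield (new_state, step) for every way to combine two of the remaining numbers."""
    for i, j in combinations(range(len(state)), 2):
        a, b = state[i], state[j]
        rest = [state[k] for k in range(len(state)) if k not in (i, j)]
        options = [
            (a + b, f"{a} + {b}"),
            (a * b, f"{a} * {b}"),
            (a - b, f"{a} - {b}"),
            (b - a, f"{b} - {a}"),
        ]
        if b != 0:
            options.append((a / b, f"{a} / {b}"))
        if a != 0:
            options.append((b / a, f"{b} / {a}"))
        for value, text in options:
            yield rest + [value], f"{text} = {value}"


def solve(numbers, target=24):
    """Depth-first search with backtracking; returns the list of steps, or None."""
    def search(state, steps):
        if len(state) == 1:
            return steps if state[0] == target else None
        for new_state, step in next_states(state):
            found = search(new_state, steps + [step])
            if found is not None:
                return found
        return None  # dead end: back up and try the next branch

    return search([Fraction(n) for n in numbers], [])


steps = solve([4, 5, 6, 10])
print(steps)  # a list of three steps whose last result is 24
```

Exhaustive search is feasible here only because the state space is tiny. For open-ended tasks there is no way to enumerate every next step, which is where LLM-generated candidates and LLM-scored states come in.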

Self-Evaluation Prompting

The state evaluator is a standard LLM call:

"Given the current state [partial solution], evaluate whether this is likely to lead to a valid solution. Rate as sure/likely/impossible and explain why."

This simple self-evaluation works surprisingly well in practice - the model can often tell when a partial solution has painted itself into a corner.
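One way to turn those free-text verdicts into numbers a search can rank is sketched below. The numeric weights are illustrative assumptions rather than values from the article, and the helper names are hypothetical. Averaging over several replies is a simple way to smooth out a single noisy judgment.

```python
import re

# Illustrative weights for each verdict (an assumption, chosen for this sketch).
VERDICT_WEIGHTS = {"sure": 1.0, "likely": 0.5, "impossible": 0.0}


def build_eval_prompt(partial_solution):
    """Fill the evaluator prompt with the current partial solution."""
    return (
        f"Given the current state {partial_solution}, evaluate whether this is "
        "likely to lead to a valid solution. "
        "Rate as sure/likely/impossible and explain why."
    )


def parse_verdict(reply):
    """Return the first verdict word found in the reply, or None if there is none."""
    match = re.search(r"\b(sure|likely|impossible)\b", reply.lower())
    return match.group(1) if match else None


def score_state(replies):
    """Average the verdict weights over several replies; unparseable replies count as 0."""
    if not replies:
        return 0.0
    total = sum(VERDICT_WEIGHTS.get(parse_verdict(r), 0.0) for r in replies)
    return total / len(replies)


print(score_state(["Sure: 6 * 4 = 24 is reachable.", "Impossible, numbers too small."]))  # → 0.5
```

Keyword matching is fragile: a reply such as "not sure" would be read as "sure". Asking the model to put the verdict alone on the last line, and parsing only that line, is a more robust variant.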

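import re


def parse_score(reply, default=0.0):
    """Pull the numeric score out of an evaluator reply such as 'likely. Score: 7'."""
    match = re.search(r"score\s*[:=]?\s*(\d+(?:\.\d+)?)", reply, re.IGNORECASE)
    return float(match.group(1)) if match else default


def tree_of_thoughts_bfs(llm, problem, n_branches=5, keep_per_node=2,
                         beam_width=5, max_depth=4):
    # `llm` is any object with a generate(prompt) -> str method.
    # Each node is one partial reasoning path plus the score of its latest step.
    current_nodes = [{"thoughts": [], "score": 0.0}]

    for depth in range(max_depth):
        next_nodes = []
        for node in current_nodes:
            # Generate candidate thoughts
            candidates = []
            for _ in range(n_branches):
                thought = llm.generate(
                    f"Problem: {problem}\n"
                    f"Current progress: {node['thoughts']}\n"
                    f"Next step (be concise and specific):"
                )
                candidates.append(thought)

            # Evaluate each candidate as an extension of this node's path
            scored = []
            for thought in candidates:
                path = node["thoughts"] + [thought]
                reply = llm.generate(
                    f"Problem: {problem}\n"
                    f"Reasoning so far: {path}\n"
                    f"Rate this path as sure/likely/impossible, "
                    f"then end with 'Score: N' (N from 1 to 10):"
                )
                scored.append({"thoughts": path, "score": parse_score(reply)})

            # Keep the best candidates from this node
            scored.sort(key=lambda n: n["score"], reverse=True)
            next_nodes.extend(scored[:keep_per_node])

        # Beam: keep the highest-scoring paths across all nodes
        next_nodes.sort(key=lambda n: n["score"], reverse=True)
        current_nodes = next_nodes[:beam_width]

    return current_nodes[0]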

Cost Tradeoffs

ToT makes far more LLM calls than CoT - typically 10-100x as many API calls per problem. For a Game of 24 problem, ToT might require around 100 GPT-4 calls versus 1 for CoT. This limits practical use to hard problems where quality matters more than cost. The paper recommends ToT for planning, mathematical reasoning, and creative writing, where exploration has high value.
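A back-of-the-envelope estimate shows where that number comes from. The sketch below assumes the call pattern of the BFS function in the previous section: one generation call and one evaluation call per candidate, a fixed number of candidates kept per node, and a cap on the nodes carried to the next depth. The function name and parameters are illustrative.

```python
def estimate_llm_calls(n_branches, keep_per_node, beam_width, max_depth):
    """Count LLM calls for a BFS that generates and evaluates every candidate once."""
    nodes = 1  # the search starts from the bare problem
    total = 0
    for _ in range(max_depth):
        # Every node produces n_branches candidates; each costs 2 calls (generate + evaluate).
        total += nodes * n_branches * 2
        # Survivors: at most keep_per_node per node, capped by the beam width.
        nodes = min(nodes * min(keep_per_node, n_branches), beam_width)
    return total


print(estimate_llm_calls(n_branches=5, keep_per_node=2, beam_width=5, max_depth=4))  # → 120
print(estimate_llm_calls(n_branches=5, keep_per_node=2, beam_width=5, max_depth=3))  # → 70
```

Under these assumptions the cost grows roughly linearly with depth once the beam is full, so the beam width and the number of candidates per node are the main levers for controlling spend.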
