Chain-of-Thought Is Greedy
Standard chain-of-thought prompting follows a single reasoning path from start to finish. Each step commits to a direction without the ability to backtrack or explore alternatives. For problems where the correct path is not obvious — where wrong early choices lead to dead ends — greedy CoT fails.
The Tree of Thoughts paper (arXiv:2305.10601) by Yao et al. reframes LLM problem-solving as tree search, where each node is a partial solution ("thought") and each edge is a reasoning step forward.
The Tree of Thoughts Framework
ToT generalizes CoT with four components:
1. Thought decomposition: Break the problem into discrete intermediate steps, where each "thought" is a coherent piece of reasoning (a few words to a paragraph).
2. Thought generator: At each node, generate k candidate thoughts using either sampling (diverse but independent) or proposing (asking the model to enumerate k distinct options).
3. State evaluator: For each partial solution, ask the model to evaluate whether it looks promising. This can be a numeric score or a sure/maybe/impossible classification.
4. Search algorithm: Use BFS (breadth-first, maintain the b most promising nodes per level) or DFS (depth-first with backtracking on sure failures).
The Game of 24 Benchmark
The most compelling result is on Game of 24: use four numbers and arithmetic operations to reach 24. Example: "4 5 6 10" → (10 - 4) × (6 - 5) × 4 = no... this requires systematic exploration.
GPT-4 with standard CoT solved 4% of Game of 24 problems. GPT-4 with ToT (BFS, b=5, 3 steps) solved 74% — a 18.5x improvement. The game requires exploring multiple arithmetic paths and abandoning ones that cannot reach 24, which BFS over thought trees handles naturally.
Self-Evaluation Prompting
The state evaluator is a standard LLM call:
"Given the current state [partial solution], evaluate whether this is likely to lead to a valid solution. Rate as sure/likely/impossible and explain why."
This simple self-evaluation is surprisingly calibrated — the model knows when it has painted itself into a corner.
def tree_of_thoughts_bfs(llm, problem, n_branches=5, max_depth=4):
current_nodes = [{"state": problem, "thoughts": []}]
for depth in range(max_depth):
next_nodes = []
for node in current_nodes:
# Generate candidate thoughts
candidates = []
for _ in range(n_branches):
thought = llm.generate(
f"Problem: {problem}
Current progress: {node['thoughts']}
"
f"Next step (be concise and specific):"
)
candidates.append(thought)
# Evaluate each candidate
scored = []
for thought in candidates:
score = llm.generate(
f"Problem: {problem}
Reasoning so far: {node['thoughts'] + [thought]}
"
f"Rate this path as sure/likely/impossible and give score 1-10:"
)
scored.append({"thought": thought, "score": score, "state": node})
# Keep best candidates
scored.sort(key=lambda x: x["score"], reverse=True)
next_nodes.extend(scored[:2]) # BFS: keep top 2
current_nodes = next_nodes[:n_branches] # Beam width
return current_nodes[0]
Cost Tradeoffs
ToT generates many more LLM calls than CoT — typically 10-100x more API calls per problem. For a Game of 24 problem, ToT might require 100 GPT-4 calls vs 1 for CoT. This limits practical use to hard problems where quality matters more than cost. The paper recommends ToT for planning, mathematical reasoning, and creative writing where exploration has high value.