Toolformer: Teaching LLMs to Use Tools Without Human Annotations

Toolformer learns to call external APIs - calculators, search engines, calendars - by self-supervising on when API calls improve prediction, requiring no human-labeled examples of tool use.

Mahmudul Haque Qudrati

CEO & ML Engineer

April 22, 2026

9 min read

// tags

#toolformer #tool-use #self-supervised #apis #meta-ai


Self-Supervised Data Generation

The pipeline:

Sample positions: For each document in the training corpus, pick candidate positions where a tool call might help. The paper selects positions where the model itself assigns a high enough probability to starting an API call.
Propose API calls: Prompt the model, using a few-shot prompt for each tool, to propose candidate API calls at each sampled position in a consistent format.
Filter on utility: Execute each proposed API call. Keep the call only if including its result in the context reduces the model's perplexity on the subsequent text by more than a threshold (the paper measures this with a weighted cross-entropy loss). This keeps only calls that actually help predict what follows.
Fine-tune: Fine-tune the base language model on the original corpus augmented with the filtered API calls and their real results.
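The first step can be sketched in a few lines. This is a minimal illustration, assuming the per-position probability of the API-start token has already been computed elsewhere; the function name, the argument names, and the default values for tau_s and k are illustrative choices, not taken from the article.

```python
from typing import List


def sample_positions(api_start_probs: List[float], tau_s: float = 0.05, k: int = 5) -> List[int]:
    """Return up to k positions whose API-start probability exceeds tau_s.

    api_start_probs[i] is assumed to be the probability the model assigns
    to beginning an API call at position i (hypothetical input).
    """
    candidates = [(p, i) for i, p in enumerate(api_start_probs) if p > tau_s]
    # Highest-probability positions first; ties go to the earlier position.
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    # Return the surviving positions in document order.
    return sorted(i for _, i in candidates[:k])
```

Capping the number of positions per document keeps the later, more expensive steps (proposing and executing calls) bounded.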

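import torch

def filter_api_calls(model, tokenizer, text, position, api_call, api_result, threshold=1.0):
    """
    Keep an API call only if it reduces perplexity on the continuation.
    """
    continuation = text[position:]

    # Perplexity without API call
    base_context = text[:position]
    ppl_without = compute_perplexity(model, tokenizer, base_context, continuation)

    # Perplexity with API call and result inserted
    api_context = text[:position] + f"[{api_call} -> {api_result}]"
    ppl_with = compute_perplexity(model, tokenizer, api_context, continuation)

    # Keep if reduction exceeds threshold
    return (ppl_without - ppl_with) > threshold

def compute_perplexity(model, tokenizer, context, text):
    """
    Perplexity of `text` given `context`, scored on the continuation tokens only.
    Assumes a Hugging Face causal LM and its matching tokenizer.
    """
    input_ids = tokenizer(context + text, return_tensors="pt").input_ids
    # Approximate: tokenization at the context/text boundary may differ slightly.
    context_len = len(tokenizer(context).input_ids)
    # Mask context tokens so they do not contribute to the loss.
    labels = input_ids.clone()
    labels[:, :context_len] = -100  # -100 is ignored by the LM loss
    with torch.no_grad():
        loss = model(input_ids, labels=labels).loss
    return torch.exp(loss).item()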
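The function above compares raw perplexities. The criterion described in the Toolformer paper is slightly stricter: it uses a weighted loss that emphasizes the tokens right after the call, and it requires the call with its result to beat both "no call" and "call without result". The sketch below works on precomputed per-token log-probabilities so it stays independent of any model; the function names and the tau_f default are illustrative assumptions.

```python
from typing import List


def weighted_loss(token_logprobs: List[float]) -> float:
    """Weighted negative log-likelihood over continuation tokens.

    Weights decay linearly, w_t = max(0, 1 - 0.2 * t), and are then
    normalized, so only the first few tokens after the call matter.
    """
    raw = [max(0.0, 1.0 - 0.2 * t) for t in range(len(token_logprobs))]
    total = sum(raw)
    if total == 0:  # empty continuation
        return 0.0
    return -sum(w / total * lp for w, lp in zip(raw, token_logprobs))


def keep_call(loss_no_call: float,
              loss_call_only: float,
              loss_call_with_result: float,
              tau_f: float = 1.0) -> bool:
    """Keep the call if its result beats both baselines by at least tau_f."""
    baseline = min(loss_no_call, loss_call_only)
    return baseline - loss_call_with_result >= tau_f
```

Comparing against the call without its result guards against cases where the call text alone, rather than the tool's answer, makes the continuation easier to predict.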

Why It Generalizes Better Than Few-Shot

Few-shot prompting for tool use works at inference time but does not internalize tool use into the model weights. The model must parse the few-shot format at every call and can be confused by prompts that differ from the examples. Toolformer bakes tool use into the model - it knows when to call a calculator as a natural part of text generation.
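To make "tool use as part of text generation" concrete: at inference time the paper interrupts decoding when the model finishes writing a call, runs the tool, and inserts the response before decoding continues. The sketch below simplifies this to post-processing a finished string that contains pending calls. The calculator helper, the regular expression, and the two-decimal rounding are assumptions for illustration, and the pattern does not handle nested parentheses.

```python
import ast
import operator
import re

# Only basic arithmetic is allowed; anything else raises ValueError.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def _eval(node):
    """Evaluate a parsed arithmetic expression without using eval()."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    raise ValueError("unsupported expression")


def calculator(expression: str) -> str:
    """Hypothetical calculator tool: handles + - * / and rounds to two decimals."""
    value = _eval(ast.parse(expression, mode="eval").body)
    return f"{value:.2f}"


# Matches a pending call such as [Calculator(400 / 1400)].
CALL = re.compile(r"\[Calculator\(([^)]*)\)\]")


def fill_calls(text: str) -> str:
    """Rewrite each [Calculator(expr)] as [Calculator(expr) -> result]."""
    def run(match):
        expr = match.group(1)
        return f"[Calculator({expr}) -> {calculator(expr)}]"
    return CALL.sub(run, text)
```

For example, fill_calls("That is [Calculator(400 / 1400)] of the total.") fills in the result so the text after the call can condition on it.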

Limitations vs Modern Function Calling

Toolformer was demonstrated on GPT-J (a roughly 6B-parameter model, reported as 6.7B in the paper) - a relatively small model by 2026 standards. It handles five fixed tools, whereas function calling in GPT-4, Claude, and Gemini accepts arbitrary tools described by a schema at request time. Modern function calling (OpenAI function calling, Claude tool use) is built into much larger models, reportedly with RLHF-style post-training on schema-defined tools, which makes tool use more reliable and more general. Still, Toolformer's self-supervised principle - use perplexity reduction to keep only the calls that are actually useful - remains an influential idea for generating tool-use training data.
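For contrast with Toolformer's inline text calls, a schema-defined tool can be pictured as a JSON-Schema-like description plus a check that the model's arguments fit it. The sketch below is generic and does not reproduce any vendor's API; the dictionary layout, the field names, and the validate_call helper are hypothetical, and only three primitive types are handled.

```python
import json
from typing import Any, Dict, List

# Hypothetical tool definition in a JSON-Schema-like shape.
CALCULATOR_TOOL: Dict[str, Any] = {
    "name": "calculator",
    "description": "Evaluate a basic arithmetic expression.",
    "parameters": {
        "type": "object",
        "properties": {"expression": {"type": "string"}},
        "required": ["expression"],
    },
}

# Minimal mapping from schema type names to Python types.
_TYPES = {"string": str, "number": (int, float), "boolean": bool}


def validate_call(tool: Dict[str, Any], raw_arguments: str) -> List[str]:
    """Return problems found in a model-produced JSON argument string (empty if valid)."""
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError:
        return ["arguments are not valid JSON"]
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]
    schema = tool["parameters"]
    problems = [f"missing required field: {name}"
                for name in schema["required"] if name not in args]
    for name, value in args.items():
        spec = schema["properties"].get(name)
        if spec is None:
            problems.append(f"unexpected field: {name}")
        elif not isinstance(value, _TYPES[spec["type"]]):
            problems.append(f"wrong type for field: {name}")
    return problems
```

Because the schema is supplied with each request, new tools can be added without retraining, which is the main practical difference from Toolformer's fixed tool set.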

Results

Toolformer substantially outperforms the GPT-J baseline on tasks requiring calculation, fact retrieval, and temporal reasoning. Despite having only 6.7B parameters, it outperforms much larger models (OPT-66B, GPT-3) on fact-completion and math benchmarks, where the larger baselines have no tool access. On open-domain QA it beats OPT-66B but still trails GPT-3.
