The Annotation Bottleneck for Tool Use
Teaching an LLM to use tools effectively requires examples: (prompt, tool call, tool result, final answer) tuples. Collecting these by hand is expensive and limits which tools can be taught. Toolformer (arXiv:2302.04761) by Schick et al. at Meta AI proposes a self-supervised approach: let the model generate its own tool-use annotations, keep only the ones that actually help, and train on them.
The Five Tools
Toolformer learns to use five APIs:
- Calculator: arithmetic operations with exact results
- Wikipedia search: retrieve relevant passages for a query
- Question answering system: answer factual questions
- Calendar: get the current date/time
- Machine translation: translate text between languages
Each tool call is represented as a special token sequence: [API_NAME(args) -> result]. For example: "The flight takes [Calculator(3 * 60 + 45) -> 225] 225 minutes."
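To make the inline format concrete, here is a minimal sketch of reading and writing annotations in the `[API_NAME(args) -> result]` shape shown above. The helper names (`format_call`, `extract_calls`, `strip_calls`) are hypothetical, and the regular expression assumes results never contain a `]` character.

```python
import re

# Matches the article's inline format: [ApiName(args) -> result]
# Assumption: the result text contains no "]" character.
CALL_PATTERN = re.compile(r"\[(\w+)\((.*?)\) -> ([^\]]*)\]")


def format_call(name, args, result):
    """Render one tool call in the inline annotation format."""
    return f"[{name}({args}) -> {result}]"


def extract_calls(text):
    """Return (name, args, result) tuples for every annotation in text."""
    return [m.groups() for m in CALL_PATTERN.finditer(text)]


def strip_calls(text):
    """Remove annotations and collapse the doubled spaces they leave behind."""
    return re.sub(r"\s{2,}", " ", CALL_PATTERN.sub("", text))
```

Applied to the flight example, `extract_calls` would recover the tool name, the expression, and the result as three strings, while `strip_calls` would give back the plain sentence.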
Self-Supervised Data Generation
The pipeline:
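- Sample positions: For each document in the training corpus, pick candidate positions where a tool call might help. The paper scores each position by the probability the model assigns to starting an API call there and keeps the highest-scoring positions above a threshold.
- Propose API calls: Prompt the model, using a few-shot example to fix the format, to suggest a tool call at each candidate position.
- Filter on utility: Execute each proposed API call. Keep the call only if including the call and its result in the context reduces the model's perplexity on the subsequent text by more than a threshold (the paper states this as a weighted cross-entropy loss). The paper compares against the better of two baselines, no call at all and the call without its result, so a call survives only when the result itself helps.
- Fine-tune: Train the base language model on the filtered dataset, with the surviving API calls and their real results inserted into the text.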
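The position-sampling step can be sketched in isolation. The Toolformer paper describes scoring each position by the probability the model assigns to opening an API call there, then keeping the top few positions above a threshold. The function below assumes those probabilities have already been computed; the name `sample_positions` and the default `threshold` and `top_k` values are illustrative assumptions, not values taken from the paper.

```python
def sample_positions(api_start_probs, threshold=0.05, top_k=5):
    """Pick up to top_k positions whose API-start probability exceeds threshold.

    api_start_probs[i] is assumed to be the model's probability of beginning
    an API call at position i (computed elsewhere).
    """
    candidates = [(p, i) for i, p in enumerate(api_start_probs) if p > threshold]
    # Highest probability first; ties broken by earlier position.
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    # Return the surviving positions in document order.
    return sorted(i for _, i in candidates[:top_k])
```

Returning positions in document order keeps later steps simple, since each position can then be processed left to right.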
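# Sketch of the "Filter on utility" step. Assumes `model` is a Hugging Face
# causal LM and `tokenizer` is its matching tokenizer.
import torch


def filter_api_calls(model, tokenizer, text, position, api_call, api_result, threshold=1.0):
    """
    Keep an API call only if it reduces perplexity on the continuation.
    `position` is a character offset into `text` (assumed to be > 0).
    """
    continuation = text[position:]

    # Perplexity without the API call
    base_context = text[:position]
    ppl_without = compute_perplexity(model, tokenizer, base_context, continuation)

    # Perplexity with the API call and its result inserted
    api_context = text[:position] + f"[{api_call} -> {api_result}]"
    ppl_with = compute_perplexity(model, tokenizer, api_context, continuation)

    # Keep if the reduction exceeds the threshold
    return (ppl_without - ppl_with) > threshold


def compute_perplexity(model, tokenizer, context, text):
    """Perplexity of `text` given `context`, scored on the `text` tokens only."""
    # Tokenize separately so the context/continuation boundary is exact
    context_ids = tokenizer(context, return_tensors="pt").input_ids
    text_ids = tokenizer(text, return_tensors="pt", add_special_tokens=False).input_ids
    input_ids = torch.cat([context_ids, text_ids], dim=1)

    # Mask context tokens with -100 so the loss ignores them
    labels = input_ids.clone()
    labels[:, : context_ids.shape[1]] = -100

    with torch.no_grad():
        loss = model(input_ids, labels=labels).loss
    return torch.exp(loss).item()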
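The Toolformer paper states the filter in terms of a weighted loss rather than raw perplexity, and it compares against two baselines: no call at all, and the call without its result. The sketch below illustrates that criterion on precomputed per-token log-probabilities. The function names are hypothetical, and the linear decay rate of 0.2 reflects one reading of the paper's weighting scheme, so treat it as an illustrative assumption.

```python
def weighted_loss(token_logprobs, decay=0.2):
    """Weighted negative log-likelihood that emphasizes the nearest tokens.

    token_logprobs[t] is the log-probability of the t-th token after the
    candidate position. Weights fall off linearly (decay is an assumption)
    and are normalized to sum to 1.
    """
    raw = [max(0.0, 1.0 - decay * t) for t in range(len(token_logprobs))]
    total = sum(raw)
    if total == 0:
        raise ValueError("need at least one token with non-zero weight")
    return -sum(w / total * lp for w, lp in zip(raw, token_logprobs))


def keep_call(loss_no_call, loss_call_no_result, loss_call_with_result, tau=1.0):
    """Keep the call only if the result beats the better of the two baselines."""
    baseline = min(loss_no_call, loss_call_no_result)
    return baseline - loss_call_with_result >= tau
```

Taking the minimum over both baselines guards against a subtle failure: if merely writing out the call text already lowers the loss, the API result is not what helped, and the call is discarded.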
Why It Generalizes Better Than Few-Shot
Few-shot prompting for tool use works at inference time but does not internalize tool use in the model weights. The model must re-read the few-shot examples on every call, which consumes context, and it can be confused by prompts that differ from those examples. Toolformer instead fine-tunes tool use into the weights, so deciding when to call a calculator becomes a natural part of text generation.
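Because the fine-tuned model emits tool calls as ordinary text, inference reduces to a decoding loop that pauses whenever the model writes the `->` marker, runs the tool, and appends the result. The sketch below illustrates that idea only: `step` is a hypothetical callable standing in for the model's decoder (it returns the next chunk of text, or an empty string when finished), and `calculator` is a small stand-in tool rather than the paper's implementation.

```python
import ast
import operator

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def calculator(expr):
    """Evaluate +, -, *, / over numeric literals without using eval()."""
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")
    return str(ev(ast.parse(expr, mode="eval").body))


def generate_with_tools(step, tools, prompt, max_steps=50):
    """Decode text, executing a tool each time the model writes '->'."""
    text = prompt
    for _ in range(max_steps):
        chunk = step(text)  # hypothetical decoder: next chunk, "" when done
        if not chunk:
            break
        text += chunk
        start = text.rfind("[")
        if text.endswith("->") and start != -1:
            call = text[start + 1:-2].strip()      # e.g. "Calculator(3 * 60 + 45)"
            name, _, rest = call.partition("(")
            args = rest.rsplit(")", 1)[0]
            if name in tools:
                text += f" {tools[name](args)}]"   # insert result, close the call
    return text
```

No few-shot examples appear in the prompt here; the loop relies entirely on the model having learned when to open a call.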
Limitations vs Modern Function Calling
Toolformer was demonstrated on GPT-J (commonly cited as 6B parameters; the paper reports 6.7B), a relatively small model by 2026 standards. It handles five fixed tools, whereas GPT-4, Claude, and Gemini accept arbitrary functions described by a schema at request time. The training recipes behind these commercial systems are not fully public, but they are generally understood to combine supervised fine-tuning with preference-based methods such as RLHF on much larger models, producing more reliable and generalizable tool use. Even so, Toolformer's self-supervised principle, using perplexity reduction to filter for genuine tool utility, is often cited as an influence on later data generation strategies for function-calling training.
Results
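Toolformer substantially outperforms the GPT-J baseline on tasks requiring calculation, fact retrieval, and temporal reasoning. As reported in the paper, the 6.7B-parameter Toolformer matches or exceeds much larger models (OPT-66B, GPT-3) on fact-completion and math benchmarks where tool use is allowed. On open-domain question answering it beats the same-size baselines but still trails GPT-3.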