Last updated: June 3, 2026. All pricing figures reflect current API rates as of this date. Dollar math in the worked example uses Claude Opus 4 at $15/$75 per million tokens (input/output) and is reproducible with the token log provided in the appendix.
This is post 4 of the Pristren AI Sprint series. The model routing section below will make more sense if you have read Post 2: DeepSeek V4-Pro and Kimi K2.6 vs Claude Opus: The Open Weights Reckoning of 2026. For background on MCP, see Post 5: MCP in Production -- What the Standards Body Got Right and Wrong.
Why Token Efficiency Is a Product Problem, Not Just an Infra Problem
Teams talk about token costs the way they used to talk about cloud costs in 2015: as a line item to minimize after the product works. That framing is expensive.
Token consumption is load-bearing architecture. It determines which models you can afford to use at scale, how fast you can iterate on prompt improvements, whether your product margin holds as you grow, and, critically, how responsive the product feels to users when context windows start filling up and you have to decide what to drop.
A team that builds token hygiene into the architecture from day one will spend less, iterate faster, and hit latency targets more consistently than a team that retrofits it six months later when the AWS bill arrives.
This playbook covers six strategies in order of implementation priority. Each section includes a complexity estimate, a realistic impact estimate, and concrete implementation guidance. The worked example at the end shows how these strategies interact in a real 100-turn coding session.
Strategy 1: Model Routing
Model routing is the practice of sending different requests to different models based on the estimated complexity and stakes of each request. The core insight is that not every token in your application needs to be processed by your most expensive model.
The Routing Taxonomy
Classify your requests into three tiers before writing any routing logic:
Tier 1 (reflex tasks): Autocomplete, intent classification, short-form slot filling, single-field extraction, yes/no questions with low consequence. These tasks need speed and low cost. Haiku-tier models (Claude Haiku 3.5, DeepSeek V4-Lite, Gemini Flash 2.0) handle them adequately.
Tier 2 (workhorse tasks): Code generation (under 200 lines), summarization, structured JSON extraction from known schemas, RAG answer synthesis, translation. These tasks need reliable instruction following and moderate reasoning. Sonnet-tier models (Claude Sonnet 4, Kimi K2.6, GPT-4.1 Mini) are appropriate.
Tier 3 (frontier tasks): Complex multi-step reasoning, legal or medical document analysis, code review of unfamiliar systems, nuanced tone-sensitive content, architectural planning. These tasks benefit from Opus-tier quality (Claude Opus 4, DeepSeek V4-Pro with caveats, GPT-4.1).
Routing Logic
The simplest viable router is a small fast classifier that estimates request complexity before forwarding to the appropriate model. Practical options:
Rule-based routing: Use prompt length, keyword presence, task type tag (passed from the application layer), and estimated output length as routing signals. Fast, predictable, zero added latency. Works well when your request types are well-defined and stable. Breaks down when requests are ambiguous.
Classifier-based routing: A fine-tuned 7B model or a Haiku-tier LLM evaluates the incoming request and returns a tier classification. Adds 50-150ms latency. More accurate on ambiguous requests. The LLM Blender paper and RouteLLM library are good starting references.
Confidence-based routing: Start with a cheaper model. If the model's self-reported confidence is below a threshold, re-run on a more capable model. Requires the model to produce calibrated confidence scores, which current models do inconsistently. Use with caution.
Cascade routing: Send to a fast cheap model first. Use a separate quality evaluator (another small model or rule-based check) to assess the response. Escalate to a stronger model if quality check fails. Higher latency but clean separation of concerns.
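To make the cascade option concrete, here is a minimal sketch of its control flow. The names `cascadeRoute`, `ModelCall`, `QualityCheck`, and `nonTrivial` are hypothetical, and the model calls are modeled as plain synchronous functions so the escalation logic is easy to follow; a real client would return promises.

```typescript
// Hypothetical types: a "model" is any function that maps a prompt to text.
type ModelCall = (prompt: string) => string;
type QualityCheck = (prompt: string, response: string) => boolean;

interface CascadeResult {
  response: string;
  escalated: boolean;
}

// Try the cheap model first; call the strong model only when the check fails.
function cascadeRoute(
  prompt: string,
  cheapModel: ModelCall,
  strongModel: ModelCall,
  passesCheck: QualityCheck
): CascadeResult {
  const draft = cheapModel(prompt);
  if (passesCheck(prompt, draft)) {
    return { response: draft, escalated: false };
  }
  return { response: strongModel(prompt), escalated: true };
}

// Example rule-based check (an assumption): reject empty or very short answers.
const nonTrivial: QualityCheck = (_prompt, response) => response.trim().length >= 20;
```

Note that an escalated request pays for both calls, so the quality check should be strict enough to catch bad drafts but not so strict that most traffic escalates.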
Implementation Skeleton
type RequestTier = "reflex" | "workhorse" | "frontier";
// Rule-based tier classification from signals supplied by the application layer
function classifyRequest(prompt: string, taskType: string, outputLengthHint: number): RequestTier {
  if (taskType === "classification" || taskType === "slot_fill") return "reflex";
  if (outputLengthHint > 2000 || taskType === "code_review") return "frontier";
  if (prompt.length > 3000 && taskType === "synthesis") return "frontier";
  return "workhorse";
}
// Map each tier to a model identifier
function routeToModel(tier: RequestTier): string {
  const models: Record<RequestTier, string> = {
    reflex: "claude-haiku-3-5",
    workhorse: "claude-sonnet-4",
    frontier: "claude-opus-4",
  };
  return models[tier];
}
This skeleton leaves out retry logic, cost tracking, and fallback handling, but it captures the core pattern. In a production system, you also want to log every routing decision with actual token counts so you can tune tier boundaries against real usage.
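The logging mentioned above could look something like the following sketch. The `RoutingLogEntry` shape and the `summarizeByTier` helper are hypothetical; the idea is only that each routing decision is recorded with its actual token counts and then aggregated per tier for review.

```typescript
type Tier = "reflex" | "workhorse" | "frontier";

// One record per routed request (hypothetical shape).
interface RoutingLogEntry {
  tier: Tier;
  model: string;
  inputTokens: number;
  outputTokens: number;
}

interface TierTotals {
  requests: number;
  inputTokens: number;
  outputTokens: number;
}

// Aggregate logged decisions per tier so boundaries can be checked against real usage.
function summarizeByTier(log: RoutingLogEntry[]): Record<Tier, TierTotals> {
  const totals: Record<Tier, TierTotals> = {
    reflex: { requests: 0, inputTokens: 0, outputTokens: 0 },
    workhorse: { requests: 0, inputTokens: 0, outputTokens: 0 },
    frontier: { requests: 0, inputTokens: 0, outputTokens: 0 },
  };
  for (const entry of log) {
    const t = totals[entry.tier];
    t.requests += 1;
    t.inputTokens += entry.inputTokens;
    t.outputTokens += entry.outputTokens;
  }
  return totals;
}
```

A tier whose average output is far below what its model is priced for is a candidate for moving down a tier.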
Impact estimate: 40-60 percent cost reduction for applications with mixed task types. Near-zero impact for single-task applications where all requests are inherently frontier-tier.
Strategy 2: Prompt Caching
Prompt caching lets you pay once for processing a long, stable system prompt or document, then reuse that cached prefix for subsequent requests. Anthropic's Claude API supports prompt caching through the cache_control field on content blocks. OpenAI and DeepSeek offer comparable prefix caching.
When Caching Helps
Caching is most valuable when:
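- Your system prompt is long and changes infrequently (per session or per day, not per request). Claude does not cache prefixes below a per-model minimum, which is 1,024 tokens for Claude Opus 4 and Sonnet 4 and 2,048 tokens for Haiku 3.5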
- You attach the same document or context block to multiple requests (RAG documents, codebase files, API specs)
- You run multiple evaluation passes over the same input
Caching is not valuable when:
- System prompts are short and fast to process
- Context changes on every request
- Requests that share a prefix arrive less often than the cache TTL (5 minutes by default for Claude), so the entry expires before it can be reused
Cache Write vs Cache Read Pricing
On Claude's API, cache writes cost 25 percent more than standard input tokens, and cache reads cost 10 percent of the standard input token price. The break-even comes at roughly 1.3 total requests against the same prefix: the 0.25x write premium is recovered by the first read, which saves 0.9x. Every read after that is a 90 percent reduction on that prefix's input cost.
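The arithmetic can be checked with a few lines of code. This is a sketch using the multipliers quoted above (1.25x for the write, 0.10x per read); the function names are hypothetical, and costs are expressed in units of one uncached pass over the prefix.

```typescript
// Cost of `requests` calls sharing a prefix: one cache write, then cache reads.
function cachedPrefixCost(requests: number, writeMult = 1.25, readMult = 0.1): number {
  if (requests <= 0) return 0;
  return writeMult + readMult * (requests - 1);
}

// Without caching, every request pays the full prefix price.
function uncachedPrefixCost(requests: number): number {
  return Math.max(requests, 0);
}

// Smallest whole number of requests at which caching is strictly cheaper.
function breakEvenRequests(writeMult = 1.25, readMult = 0.1): number {
  if (readMult >= 1) return Infinity; // reads are no cheaper, so caching never pays off
  let n = 1;
  while (cachedPrefixCost(n, writeMult, readMult) >= uncachedPrefixCost(n)) n++;
  return n;
}
```

With the default multipliers, a single request costs 1.25 units cached versus 1 uncached, while two requests cost 1.35 cached versus 2 uncached, so caching pays for itself from the second request onward.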
Structuring Prompts for Maximum Cache Hits
The cache key is based on an exact prefix match. To maximize cache hits:
- Put stable content first: system prompt, persona, rules, shared documents.
- Put variable content last: user message, session context, per-request instructions.
- Never interleave stable and variable content.
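// Anthropic Messages API shape: the system prompt is a top-level field, not a message
const request = {
  model: "claude-sonnet-4",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: STABLE_SYSTEM_PROMPT, // stable prefix, cached
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [
    {
      role: "user",
      content: userMessage, // variable, goes last
    },
  ],
};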
Impact estimate: 15-40 percent input cost reduction for applications with long stable system prompts. Higher impact when the same document is queried repeatedly (RAG workloads).
Strategy 3: Context Hygiene
Context hygiene is the discipline of keeping your active context window clean, short, and relevant. It is the most underinvested strategy on this list, because it requires product-level decisions about what to remember and what to discard.
The Context Bloat Problem
Every message you append to a conversation thread increases the input tokens on the next request. In a 100-turn conversation that naively accumulates all messages without pruning, per-request input grows roughly linearly with turn count, so the total input billed over the session grows roughly quadratically. By turn 50, you are paying significant input costs just to maintain conversational history, most of which the model does not meaningfully attend to.
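A short calculation shows the shape of that growth. This is a sketch under a simplifying assumption: every turn adds the same hypothetical number of tokens (`tokensPerTurn`) and the full history is replayed on each request.

```typescript
// Input tokens sent on a given turn when every earlier message is replayed.
function naiveInputTokensAtTurn(turn: number, tokensPerTurn: number): number {
  return turn * tokensPerTurn;
}

// Total input tokens billed over a whole session of `turns` turns.
function naiveCumulativeInputTokens(turns: number, tokensPerTurn: number): number {
  let total = 0;
  for (let t = 1; t <= turns; t++) {
    total += naiveInputTokensAtTurn(t, tokensPerTurn);
  }
  return total; // equals tokensPerTurn * turns * (turns + 1) / 2
}
```

With a hypothetical 500 tokens per turn, 50 turns bill 637,500 input tokens in total and 100 turns bill 2,525,000, close to four times as much for twice the turns.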
Context Hygiene Techniques
Rolling summary compression: After every N turns (typically 10-20), compress the oldest N turns into a summary using a cheap model. Replace the raw messages with the summary. This keeps context length bounded at a cost of one cheap summarization call.
// Replace everything except the last `keepLast` messages with a short summary
async function compressOldTurns(
  messages: Message[],
  keepLast: number
): Promise<Message[]> {
  if (messages.length <= keepLast) return messages;
  const toCompress = messages.slice(0, -keepLast);
  const summary = await callCheapModel(
    `Summarize this conversation segment in 200 words: ${JSON.stringify(toCompress)}`
  );
  return [
    { role: "system", content: `Earlier context: ${summary}` },
    ...messages.slice(-keepLast),
  ];
}
Relevance filtering: Before appending history to a new request, score each historical message for relevance to the current query. Drop messages below a threshold. Use embedding cosine similarity for precise filtering or keyword overlap for a cheaper approximation.
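The cheaper keyword-overlap variant can be sketched as follows. The names `HistoryMessage`, `keywordOverlap`, and `filterRelevant` are hypothetical, and the scoring rule (share of the query's distinct words found in the message) is one simple choice among many. The most recent messages are always kept, on the assumption that they matter regardless of wording.

```typescript
interface HistoryMessage {
  role: "user" | "assistant";
  content: string;
}

// Lowercased word list; words shorter than 3 characters are dropped as noise.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9_]+/).filter((w) => w.length >= 3);
}

// Fraction (0 to 1) of the query's distinct words that also appear in the message.
function keywordOverlap(query: string, message: string): number {
  const seen: Record<string, boolean> = Object.create(null);
  for (const w of tokenize(message)) seen[w] = true;
  const counted: Record<string, boolean> = Object.create(null);
  let distinct = 0;
  let hits = 0;
  for (const w of tokenize(query)) {
    if (counted[w]) continue;
    counted[w] = true;
    distinct++;
    if (seen[w]) hits++;
  }
  return distinct === 0 ? 0 : hits / distinct;
}

// Keep the last `keepLast` messages unconditionally, plus any older message
// whose overlap with the current query meets the threshold.
function filterRelevant(
  history: HistoryMessage[],
  query: string,
  threshold: number,
  keepLast: number
): HistoryMessage[] {
  const cutoff = Math.max(history.length - keepLast, 0);
  return history.filter(
    (m, i) => i >= cutoff || keywordOverlap(query, m.content) >= threshold
  );
}
```

Swapping `keywordOverlap` for an embedding cosine similarity keeps the same filtering structure while improving recall on paraphrased content.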
Structured memory extraction: Instead of accumulating raw conversation history, maintain a structured memory object (key facts, user preferences, task state) and inject only the relevant fields into each new request. This is more complex to build but produces much cleaner context.
Explicit context window management: Track token counts per message. When total context exceeds 80 percent of the target context window, trigger compression. Never let context fill the window reactively during a user-facing request.
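A proactive trigger for the 80 percent rule might look like this sketch. `CountedMessage` and `shouldCompress` are hypothetical names, and `estimateTokens` uses the rough rule of thumb of about four characters per token for English text, which is an assumption and no substitute for the provider's tokenizer or the usage figures returned by the API.

```typescript
interface CountedMessage {
  content: string;
  tokens: number; // token count recorded when the message was created
}

// Crude fallback estimate when no tokenizer is at hand: about 4 characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function totalTokens(messages: CountedMessage[]): number {
  return messages.reduce((sum, m) => sum + m.tokens, 0);
}

// True once the context exceeds `triggerRatio` (default 80 percent) of the window.
function shouldCompress(
  messages: CountedMessage[],
  contextWindow: number,
  triggerRatio = 0.8
): boolean {
  return totalTokens(messages) > contextWindow * triggerRatio;
}
```

Running this check after each response, rather than before the next request, moves the compression call off the user-facing path.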
Impact estimate: 25-50 percent input cost reduction for long-running conversational applications. Moderate impact for single-turn or short-session applications.
Strategy 4: Plan-Then-Execute
Plan-then-execute is a prompting pattern that separates task decomposition from task execution. Instead of asking a frontier model to reason through a complex problem and produce output in a single pass, you first ask a cheaper model to produce a structured plan, then execute each plan step with the appropriate model tier.
Why It Reduces Tokens
The planning step is typically much shorter (and therefore cheaper) than execution. More importantly, it reduces the number of mid-task corrections. When a model reasons and executes simultaneously, errors discovered mid-execution require regenerating large output blocks. When execution follows a validated plan, errors are caught at the planning stage before expensive output tokens are generated.
Implementation Pattern
async function planThenExecute(task: string) {
  // Step 1: Generate the plan with the workhorse model.
  // The prompt names the fields that Step 2 reads from each step.
  const plan = await callModel("claude-sonnet-4", {
    prompt:
      "Break this task into 3-5 concrete steps. Output only a JSON array of objects " +
      'with the fields "description", "type", and "estimatedOutputLength". ' +
      `Task: ${task}`,
    maxTokens: 300,
  });
  // JSON.parse throws on malformed output; catch it upstream or re-prompt
  const steps = JSON.parse(plan.content);
  // Step 2: Execute each step, routing to the appropriate tier
  const results = [];
  for (const step of steps) {
    const tier = classifyRequest(step.description, step.type, step.estimatedOutputLength);
    const model = routeToModel(tier);
    const result = await callModel(model, { prompt: step.description, context: results });
    results.push(result);
  }
  return results;
}
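Because the savings depend on executing a validated plan, it helps to check the plan before spending any execution tokens. The following is a sketch of one such check; `PlanStep` and `parsePlan` are hypothetical names, and the field names follow the ones the pattern above reads from each step.

```typescript
interface PlanStep {
  description: string;
  type: string;
  estimatedOutputLength: number;
}

// Parse model output into plan steps, or return null if it is not a usable plan.
function parsePlan(raw: string, maxSteps = 5): PlanStep[] | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null; // not valid JSON
  }
  if (!Array.isArray(parsed) || parsed.length === 0 || parsed.length > maxSteps) {
    return null;
  }
  const steps: PlanStep[] = [];
  for (const item of parsed) {
    if (typeof item !== "object" || item === null) return null;
    const { description, type, estimatedOutputLength } = item as Record<string, unknown>;
    if (typeof description !== "string" || description.trim() === "") return null;
    if (typeof type !== "string") return null;
    if (typeof estimatedOutputLength !== "number" || !(estimatedOutputLength > 0)) return null;
    steps.push({ description, type, estimatedOutputLength });
  }
  return steps;
}
```

A null result is the cue to re-prompt the planner, which costs a few hundred tokens, instead of discovering the problem halfway through execution.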
Impact estimate: 20-35 percent cost reduction for complex multi-step tasks. High impact on tasks where a single-pass approach would otherwise require multiple regeneration rounds.
Strategy 5: Structured Output Enforcement
Asking a model to produce structured output (JSON, XML, CSV) is cheaper than asking it to produce prose that you then parse. This seems counterintuitive, but the mechanism is straightforward: structured output with constrained grammar produces shorter, denser output than explanatory prose, and constrained decoding (when available) prevents the model from wandering into narrative that you then have to strip.
Use JSON Schema or Zod-backed structured output when the API supports it. Claude's tool use API and OpenAI's Structured Outputs feature both support JSON Schema constraints that reduce output verbosity.
const schema = {
  type: "object",
  properties: {
    sentiment: { type: "string", enum: ["positive", "negative", "neutral"] },
    confidence: { type: "number", minimum: 0, maximum: 1 },
    key_topics: { type: "array", items: { type: "string" }, maxItems: 5 },
  },
  required: ["sentiment", "confidence", "key_topics"],
};
This produces output of roughly 80 tokens. The equivalent unstructured prompt ("Analyze the sentiment and identify key topics, explain your reasoning") typically produces 200-400 tokens with explanatory text you will discard.
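When the API cannot enforce the schema at decode time, the response still needs checking before use. Here is a hand-written sketch that mirrors the schema above; `SentimentResult` and `parseSentimentResult` are hypothetical names, and a schema validation library would do the same job with less code.

```typescript
type Sentiment = "positive" | "negative" | "neutral";

interface SentimentResult {
  sentiment: Sentiment;
  confidence: number;
  key_topics: string[];
}

// Returns the typed result, or null when the payload does not match the schema.
function parseSentimentResult(raw: string): SentimentResult | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // not valid JSON
  }
  if (typeof data !== "object" || data === null || Array.isArray(data)) return null;
  const { sentiment, confidence, key_topics } = data as Record<string, unknown>;
  if (sentiment !== "positive" && sentiment !== "negative" && sentiment !== "neutral") {
    return null;
  }
  if (typeof confidence !== "number" || confidence < 0 || confidence > 1) return null;
  if (!Array.isArray(key_topics) || key_topics.length > 5) return null;
  for (const topic of key_topics) {
    if (typeof topic !== "string") return null;
  }
  return {
    sentiment: sentiment as Sentiment,
    confidence,
    key_topics: key_topics as string[],
  };
}
```

Rejecting a malformed payload early is cheaper than letting it reach the frontend and triggering a full regeneration.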
Impact estimate: 30-60 percent output cost reduction for extraction, classification, and analysis tasks. Minimal impact on generative tasks where prose output is the goal.
Strategy 6: MCP Audit
Model Context Protocol (MCP) enables AI assistants to call tools (file system, APIs, databases) mid-conversation. It is a powerful pattern, but it is also a token cost multiplier if not audited regularly.
Tool definitions are sent with every request that has tools registered, and each tool call adds the call itself and its result to the context. A complex tool result (a full file tree, a long API response, a database query result) can add 2,000-10,000 tokens per call. In agentic loops where the model calls tools iteratively, unaudited MCP usage is one of the fastest ways to exhaust both context windows and budgets.
MCP Audit Checklist
Run this audit quarterly or whenever you add new tools:
- Inventory all registered tools. List every MCP server and every tool it exposes. Remove tools that are registered but unused in production.
- Measure per-call token cost. Log the token count of tool results for each tool type. Identify tools that regularly return results above 1,000 tokens.
- Truncate or paginate large tool results. File system tools that return full directory trees should truncate at a configurable depth. Database query tools should default to 20-row limits with explicit pagination.
- Summarize tool results before injection. For long tool results that are read-only context (logs, documents), pass through a cheap summarization step before injecting into the main context.
- Debounce redundant tool calls. In agentic loops, models sometimes call the same tool multiple times with identical arguments within the same session. Cache tool results within a session TTL.
- Audit tool definition verbosity. Tool descriptions and parameter schemas are injected with every request that has tools registered. Keep tool descriptions under 100 tokens. Use terse parameter names and minimal descriptions.
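Two of the checklist items, truncating large results and debouncing repeated calls, can be sketched in a few lines. `ToolResultCache` and `truncateRows` are hypothetical helpers that sit between the MCP client and the model context; they are not part of any MCP SDK. The clock is passed in explicitly so the TTL behavior is easy to test.

```typescript
interface CachedResult {
  value: string;
  storedAt: number; // milliseconds
}

// Session-scoped cache keyed by tool name plus serialized arguments.
// Limitation: JSON.stringify is key-order sensitive, so {a, b} and {b, a} miss each other.
class ToolResultCache {
  private entries: Record<string, CachedResult> = Object.create(null);
  private ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  private key(tool: string, args: object): string {
    return tool + ":" + JSON.stringify(args);
  }

  get(tool: string, args: object, now: number): string | undefined {
    const hit = this.entries[this.key(tool, args)];
    if (hit === undefined || now - hit.storedAt > this.ttlMs) return undefined;
    return hit.value;
  }

  set(tool: string, args: object, value: string, now: number): void {
    this.entries[this.key(tool, args)] = { value, storedAt: now };
  }
}

// Keep at most `maxRows` rows and tell the model how many were left out.
function truncateRows(rows: string[], maxRows = 20): string[] {
  if (rows.length <= maxRows) return rows;
  const omitted = rows.length - maxRows;
  return rows.slice(0, maxRows).concat(`[${omitted} more rows omitted; request the next page]`);
}
```

The omission marker matters: without it the model cannot tell a truncated result from a complete one and may draw conclusions from partial data.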
Impact estimate: Highly variable. Teams that have never audited MCP tool costs often find 20-40 percent of total context tokens come from tool definitions and results. After audit, 15-30 percent total cost reduction is common.
The Worked Example: 100-Turn Coding Session
Here is a concrete cost comparison between an unoptimized Opus-only setup and an optimized routed stack, using a real-world coding session profile.
Session Profile
- 100 turns total
- 35 reflex turns: autocomplete requests, variable name suggestions, short clarification questions
- 45 workhorse turns: function implementations under 150 lines, test generation, doc string writing, error debugging
- 20 frontier turns: architectural review, security audit of new authentication module, complex async refactoring spanning 5 files
Turn Composition (average tokens per turn)
| Turn type | Count | Avg input tokens | Avg output tokens |
|---|---|---|---|
| Reflex | 35 | 800 | 120 |
| Workhorse | 45 | 1,400 | 400 |
| Frontier | 20 | 2,200 | 900 |
Context accumulation adds roughly 30 percent to input tokens by session end in the naive case (no hygiene). With rolling summary compression, input token growth is bounded and we model an 8 percent average uplift instead.
Scenario A: Opus-Only, No Optimization
All 100 turns routed to Claude Opus 4. No caching. No context hygiene. No structured output.
| Turn type | Input tokens | Output tokens |
|---|---|---|
| Reflex (35 turns) | 28,000 + 30% accumulation = 36,400 | 4,200 |
| Workhorse (45 turns) | 63,000 + 30% accumulation = 81,900 | 18,000 |
| Frontier (20 turns) | 44,000 + 30% accumulation = 57,200 | 18,000 |
| Total | 175,500 | 40,200 |
Cost at Opus 4 pricing ($15 input / $75 output per million tokens):
- Input: 175,500 / 1,000,000 x $15 = $2.63
- Output: 40,200 / 1,000,000 x $75 = $3.02
- Session total: $5.65
That figure understates a real session, because it leaves out the system prompt and the file context.
In a real coding session, the system prompt for a coding assistant is typically 2,000-4,000 tokens (repo context, coding standards, active file context). Let us use 3,000 tokens. This system prompt is sent with every request.
Additional input tokens from system prompt: 3,000 x 100 = 300,000 tokens.
Revised total input: 175,500 + 300,000 = 475,500 tokens. Revised input cost: 475,500 / 1,000,000 x $15 = $7.13. Revised session total: $10.15.
Now add that most coding assistants also attach the current file content on each turn. A 500-line TypeScript file is roughly 2,500 tokens. Assuming the active file is attached on 80 of 100 turns:
File attachment tokens: 2,500 x 80 = 200,000 tokens. Revised total input: 675,500 tokens. Revised input cost: $10.13. Revised session total with file context: $13.15.
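The Scenario A arithmetic can be reproduced with a short script. This is a sketch that hard-codes the figures from the session profile and the assumptions stated above (30 percent accumulation, a 3,000-token system prompt on every turn, a 2,500-token file on 80 turns); the type and function names are hypothetical.

```typescript
interface TurnProfile {
  count: number;
  avgInputTokens: number;
  avgOutputTokens: number;
}

const profile: TurnProfile[] = [
  { count: 35, avgInputTokens: 800, avgOutputTokens: 120 }, // reflex
  { count: 45, avgInputTokens: 1400, avgOutputTokens: 400 }, // workhorse
  { count: 20, avgInputTokens: 2200, avgOutputTokens: 900 }, // frontier
];

const OPUS_INPUT_PER_MTOK = 15;
const OPUS_OUTPUT_PER_MTOK = 75;

function dollars(tokens: number, pricePerMillion: number): number {
  return (tokens / 1e6) * pricePerMillion;
}

function scenarioA(): { inputTokens: number; outputTokens: number; cost: number } {
  let baseInput = 0;
  let outputTokens = 0;
  for (const p of profile) {
    baseInput += p.count * p.avgInputTokens;
    outputTokens += p.count * p.avgOutputTokens;
  }
  const accumulated = Math.round(baseInput * 1.3); // 30 percent context accumulation
  const systemPrompt = 3000 * 100; // sent on every turn
  const fileContext = 2500 * 80; // attached on 80 of 100 turns
  const inputTokens = accumulated + systemPrompt + fileContext;
  const cost =
    dollars(inputTokens, OPUS_INPUT_PER_MTOK) + dollars(outputTokens, OPUS_OUTPUT_PER_MTOK);
  return { inputTokens, outputTokens, cost };
}
```

Calling `scenarioA()` yields 675,500 input tokens, 40,200 output tokens, and an unrounded cost of about $13.1475, which matches the $13.15 above to the cent.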
In practice, sessions in the 90th percentile of complexity (active codebase references, multi-file context, long debug traces) regularly reach $25-35 per session on Opus-only setups. The $31 figure in the summary is the 90th percentile, not the average.
Scenario B: Optimized Routed Stack
Model routing:
- Reflex turns (35) routed to Claude Haiku 3.5 ($0.80/$4.00 per million tokens)
- Workhorse turns (45) routed to Claude Sonnet 4 ($3.00/$15.00 per million tokens)
- Frontier turns (20) stay on Opus 4
Prompt caching:
- System prompt (3,000 tokens) cached. At Opus rates, the cache write costs 3,000 x $15 x 1.25 / 1,000,000 = $0.056 once.
- At Opus rates, cache reads for 99 subsequent requests would cost 3,000 x 99 x $15 x 0.10 / 1,000,000 = $0.446.
- Without caching the system prompt would cost: 3,000 x 100 x $15 / 1,000,000 = $4.50 (before routing discounts).
Actual cost after routing and caching, system prompt component: roughly $0.20 total. This assumes one cache write per model, followed by reads at each model's own rate: about $0.01 on Haiku (35 turns), $0.05 on Sonnet (45 turns), and $0.14 on Opus (20 turns).
Context hygiene:
- Rolling summary compression applied every 20 turns using Haiku (cost: 5 x (~500 input tokens + ~200 output tokens) at Haiku rates = negligible, under $0.01).
- Context accumulation overhead reduced from 30 percent to 8 percent.
Structured output:
- Reflex and workhorse turns use JSON Schema constraints. Average output reduced by 40 percent on those turns.
| Turn type | Model | Input tokens | Output tokens | Cost |
|---|---|---|---|---|
| Reflex (35) | Haiku 3.5 | 28,000 (no accumulation, short context) | 2,520 (40% reduction) | $0.032 |
| Workhorse (45) | Sonnet 4 | 63,000 + 8% = 68,040 | 10,800 (40% reduction) | $0.366 |
| Frontier (20) | Opus 4 | 44,000 + 8% = 47,520 | 18,000 | $2.063 |
| System prompt caching | -- | -- | -- | $0.204 |
| File attachment (Haiku/Sonnet turns, cached) | -- | -- | -- | $0.052 |
| Context compression overhead | Haiku | 2,500 | 1,000 | $0.006 |
| Total | | | | $2.72 |
Note: the active file (2,500 tokens) is attached on the 80 Haiku and Sonnet turns. Uncached at Opus rates, as in Scenario A, that context costs $3.00 (200,000 x $15 / 1,000,000). Routed and cached with Claude's ephemeral cache, it costs approximately $0.05: one cache write per model at 1.25x, plus 34 Haiku reads and 44 Sonnet reads at the 10 percent rate. This assumes the file is unchanged between turns and that requests arrive within the cache TTL.
Scenario B total: approximately $2.72.
Versus Scenario A (90th percentile) at $31: that is a reduction of roughly 91 percent. Even against the average-case Scenario A at $13.15, Scenario B achieves a 79 percent reduction.
The savings come from three main sources: model routing (routing 80 turns away from Opus), prompt caching (eliminating repeated system prompt and file attachment costs), and structured output (reducing output tokens on high-volume workhorse turns). Caching accounts for the largest share in this session, because the system prompt and file attachments make up $7.50 of Scenario A's $13.15.
Putting It Into Practice
The strategies above are not all-or-nothing. Start with the highest-impact, lowest-complexity items:
Week 1: Implement prompt caching for system prompts and any static document attachments. Estimated implementation time: 2-4 hours. Estimated impact: immediate 10-25 percent cost reduction.
Week 2-3: Implement basic model routing using rule-based tier classification. Estimated implementation time: 1-2 days. Estimated impact: 30-50 percent additional reduction for mixed-task applications.
Week 4: Audit MCP tool definitions and result sizes. Truncate large results. Cache tool results within session. Estimated implementation time: 1-2 days per MCP server. Estimated impact: highly variable, often 10-20 percent.
Month 2: Implement rolling context compression for long-session applications. Add structured output constraints to high-volume endpoints. Estimated implementation time: 3-5 days. Estimated impact: 15-30 percent additional reduction.
Month 3+: Instrument everything. Build a token cost dashboard. Set per-request cost budgets and alert on overruns. Use real usage data to tune routing thresholds and cache TTLs. The first pass of optimizations is based on estimates; the second pass is based on your actual traffic.
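A per-request budget check is a small amount of code once token counts are logged. The sketch below uses the prices quoted in this post and the model identifiers from the routing skeleton; `requestCost` and `checkBudget` are hypothetical names, and in production the returned message would feed whatever alerting channel the team already uses.

```typescript
interface Price {
  inputPerMTok: number;
  outputPerMTok: number;
}

// Prices quoted in this post, in USD per million tokens.
const PRICES: Record<string, Price> = {
  "claude-haiku-3-5": { inputPerMTok: 0.8, outputPerMTok: 4 },
  "claude-sonnet-4": { inputPerMTok: 3, outputPerMTok: 15 },
  "claude-opus-4": { inputPerMTok: 15, outputPerMTok: 75 },
};

function requestCost(model: string, inputTokens: number, outputTokens: number): number {
  if (!Object.prototype.hasOwnProperty.call(PRICES, model)) {
    throw new Error(`No price configured for model: ${model}`);
  }
  const price = PRICES[model];
  return (inputTokens * price.inputPerMTok + outputTokens * price.outputPerMTok) / 1e6;
}

// Returns an alert message when a request exceeds its budget, otherwise null.
function checkBudget(
  model: string,
  inputTokens: number,
  outputTokens: number,
  budgetUsd: number
): string | null {
  const cost = requestCost(model, inputTokens, outputTokens);
  if (cost <= budgetUsd) return null;
  return `Over budget: ${model} request cost $${cost.toFixed(4)} (budget $${budgetUsd.toFixed(4)})`;
}
```

For example, an average frontier turn from the worked example (2,200 input and 900 output tokens on Opus) costs about $0.10, so a $0.05 budget would flag it while an average reflex turn on Haiku passes easily.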
How Zlyqor Uses These Strategies
Zlyqor (Pristren's team workspace platform) runs AI features across task suggestion, meeting summarization, and an integrated assistant. All three features operate on the strategies described above.
Task suggestion uses plan-then-execute routing: a Sonnet-tier model generates a structured task plan from a brief description, then Haiku handles the individual task expansions. System prompts are cached per organization. The average cost per task suggestion run is $0.003, down from $0.018 before optimization.
Meeting summarization uses a pre-processing step that extracts speaker turns and filters filler content before sending to the model, reducing average input tokens by 35 percent. The summary itself is generated on Sonnet 4, not Opus. Structured output ensures the response is a typed JSON object, not a prose summary that the frontend then parses.
The AI assistant uses cascade routing: initial responses come from Sonnet 4. If the user marks a response as unhelpful or asks a follow-up that the routing classifier tags as frontier-complexity, the request escalates to Opus 4. Roughly 12 percent of assistant turns escalate. The remaining 88 percent are served at Sonnet pricing.
If you want to see how this looks in a production team workspace, the Zlyqor platform is available for teams on the standard plan. The AI features described above are available on all paid tiers.
Summary
Token efficiency is architecture, not afterthought. The six strategies in this playbook (model routing, prompt caching, context hygiene, plan-then-execute, structured output, and MCP audit) are not mutually exclusive and compound when combined.
The worked example shows that a 100-turn coding session that costs $13-31 on an unoptimized Opus-only stack can be brought to under $3 on a properly routed, cached, and compressed stack with no perceptible quality loss on 80 percent of turns.
Start with caching and routing. Measure everything. Tune based on real traffic. The teams that treat token budgets as a product constraint from day one will have a structural cost advantage as LLM usage scales.
Part of the Pristren AI Sprint series. Continue reading:
- Post 1: The 2026 Model Landscape Map
- Post 2: DeepSeek V4-Pro and Kimi K2.6 vs Claude Opus: The Open Weights Reckoning of 2026
- Post 3: Fine-Tuning Open Weights Without Destroying Alignment
- Post 5: MCP in Production -- What the Standards Body Got Right and Wrong
- Post 6: Evaluating LLMs for Enterprise: A Practitioner's Scorecard