Output tokens are typically 3-6x more expensive than input tokens at major LLM providers. GPT-4o charges $2.50 per million input tokens and $10.00 per million output tokens (4x). Claude 3.5 Sonnet charges $3.00 per million input tokens and $15.00 per million output tokens (5x). Because of that asymmetry, removing an output token saves more than removing an input token. Specific prompt instructions and output format choices can reduce output token count by 40-60% while preserving the same information.
Why Output Tokens Cost More
The pricing asymmetry exists because generating output tokens is computationally more expensive than processing input tokens. Output tokens are generated sequentially: the model runs a forward pass for every token it produces, and each pass has to wait for the token before it. Input tokens are processed in parallel in a single forward pass. The price ratio loosely reflects that difference in compute cost.
This means optimization effort applied to output length reduction yields higher returns than the same effort applied to input length reduction.
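To make the asymmetry concrete, here is a minimal sketch that prices one request and then removes the same number of tokens from each side. The prices are the GPT-4o figures quoted above; the function name and the request sizes (1,000 input tokens, 500 output tokens, 200 tokens trimmed) are illustrative assumptions, not measurements.

```python
# USD per million tokens, from the GPT-4o figures quoted above.
INPUT_PRICE_PER_M = 2.50
OUTPUT_PRICE_PER_M = 10.00


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at the prices above."""
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


# Hypothetical request: 1,000 input tokens and 500 output tokens.
baseline = request_cost(1_000, 500)
trim_input = request_cost(800, 500)     # remove 200 input tokens
trim_output = request_cost(1_000, 300)  # remove 200 output tokens

print(f"baseline:    ${baseline:.4f}")
print(f"trim input:  ${trim_input:.4f}")
print(f"trim output: ${trim_output:.4f}")
```

Under these assumptions, trimming 200 output tokens saves four times as much as trimming 200 input tokens, which is simply the price ratio.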
Technique 1: Explicit Length Instructions
The simplest technique: tell the model exactly how long you want its response.
Compare these two prompts:
Without instruction: "Explain what transformer attention is."
Typical output: 400-600 tokens (3-4 paragraphs with introduction, detailed explanation, and closing)
With instruction: "Explain what transformer attention is. Respond in under 100 words."
Typical output: 80-100 tokens
The instruction reduced output by 75-85%. This works for most informational tasks where you do not need exhaustive coverage.
More specific instructions work better than vague ones:
- "Respond in under 100 words" (specific) is better than "be concise" (vague)
- "Give a 3-bullet summary" (format-specific) is better than "summarize briefly"
- "Respond in one sentence" (maximally specific) is best for simple questions
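One way to apply this consistently is to append the limit in code and then check whether the response respected it, since models do not always follow length instructions. The sketch below is illustrative: the helper names are hypothetical, and the word count is a simple whitespace split rather than a tokenizer count.

```python
def with_word_limit(prompt: str, max_words: int) -> str:
    """Append an explicit, numeric length instruction to a prompt."""
    return f"{prompt.rstrip()} Respond in under {max_words} words."


def exceeds_word_limit(response: str, max_words: int) -> bool:
    """Return True if the response is not under the limit (whitespace word count)."""
    return len(response.split()) >= max_words


prompt = with_word_limit("Explain what transformer attention is.", 100)
print(prompt)
```

Logging how often `exceeds_word_limit` returns True gives a rough picture of how reliably a given model follows the instruction.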
Technique 2: Use JSON With Short Keys
When you need structured output, JSON key names affect token count significantly.
Verbose JSON:
{
  "sentiment_classification": "positive",
  "confidence_score": 0.94,
  "primary_topic": "customer service",
  "action_recommendation": "escalate to specialist"
}
Approx. 35-40 tokens for keys alone.
Compact JSON:
{
  "s": "positive",
  "c": 0.94,
  "t": "customer service",
  "a": "escalate"
}
Approx. 15-18 tokens for keys alone. A 50-55% reduction in key token overhead.
For high-volume structured outputs, shorter key names compound into meaningful savings. Document the key mapping in your code rather than in the prompt.
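Keeping the mapping in code could look like the sketch below, which expands the compact keys back into descriptive names after parsing. `KEY_MAP` and `expand_keys` are hypothetical names chosen for illustration; the keys mirror the example above.

```python
import json

# Hypothetical mapping from the compact keys the model emits to descriptive names.
KEY_MAP = {
    "s": "sentiment_classification",
    "c": "confidence_score",
    "t": "primary_topic",
    "a": "action_recommendation",
}


def expand_keys(raw_json: str) -> dict:
    """Parse compact model output and rename known keys; unknown keys pass through."""
    compact = json.loads(raw_json)
    return {KEY_MAP.get(key, key): value for key, value in compact.items()}


model_output = '{"s": "positive", "c": 0.94, "t": "customer service", "a": "escalate"}'
record = expand_keys(model_output)
print(record["sentiment_classification"])
```

The rest of the codebase then works with readable field names, while the model only ever produces the short ones.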
Technique 3: Bullet Points Instead of Prose
Prose uses more tokens than bullets for the same information. The connective tissue of paragraphs (transitional phrases, repeated context, hedges) adds tokens without adding information.
Compare:
Prose response (120 tokens): "The main advantages of this approach include improved performance, which can be significant in high-load scenarios. Additionally, the reduced complexity makes maintenance much easier for the engineering team. Finally, the lower infrastructure cost is a key benefit that helps keep the overall operating expenses manageable."
Bullet response (45 tokens): "Benefits:
- Improved performance under high load
- Lower maintenance complexity
- Reduced infrastructure cost"
Same information, 62% fewer tokens. Always specify bullet points in your system prompt for enumerable content.
Technique 4: Remove Preamble With System Prompt Instructions
By default, models tend to acknowledge your request before answering it: "Great question! Here is what you need to know about X..." or "Of course! Let me explain..."
These preambles add 10-30 tokens that carry no information. Eliminate them with a system prompt instruction:
System: Respond directly. Do not acknowledge the question or use opening phrases like "Great question", "Of course", "Certainly", or "I'll help you with that". Start immediately with the answer.
At a typical saving of 15-25 tokens per interaction, 1 million interactions per month comes to 15-25 million saved tokens, which at GPT-4o output pricing ($10/1M) is $150-250/month.
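The same estimate can be reproduced for your own volume with a few lines. This is a back-of-the-envelope sketch using the GPT-4o output price from the text; the function name is hypothetical, and the per-call saving should come from your own logs.

```python
OUTPUT_PRICE_PER_M = 10.00  # USD per million output tokens (GPT-4o figure from the text)


def monthly_saving_usd(tokens_saved_per_call: int, calls_per_month: int) -> float:
    """Monthly dollar saving from dropping a fixed number of output tokens per call."""
    return tokens_saved_per_call * calls_per_month * OUTPUT_PRICE_PER_M / 1_000_000


low = monthly_saving_usd(15, 1_000_000)
high = monthly_saving_usd(25, 1_000_000)
print(f"${low:.0f}-${high:.0f} per month")
```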
Technique 5: Avoid Repeating the Question
Models often paraphrase the user's question before answering it, especially in customer service and FAQ scenarios. Prevent this:
System: Do not repeat or paraphrase the user's question. Begin your response with the answer.
Without instruction: "You asked about our return policy. Our return policy allows returns within 30 days of purchase with a receipt. Items must be in original condition."
With instruction: "Returns are accepted within 30 days of purchase with a receipt. Items must be in original condition."
First version: 38 tokens. Second: 25 tokens. 34% reduction.
Technique 6: Post-Processing Trim
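For use cases where you have less control over the model's output behavior (for example, a third-party model or strict prompt constraints), post-process outputs to trim boilerplate endings.
Models often append phrases like "I hope this helps!", "Let me know if you have any other questions!", or "Is there anything else I can assist you with?" These add tokens and are rarely useful. Trimming after generation does not lower the bill for that response, since the tokens have already been generated; the saving comes when the output is stored or sent back as input in later turns.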
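BOILERPLATE_ENDINGS = [
    "I hope this helps",
    "Let me know if you have",
    "Is there anything else",
    "Feel free to ask",
    "Don't hesitate to",
]

def trim_boilerplate(text: str, tail_chars: int = 200) -> str:
    """Cut the response at the earliest boilerplate phrase found near its end."""
    lowered = text.lower()
    # Search only the tail so a phrase in the body does not truncate real content.
    # The 200-character window is an assumed default; tune it for your outputs.
    start = max(0, len(text) - tail_chars)
    hits = [lowered.find(phrase.lower(), start) for phrase in BOILERPLATE_ENDINGS]
    hits = [idx for idx in hits if idx != -1]
    return text[:min(hits)].rstrip() if hits else text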
Measuring Your Current Token Waste
Before optimizing, measure your baseline. Log the input and output token counts for a sample of 1,000 production requests. Then manually review the 50 longest outputs and categorize why they are long:
- Preamble (8-25 tokens)
- Question repetition (10-40 tokens)
- Closing phrases (10-25 tokens)
- Excessive prose for bulleted content (30-100 tokens)
- Over-explanation beyond what was asked (50-200 tokens)
The categories with the most accumulated tokens are where optimization will have the most impact.
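Once the review is done, ranking the categories takes only a small script. The sketch below assumes you recorded each finding as a (category, wasted tokens) pair; the sample data and the category labels are hypothetical placeholders for your own annotations.

```python
from collections import Counter

# Hypothetical review notes: (category, wasted_tokens) for each issue found
# while reading the longest outputs. Replace with your own annotations.
review_notes = [
    ("preamble", 12),
    ("closing_phrase", 18),
    ("over_explanation", 140),
    ("preamble", 20),
    ("question_repetition", 25),
    ("over_explanation", 90),
]


def rank_waste(notes: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """Sum wasted tokens per category and sort from largest to smallest."""
    totals = Counter()
    for category, tokens in notes:
        totals[category] += tokens
    return totals.most_common()


for category, tokens in rank_waste(review_notes):
    print(f"{category:<20} {tokens}")
```

The category at the top of the printed list is the first candidate for a prompt or format change.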