OpenAI published a post on "Harness engineering" that describes how to leverage Codex in an agent-first world. The idea is simple: instead of treating LLMs as chat bots, you build a harness around them that controls execution, memory, and tool use. This is not a new concept (see LangChain, AutoGPT), but OpenAI's take is more focused on production reliability and safety.
What is a harness?
A harness is a wrapper that manages the LLM's inputs and outputs. It provides:
- A structured prompt that defines the agent's role, tools, and constraints.
- A loop that calls the LLM, parses its output (e.g., function calls), executes tools, and feeds results back.
- Safety checks: rate limits, content filters, human-in-the-loop gates.
OpenAI's example uses Codex (the model behind GitHub Copilot) to write and execute code. The harness is a Python script that runs in a sandboxed environment.
Concrete example: Codex-powered data analysis agent
Let's say you want an agent that can query a database and plot results. Instead of writing a monolithic script, you define tools:
# tools.py
import sqlite3
import matplotlib.pyplot as plt
def query_db(sql: str) -> list:
conn = sqlite3.connect("sales.db")
cur = conn.cursor()
cur.execute(sql)
rows = cur.fetchall()
conn.close()
return rows
def plot_data(x: list, y: list, title: str):
plt.plot(x, y)
plt.title(title)
plt.savefig("output.png")
return "output.png"
Your harness prompt tells Codex it can call these functions. The harness loop:
# harness.py
import openai
import json
from tools import query_db, plot_data
SYSTEM_PROMPT = """You are a data analysis assistant. You have access to these functions:
- query_db(sql: str) -> list
- plot_data(x: list, y: list, title: str) -> str
Respond with a JSON object containing 'function' and 'arguments' keys.
"""
def call_llm(messages):
response = openai.ChatCompletion.create(
model="gpt-4",
messages=messages,
functions=[
{
"name": "query_db",
"parameters": {
"type": "object",
"properties": {
"sql": {"type": "string"}
},
"required": ["sql"]
}
},
{
"name": "plot_data",
"parameters": {
"type": "object",
"properties": {
"x": {"type": "array", "items": {"type": "number"}},
"y": {"type": "array", "items": {"type": "number"}},
"title": {"type": "string"}
},
"required": ["x", "y", "title"]
}
}
],
function_call="auto"
)
return response.choices[0].message
def run_agent(user_query):
messages = [
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": user_query}
]
while True:
msg = call_llm(messages)
if msg.get("function_call"):
func_name = msg["function_call"]["name"]
args = json.loads(msg["function_call"]["arguments"])
if func_name == "query_db":
result = query_db(**args)
elif func_name == "plot_data":
result = plot_data(**args)
else:
result = "Unknown function"
messages.append(msg)
messages.append({"role": "function", "name": func_name, "content": json.dumps(result)})
else:
return msg["content"]
This is a minimal harness. In production, you add error handling, retries, and sandboxing.
Why harness engineering matters
In an agent-first world, you don't want the LLM to execute arbitrary code. You want it to call predefined tools. The harness is the glue that makes this safe and predictable. OpenAI's post emphasizes that the harness should be simple: a loop, a prompt, and a set of tools. No complex orchestration frameworks needed.
Tradeoffs
- Flexibility vs. safety: A tight harness limits what the agent can do, but prevents hallucinations from causing damage. You decide the tool set.
- Cost: Each loop iteration calls the API. If the agent needs many steps, costs add up. You can set a max iteration limit.
- Latency: Sequential calls increase response time. For real-time apps, consider batching or streaming.
- Model dependency: The harness assumes the model can correctly parse function calls. GPT-4 is good, but smaller models may fail. Test with your model.
Production considerations
- Sandboxing: Always run code execution in a container or VM. Use Docker or gVisor.
- Rate limiting: Protect your API keys and backend services.
- Observability: Log every LLM call, tool execution, and error. Use structured logging.
- Human-in-the-loop: For high-stakes actions (e.g., deleting data), require human approval.
Real-world use cases
- Customer support agents: Harness with tools for ticket lookup, refund processing, knowledge base search.
- Code review assistants: Tools to fetch PRs, run linters, post comments.
- Data pipeline automation: Agents that query databases, transform data, and trigger alerts.
Conclusion
Harness engineering is a practical pattern for building agentic systems with Codex. It's not about making the LLM smarter, but about controlling its outputs through a structured loop. Start with a simple harness, add tools gradually, and always sandbox code execution. The agent-first world is here, but the harness is what makes it safe.
Additional considerations for scaling
When you move from prototype to production, you need to handle more edge cases. For example, the harness should manage conversation history efficiently. If the agent runs many steps, the context window fills up. You can summarize or truncate old messages. Also, consider using streaming responses to reduce perceived latency. OpenAI's API supports streaming, so you can show partial results as the agent works.
Another aspect is tool versioning. As you update tools, the harness must ensure the LLM uses the correct version. You can include version numbers in the function definitions or use separate prompts for different tool sets.
Cost optimization example
Suppose each LLM call costs $0.01 and your agent averages 5 calls per task. That's $0.05 per task. For 10,000 tasks per month, that's $500. You can reduce costs by caching identical tool results or using a cheaper model for simple steps. For instance, use GPT-3.5 for routine lookups and GPT-4 only for complex reasoning.
Error handling pattern
In the loop, add a try-except around tool execution. If a tool fails, return an error message to the LLM so it can retry or ask for clarification. Also, set a maximum number of iterations (e.g., 10) to prevent infinite loops. Log every step for debugging.
Security considerations
Never expose internal credentials to the LLM. Use environment variables or a secrets manager. The harness should sanitize tool outputs before feeding them back to the LLM to prevent prompt injection. For example, if a tool returns user-generated content, strip HTML or limit length.
Testing your harness
Write unit tests for each tool. Then write integration tests that simulate the LLM responses (mock the API). Test edge cases like empty results, malformed function calls, and timeouts. Use a staging environment with a sandboxed database.
Conclusion
Harness engineering is a practical pattern for building agentic systems with Codex. It's not about making the LLM smarter, but about controlling its outputs through a structured loop. Start with a simple harness, add tools gradually, and always sandbox code execution. The agent-first world is here, but the harness is what makes it safe.
Keep Reading:
- What is MCP (Model Context Protocol)? A Practical Guide
- How to Set Up MCP Servers for Claude Code and Cursor
- Alternatives to pnpm: A Comparative Guide
Want to build your own agent harness? Try Zlyqor for managed agent infrastructure: https://app.zlyqor.com/signup