Reliably parsing LLM outputs is one of the most underestimated challenges in production AI applications. You ask for JSON, you get JSON plus an explanation paragraph. You ask for a list, you get a list with inconsistent formatting. You add schema instructions, success rate goes from 70% to 85% — still not good enough. Here is the full spectrum of approaches and where each lands on reliability.
The Reliability Spectrum
Not all output formats are equally parseable. The reliability of structured output is a function of how much you constrain the model's response format.
Level 0: Free-form prose (0% parseable programmatically). No format instruction: "Describe the sentiment of this review." The model writes a paragraph, which you cannot parse without another model call.
Level 1: Format instruction in prose (50-70% reliable). "Respond with only JSON." The model usually complies, but occasionally adds a preamble ("Here is the JSON you requested:") or wraps the JSON in a markdown code block.
Level 2: JSON schema in prompt (80-88% reliable). You provide the exact JSON schema the model should follow, with field names and types. Reliability improves significantly, but parsing still fails when the model cannot determine a value and writes an explanation instead of null.
Level 3: Function calling or tool use (95-99% reliable). Native function calling (OpenAI), tool use (Anthropic), and function declarations (Google) return the model's structured output in a dedicated field of the response, separate from free-form text. When you force a specific tool, the model has to answer with function arguments rather than prose, although without schema enforcement those arguments can still occasionally be malformed or omit a field.
Level 4: Structured output with schema validation (99%+ reliable, plus type safety). OpenAI's Structured Outputs feature (available in gpt-4o-2024-08-06 and later) and similar provider features enforce the schema at the API level: generation is constrained so that the output can only be JSON matching your schema. The remaining failures are cases such as an explicit refusal or a response cut off by the token limit. Combine this with Zod or Pydantic validation in your application for end-to-end type safety.
Level 1: Extracting JSON from Markdown Code Blocks
When you ask a model to return JSON and it wraps it in a markdown code block, you can extract the content with a regular expression:
import json
import re

def extract_json(text: str) -> dict | list:
    # Try to parse the whole response as JSON first
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Look for JSON in a markdown code block (with or without a "json" tag)
    match = re.search(r'```(?:json)?\s*(\{.*?\}|\[.*?\])\s*```', text, re.DOTALL)
    if match:
        return json.loads(match.group(1))
    # Look for any JSON object or array in the text. The match is greedy so
    # that nested objects are captured up to the last closing brace or bracket.
    match = re.search(r'(\{.*\}|\[.*\])', text, re.DOTALL)
    if match:
        return json.loads(match.group(1))
    raise ValueError(f"Could not extract JSON from response: {text[:200]}")
This handles the three most common cases: valid JSON response, JSON in a code block, and JSON embedded in prose.
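Regex-based extraction can struggle with deeply nested objects or with braces that appear inside string values. As an alternative, here is a minimal sketch that scans the reply and lets the standard library's json.JSONDecoder.raw_decode parse the first JSON value it can find. The function name find_first_json is hypothetical and chosen only for illustration.

```python
import json
from typing import Any


def find_first_json(text: str) -> Any:
    """Return the first JSON object or array embedded in text."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char not in "{[":
            continue
        try:
            # raw_decode parses one JSON value beginning at index `start`
            # and ignores whatever text follows it.
            value, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue  # not valid JSON from this position; keep scanning
        return value
    raise ValueError(f"Could not extract JSON from response: {text[:200]}")
```

Because the JSON parser itself decides where the value ends, surrounding prose, code-fence markers, and trailing commentary do not need to be stripped first.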
Level 2: JSON Schema in the Prompt
When using Level 1 prompting and seeing 70% parse success, adding a schema to the prompt typically brings this to 80-88%. Be explicit about types and which fields can be null.
Extract the key information from the following customer support email. Return ONLY a JSON object with no additional text. Use this exact schema:
{
  "customer_name": string | null,
  "issue_category": "billing" | "technical" | "account" | "other",
  "urgency": "low" | "medium" | "high",
  "key_details": string,
  "previous_ticket_mentioned": boolean
}
If a field's value cannot be determined from the email, set it to null (for nullable fields) or choose the best matching option (for enum fields).
Email: [email content here]
The most important additions: specify exactly which fields are nullable, specify enum values explicitly, and include the instruction to return ONLY JSON.
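Since a schema in the prompt is only a request, it can help to check the parsed result against the same rules before using it. The following is a rough sketch using only the standard library; validate_email_data is a hypothetical helper, and the field names and enum values are taken from the prompt above.

```python
from typing import Any

ISSUE_CATEGORIES = {"billing", "technical", "account", "other"}
URGENCY_LEVELS = {"low", "medium", "high"}


def validate_email_data(data: Any) -> list[str]:
    """Return a list of schema violations; an empty list means the data is valid."""
    if not isinstance(data, dict):
        return ["top-level value must be a JSON object"]
    errors = []
    # customer_name is the only nullable field: the key must be present,
    # and the value must be a string or null.
    if "customer_name" not in data:
        errors.append("customer_name is missing")
    elif data["customer_name"] is not None and not isinstance(data["customer_name"], str):
        errors.append("customer_name must be a string or null")
    # Enum fields must be one of the listed strings.
    category = data.get("issue_category")
    if not isinstance(category, str) or category not in ISSUE_CATEGORIES:
        errors.append("issue_category must be one of " + ", ".join(sorted(ISSUE_CATEGORIES)))
    urgency = data.get("urgency")
    if not isinstance(urgency, str) or urgency not in URGENCY_LEVELS:
        errors.append("urgency must be one of " + ", ".join(sorted(URGENCY_LEVELS)))
    if not isinstance(data.get("key_details"), str):
        errors.append("key_details must be a string")
    if not isinstance(data.get("previous_ticket_mentioned"), bool):
        errors.append("previous_ticket_mentioned must be a boolean")
    return errors
```

Returning a list of messages rather than raising on the first problem makes it straightforward to report every violation at once, for example when asking the model to correct its output.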
Level 3: Function Calling
Function calling (OpenAI), tool use (Anthropic), and function declarations (Google) are the most reliable pure-API approach short of schema-enforced structured output.
OpenAI function calling example:
from openai import OpenAI
import json

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Analyze this customer email: [email]"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "extract_email_data",
            "description": "Extract structured data from a customer support email",
            "parameters": {
                "type": "object",
                "properties": {
                    "customer_name": {"type": "string", "description": "Customer's name if mentioned"},
                    "issue_category": {
                        "type": "string",
                        "enum": ["billing", "technical", "account", "other"]
                    },
                    "urgency": {
                        "type": "string",
                        "enum": ["low", "medium", "high"]
                    }
                },
                "required": ["issue_category", "urgency"]
            }
        }
    }],
    tool_choice={"type": "function", "function": {"name": "extract_email_data"}}
)

tool_call = response.choices[0].message.tool_calls[0]
data = json.loads(tool_call.function.arguments)
Setting tool_choice to force the specific function prevents the model from choosing to respond with text instead of calling the function.
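The arguments of a tool call arrive as a JSON string generated by the model, so it is still worth reading them defensively. Here is a minimal sketch, assuming the assistant message has been converted to a plain dict (for instance with the SDK object's model_dump() method); read_tool_arguments is a hypothetical helper, not part of any SDK.

```python
import json
from typing import Any


def read_tool_arguments(message: dict[str, Any], expected_name: str) -> dict[str, Any]:
    """Return the parsed arguments of the first tool call, or raise ValueError."""
    tool_calls = message.get("tool_calls") or []
    if not tool_calls:
        raise ValueError("model returned no tool call")
    function = tool_calls[0]["function"]
    if function["name"] != expected_name:
        raise ValueError(f"unexpected tool: {function['name']}")
    try:
        arguments = json.loads(function["arguments"])
    except json.JSONDecodeError as exc:
        # Without schema enforcement the argument string can be malformed,
        # for example when the response is cut off by the token limit.
        raise ValueError(f"tool arguments are not valid JSON: {exc}") from exc
    if not isinstance(arguments, dict):
        raise ValueError("tool arguments must be a JSON object")
    return arguments
```

Raising a single exception type for every failure mode gives the calling code one place to decide whether to retry or fall back.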
Level 4: Structured Output with Zod Validation
For TypeScript applications, combining OpenAI's Structured Outputs with Zod gives you API-level format guarantees plus runtime type safety:
import OpenAI from "openai";
import { zodResponseFormat } from "openai/helpers/zod";
import { z } from "zod";

const EmailDataSchema = z.object({
  customer_name: z.string().nullable(),
  issue_category: z.enum(["billing", "technical", "account", "other"]),
  urgency: z.enum(["low", "medium", "high"]),
  key_details: z.string(),
  previous_ticket_mentioned: z.boolean(),
});

const client = new OpenAI();

const response = await client.beta.chat.completions.parse({
  model: "gpt-4o-2024-08-06",
  messages: [{ role: "user", content: "Analyze this email: [email]" }],
  response_format: zodResponseFormat(EmailDataSchema, "email_data"),
});

const emailData = response.choices[0].message.parsed;
// emailData is typed as z.infer<typeof EmailDataSchema> | null (null if the model refused)
This approach combines the API-level guarantee (a completed response matches the schema; otherwise the call surfaces a refusal or an error) with Zod's type inference, giving you a typed object in your application code with no manual parsing.
Error Handling: When the Model Refuses to Format
Even at Level 3 and 4, edge cases exist where the model cannot produce a valid structured output — typically because the input does not contain the required information or the model determines it cannot confidently answer.
Handling strategies:
Retry with error feedback. If parsing fails, send the model's malformed output back with an instruction to fix it: "Your previous response could not be parsed. The error was: [error]. Please return only valid JSON matching the schema."
Fallback to null. For optional fields, instruct the model (in the schema description) to return null rather than making up information: "If the customer's name is not mentioned in the email, return null for customer_name."
Graceful degradation. If structured parsing fails after retries, fall back to free-form text and process it differently. Do not crash — have a defined path for parsing failures.
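The retry and degradation strategies above can be sketched as a small loop. This is an illustration under stated assumptions: call_model is a hypothetical callable that sends a prompt to your provider and returns the reply text, and parse_with_retries is a name invented for this example.

```python
import json
from typing import Any, Callable


def parse_with_retries(
    call_model: Callable[[str], str],
    prompt: str,
    max_attempts: int = 3,
) -> tuple[bool, Any]:
    """Return (True, parsed_json) on success, or (False, last_raw_reply) on failure."""
    current_prompt = prompt
    reply = ""
    for _ in range(max_attempts):
        reply = call_model(current_prompt)
        try:
            return True, json.loads(reply)
        except json.JSONDecodeError as exc:
            # Retry with error feedback: show the model its own output
            # together with the parser's complaint.
            current_prompt = (
                f"{prompt}\n\n"
                f"Your previous response could not be parsed. The error was: {exc}. "
                f"Previous response: {reply}\n"
                "Please return only valid JSON matching the schema."
            )
    # Graceful degradation: hand back the raw text so the caller can take
    # a defined fallback path instead of crashing.
    return False, reply
```

Returning a success flag alongside the value keeps a legitimate JSON null distinguishable from a failed parse, and capping max_attempts bounds the cost of a reply that never becomes valid.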
Choosing the Right Level
Use Level 1 (regex extraction) only for internal tooling or low-stakes use cases where 70% reliability is acceptable.
Use Level 2 (schema in prompt) when you do not have access to function calling (e.g., open-source models or some third-party API wrappers).
Use Level 3 (function calling / tool use) for production applications with OpenAI, Anthropic, or Google where you need 95%+ reliability.
Use Level 4 (structured output + Zod/Pydantic) for production applications where parsing errors are unacceptable and you want full type safety in your application code.
Summary
The path from 0% to 99%+ parseable output is a series of concrete technical choices: include a schema in the prompt, use function calling, enforce schema compliance at the API level, and validate with Zod or Pydantic. Each level adds reliability. The right level for your application depends on your tolerance for parsing failures, your chosen API provider's capabilities, and whether end-to-end type safety matters for your codebase.