Data extraction is one of the highest-value LLM use cases — turning unstructured text into queryable, processable records. It is also one of the easiest to get wrong when the prompt is underspecified. The goal of this guide is to make your extraction prompts precise enough that the output can feed directly into a database or pipeline without manual correction.
The Core Pattern: Specify Field Names and Types
The minimum viable extraction prompt names every field you want and its expected type:
Extract the following fields from the job posting below. Return JSON only.
Fields:
- job_title (string)
- company_name (string)
- location (string, city and country)
- salary_min (number, USD, null if not mentioned)
- salary_max (number, USD, null if not mentioned)
- remote (boolean)
- required_years_experience (number, null if not mentioned)
- required_skills (array of strings)
Job posting:
[text here]
The type annotations serve two purposes: they tell the model what format to use (number, not "five years"), and they set your parsing expectations so you can write deterministic downstream code.
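On the parsing side, the field list above can be mirrored by a typed record. The following is a minimal sketch, not a definitive implementation: the `JobPosting` class and `parse_job_posting` helper are hypothetical names, and the sample reply is invented for illustration.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobPosting:
    # One attribute per field named in the prompt, with the same types.
    job_title: str
    company_name: str
    location: str
    salary_min: Optional[float]
    salary_max: Optional[float]
    remote: bool
    required_years_experience: Optional[float]
    required_skills: list[str]


def parse_job_posting(raw: str) -> JobPosting:
    """Parse a JSON reply into a JobPosting.

    Raises json.JSONDecodeError on invalid JSON and TypeError when a key
    is missing or unexpected. It does not check value types.
    """
    return JobPosting(**json.loads(raw))


# Invented sample reply, used only to show the round trip.
reply = """{
  "job_title": "Data Engineer",
  "company_name": "Acme",
  "location": "Berlin, Germany",
  "salary_min": null,
  "salary_max": null,
  "remote": true,
  "required_years_experience": 3,
  "required_skills": ["Python", "SQL"]
}"""

posting = parse_job_posting(reply)
```

Because the dataclass constructor rejects missing and unexpected keys, a reply that drifts from the field list fails loudly at parse time instead of later in the pipeline.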
Provide a JSON Schema for Complex Extractions
For more complex structures, provide a JSON template of the exact output shape rather than a list of field descriptions. This removes most of the ambiguity about nesting, nullable fields, and enumerated values:
Extract contract information from the following text. Return a JSON object matching this exact schema:
{
  "parties": [
    {
      "name": "string",
      "role": "buyer" | "seller" | "contractor" | "client",
      "jurisdiction": "string or null"
    }
  ],
  "effective_date": "ISO 8601 date string or null",
  "termination_date": "ISO 8601 date string or null",
  "payment_terms": {
    "amount": "number or null",
    "currency": "string or null",
    "schedule": "string or null"
  },
  "governing_law": "string or null"
}
Contract text:
[text here]
The explicit "string or null" notation in the schema tells the model that missing fields should be null, not omitted from the JSON. This matters because omitted keys raise lookup errors in downstream code (a KeyError in Python, for example), while null values can be checked for explicitly.
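The difference between a null field and an omitted one is easy to check before any downstream code touches the record. This is a small sketch under the assumption that the top-level keys are the ones in the contract template above; `missing_keys` is a hypothetical helper name and both sample records are invented.

```python
# Top-level keys from the contract template.
EXPECTED_KEYS = {
    "parties",
    "effective_date",
    "termination_date",
    "payment_terms",
    "governing_law",
}


def missing_keys(record: dict, expected: set[str] = EXPECTED_KEYS) -> list[str]:
    """Return expected keys that are absent from the record, sorted by name.

    A key whose value is None counts as present.
    """
    return sorted(expected - set(record))


# Every key present, unknown values set to null: nothing is reported.
with_nulls = {
    "parties": [],
    "effective_date": None,
    "termination_date": None,
    "payment_terms": None,
    "governing_law": None,
}

# Unknown values dropped instead of nulled: three keys are reported.
with_omissions = {"parties": [], "effective_date": None}

print(missing_keys(with_nulls))      # []
print(missing_keys(with_omissions))  # ['governing_law', 'payment_terms', 'termination_date']
```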
Function Calling for Guaranteed Structure
When using APIs that support function calling (OpenAI function calling, Anthropic tool use), you can request output structure at the API level rather than relying on prompt instructions alone. How strong the guarantee is depends on the provider and mode. With a strict or structured-output mode enabled, the arguments are constrained to your schema. Without one, the model usually follows the schema but can still omit a field or return malformed arguments.
Example with OpenAI function calling:
{
  "name": "extract_job_posting",
  "description": "Extract structured data from a job posting",
  "parameters": {
    "type": "object",
    "properties": {
      "job_title": {"type": "string"},
      "company_name": {"type": "string"},
      "salary_min": {"type": ["number", "null"]},
      "salary_max": {"type": ["number", "null"]},
      "remote": {"type": "boolean"},
      "required_skills": {
        "type": "array",
        "items": {"type": "string"}
      }
    },
    "required": ["job_title", "company_name", "remote", "required_skills"]
  }
}
Function calling removes most of the work of parsing and validating the model's output format. Keep a lightweight structural check in place, but your main concern becomes the accuracy of the extracted values rather than their structure.
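The sketch below shows how the definition above might be wired up without making a network call. It assumes the Chat Completions convention of wrapping a function definition in a `tools` entry and receiving the call's arguments as a JSON string. The helper names `as_tool` and `parse_tool_arguments` are hypothetical, not part of any SDK.

```python
import json

# The function definition from the example above.
JOB_POSTING_FUNCTION = {
    "name": "extract_job_posting",
    "description": "Extract structured data from a job posting",
    "parameters": {
        "type": "object",
        "properties": {
            "job_title": {"type": "string"},
            "company_name": {"type": "string"},
            "salary_min": {"type": ["number", "null"]},
            "salary_max": {"type": ["number", "null"]},
            "remote": {"type": "boolean"},
            "required_skills": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["job_title", "company_name", "remote", "required_skills"],
    },
}


def as_tool(function_def: dict) -> dict:
    """Wrap a function definition in the shape of a `tools` list entry."""
    return {"type": "function", "function": function_def}


def parse_tool_arguments(arguments: str, required: list[str]) -> dict:
    """Decode the JSON argument string and confirm the required keys exist."""
    data = json.loads(arguments)
    missing = [key for key in required if key not in data]
    if missing:
        raise ValueError(f"tool call is missing required keys: {missing}")
    return data


tools = [as_tool(JOB_POSTING_FUNCTION)]
```

The required-key check is cheap insurance for setups where the schema is not strictly enforced by the API.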
Handle Missing Fields Explicitly
The instruction "return null if the information is not mentioned" is essential for any field that is not always present. Without it, models tend to fill in plausible values inferred from context, which silently corrupts your data.
Specific missing-field instructions:
Rules for missing information:
- If a field is not explicitly stated in the text, set it to null
- Do NOT infer or guess values. If the salary is not mentioned, salary_min and salary_max must be null
- Do NOT extract information from context clues — only from explicit statements
- If a date is mentioned ambiguously (e.g., "next quarter"), set the date field to null and use a notes field to record the raw text: "notes": "start date: next quarter"
The notes field pattern is particularly useful for preserving information that exists in the text but does not map cleanly to your schema.
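The "explicit statements only" rule can also be spot-checked after the fact. The sketch below is a rough heuristic, not a complete hallucination detector: it flags string fields whose value does not appear verbatim in the source text. The function name `ungrounded_fields` and the sample data are invented, and the check only suits fields copied from the text. Normalized fields such as ISO dates will be flagged even when correct.

```python
def ungrounded_fields(record: dict, source_text: str) -> list[str]:
    """Return names of string fields whose value is not found verbatim
    (case-insensitively) in the source text."""
    haystack = source_text.casefold()
    flagged = []
    for name, value in record.items():
        # Only plain strings are checked; nulls, numbers, and lists are skipped.
        if isinstance(value, str) and value.casefold() not in haystack:
            flagged.append(name)
    return flagged


source = "Acme Corp is hiring a Data Engineer in Berlin."
record = {
    "job_title": "Data Engineer",
    "company_name": "Acme Corp",
    "location": "Berlin, Germany",  # "Germany" was inferred, not stated
    "salary_min": None,
}

print(ungrounded_fields(record, source))  # ['location']
```

Flagged fields are candidates for review rather than confirmed errors.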
Batch Extraction Efficiency
Running one LLM call per document is expensive and slow. For extraction tasks where the schema is fixed, batch multiple documents in a single call:
Extract the following fields from each of the job postings below. Return a JSON array with one object per posting, in the same order as the input.
Fields: [field list]
Posting 1:
[text]
Posting 2:
[text]
Posting 3:
[text]
Practical limits: as a starting point, keep batches to roughly 10-15 documents per call, then adjust for document length. Short documents (under 200 words) tolerate 15-20 per call; long documents are better kept to 3-5. Larger batches increase error rates because the model must track more context at once.
Always verify that the output array length matches the input count. Models occasionally merge adjacent records when documents are similar, dropping one.
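A minimal sketch of the batching pattern with the length check built in. The function names `build_batch_prompt` and `parse_batch_reply` are hypothetical, and the prompt wording simply reuses the template above.

```python
import json


def build_batch_prompt(field_list: str, postings: list[str]) -> str:
    """Assemble one prompt covering several postings, numbered from 1."""
    parts = [
        "Extract the following fields from each of the job postings below. "
        "Return a JSON array with one object per posting, in the same order "
        "as the input.",
        f"Fields: {field_list}",
    ]
    for index, text in enumerate(postings, start=1):
        parts.append(f"Posting {index}:\n{text}")
    return "\n\n".join(parts)


def parse_batch_reply(raw: str, expected_count: int) -> list[dict]:
    """Parse the reply and reject it if the record count does not match."""
    records = json.loads(raw)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array at the top level")
    if len(records) != expected_count:
        raise ValueError(
            f"expected {expected_count} records, got {len(records)}"
        )
    return records
```

When the count check fails, retrying the batch in smaller pieces is usually safer than trying to guess which record was dropped.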
Nested and Relational Extraction
Some documents contain repeated structures — a contract with multiple parties, an invoice with multiple line items, a resume with multiple jobs. Handle these with array fields in your schema:
Extract all work experience entries from this resume. Return a JSON array where each entry is:
{
"company": "string",
"title": "string",
"start_date": "YYYY-MM or null",
"end_date": "YYYY-MM or null, use null for current positions",
"description": "string, 1-2 sentence summary of responsibilities"
}
Return the array in reverse chronological order (most recent first).
Resume:
[text]
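Ordering and date-format instructions are worth verifying in code rather than trusting. The sketch below assumes entries shaped like the template above; `check_experience` and `sort_most_recent_first` are hypothetical helper names. It also assumes a null end_date means a current position, as the prompt instructs.

```python
import re

# Matches YYYY-MM with a month between 01 and 12.
YEAR_MONTH = re.compile(r"^\d{4}-(0[1-9]|1[0-2])$")


def check_experience(entries: list[dict]) -> list[str]:
    """Report date fields that are neither null nor a valid YYYY-MM string."""
    problems = []
    for index, entry in enumerate(entries):
        for key in ("start_date", "end_date"):
            value = entry.get(key)
            if value is None:
                continue
            if not (isinstance(value, str) and YEAR_MONTH.match(value)):
                problems.append(
                    f"entry {index}: {key} is not YYYY-MM or null: {value!r}"
                )
    return problems


def sort_most_recent_first(entries: list[dict]) -> list[dict]:
    """Sort with current positions (null end_date) first, then by end_date
    and start_date descending. YYYY-MM strings sort correctly as text."""
    return sorted(
        entries,
        key=lambda e: (
            e.get("end_date") is None,
            e.get("end_date") or "",
            e.get("start_date") or "",
        ),
        reverse=True,
    )
```

Re-sorting in code makes the output order deterministic even when the model ignores the ordering instruction.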
Accuracy Limits and Ambiguous References
Extraction has hard accuracy ceilings for certain types of information:
Ambiguous references: "The company" in a document with multiple companies mentioned. "The date" when several dates appear. Models make reasonable guesses, but guesses are errors in an extraction pipeline.
Implicit information: Information that requires inference across multiple sentences. "The project started in Q1 and ran for 18 months" requires the model to compute an end date. This works most of the time but fails on complex date arithmetic.
Tabular data: Models extract from tables inconsistently, especially when columns are wide or values span cells. For important tables, consider extracting the raw table text separately and prompting specifically against it.
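For the implicit-information case, one option is to have the model extract only what is stated (a start month and a duration) and do the arithmetic in code. This is a sketch under that assumption. The `add_months` helper is hypothetical, and mapping "Q1 2023" to the start month 2023-01 is an interpretation you would need to fix in your own rules.

```python
def add_months(year_month: str, months: int) -> str:
    """Add a number of months to a YYYY-MM string and return YYYY-MM."""
    year, month = map(int, year_month.split("-"))
    # Count months from year 0 so the carry into the year is automatic.
    total = year * 12 + (month - 1) + months
    return f"{total // 12:04d}-{total % 12 + 1:02d}"


# "The project started in Q1 and ran for 18 months", with Q1 2023
# read as a 2023-01 start.
print(add_months("2023-01", 18))  # 2024-07
```

The model then handles only the reading step, and the date computation becomes deterministic and testable.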
Rough accuracy expectations (actual figures vary with the model, the prompt, and document quality, so measure on your own data):
- Explicitly stated, clearly labeled fields: 95-99% accuracy
- Fields requiring minor interpretation: 85-95%
- Fields requiring inference or disambiguation: 70-85%
- Complex date arithmetic or cross-reference resolution: 60-80%
For pipelines requiring 99%+ accuracy on any field, plan for human review on that specific field rather than relying on the model alone.
Post-Processing and Validation
Build validation into your extraction pipeline:
from datetime import datetime

def validate_extraction(result: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the record passed."""
    errors = []

    # Type checks. bool is a subclass of int in Python, so reject it explicitly.
    salary_min = result.get("salary_min")
    if salary_min is not None:
        if isinstance(salary_min, bool) or not isinstance(salary_min, (int, float)):
            errors.append("salary_min must be a number or null")

    # Date format. TypeError covers non-string values such as numbers.
    if result.get("effective_date"):
        try:
            datetime.fromisoformat(result["effective_date"])
        except (TypeError, ValueError):
            errors.append(f"effective_date is not valid ISO 8601: {result['effective_date']}")

    # Required fields must be present and non-empty.
    for field in ["job_title", "company_name"]:
        if not result.get(field):
            errors.append(f"Required field missing: {field}")

    return errors
Flag records with validation errors for human review rather than silently discarding or passing them through.
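The routing step can be as simple as splitting records into two lists. This is a minimal sketch: `route_records` is a hypothetical name, and it accepts any validator that returns a list of error strings, which is the shape `validate_extraction` uses above. The stand-in validator here exists only to keep the example self-contained.

```python
from typing import Callable


def route_records(
    records: list[dict],
    validate: Callable[[dict], list[str]],
) -> tuple[list[dict], list[dict]]:
    """Split records into those that pass validation and those queued
    for human review, keeping the errors next to each queued record."""
    accepted, review_queue = [], []
    for record in records:
        errors = validate(record)
        if errors:
            review_queue.append({"record": record, "errors": errors})
        else:
            accepted.append(record)
    return accepted, review_queue


def require_job_title(record: dict) -> list[str]:
    # Stand-in validator for illustration only.
    return [] if record.get("job_title") else ["Required field missing: job_title"]


accepted, review_queue = route_records(
    [{"job_title": "Data Engineer"}, {"job_title": ""}],
    require_job_title,
)
```

Storing the error messages with each queued record tells the reviewer what to look at without rerunning validation.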