Prompting for Data Extraction: Structured Output From Unstructured Text
A complete guide to extracting structured data from text with LLMs - field definitions, JSON schemas, function calling for guaranteed structure, missing field handling, batch efficiency, and accuracy limits.
Data extraction is one of the highest-value LLM use cases - turning unstructured text into queryable, processable records. It is also one of the easiest to get wrong when the prompt is underspecified. The goal of this guide is to make your extraction prompts precise enough that the output can feed directly into a database or pipeline without manual correction.
The Core Pattern: Specify Field Names and Types
The minimum viable extraction prompt names every field you want and its expected type:
Extract the following fields from the job posting below. Return JSON only.
Fields:
- job_title (string)
- company_name (string)
- location (string, city and country)
- salary_min (number, USD, null if not mentioned)
- salary_max (number, USD, null if not mentioned)
- remote (boolean)
- required_years_experience (number, null if not mentioned)
- required_skills (array of strings)
Job posting:
[text here]
The type annotations serve two purposes: they tell the model what format to use (number, not "five years"), and they set your parsing expectations so you can write deterministic downstream code.
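The parsing side of those type annotations can be sketched as follows. This is a minimal illustration, not a definitive implementation: `FIELD_SPEC` and `parse_job_posting` are hypothetical names, and the nullable flags assume only the fields marked "null if not mentioned" in the prompt above may be null.

```python
import json

# Hypothetical spec mirroring the prompt: field name -> (accepted Python types, nullable).
FIELD_SPEC = {
    "job_title": ((str,), False),
    "company_name": ((str,), False),
    "location": ((str,), False),
    "salary_min": ((int, float), True),
    "salary_max": ((int, float), True),
    "remote": ((bool,), False),
    "required_years_experience": ((int, float), True),
    "required_skills": ((list,), False),
}


def parse_job_posting(raw: str) -> dict:
    """Parse the model's JSON reply and check every field against FIELD_SPEC."""
    record = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on malformed JSON
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object")
    for name, (types, nullable) in FIELD_SPEC.items():
        if name not in record:
            raise ValueError(f"missing field: {name}")
        value = record[name]
        if value is None:
            if not nullable:
                raise ValueError(f"{name} must not be null")
            continue
        # bool is a subclass of int in Python, so reject it for numeric fields.
        if isinstance(value, bool) and bool not in types:
            raise ValueError(f"{name} has the wrong type: {value!r}")
        if not isinstance(value, types):
            raise ValueError(f"{name} has the wrong type: {value!r}")
    return record
```

A reply that contains "five years" instead of a number fails loudly here instead of reaching the database.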
Provide a JSON Schema for Complex Extractions
For more complex structures, provide the exact JSON schema rather than a list of field descriptions. This eliminates ambiguity about nesting, required vs optional fields, and enumerated values:
Extract contract information from the following text. Return a JSON object matching this exact schema:
{
  "parties": [
    {
      "name": "string",
      "role": "buyer" | "seller" | "contractor" | "client",
      "jurisdiction": "string or null"
    }
  ],
  "effective_date": "ISO 8601 date string or null",
  "termination_date": "ISO 8601 date string or null",
  "payment_terms": {
    "amount": "number or null",
    "currency": "string or null",
    "schedule": "string or null"
  },
  "governing_law": "string or null"
}
Contract text:
[text here]
The explicit "string or null" notation in the schema tells the model that missing fields should be null, not omitted from the JSON. This matters because an omitted field raises a key error in downstream code that indexes it directly, while a null value is present and can be checked explicitly.
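The difference between a null field and an omitted one can be seen in a few lines of Python. This is a small sketch; `with_explicit_nulls` is a hypothetical helper for normalizing replies that drop keys anyway.

```python
import json

with_null = json.loads('{"governing_law": null}')
omitted = json.loads('{}')

# JSON null becomes Python None: the key exists and the absence is explicit.
law = with_null["governing_law"]

# An omitted key raises KeyError when indexed directly.
try:
    omitted["governing_law"]
    omitted_raises = False
except KeyError:
    omitted_raises = True


def with_explicit_nulls(record: dict, keys: list[str]) -> dict:
    """Return a copy in which every expected key exists, using None for absent ones."""
    return {**record, **{key: record.get(key) for key in keys}}


normalized = with_explicit_nulls(omitted, ["effective_date", "governing_law"])
```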
Written by Mahmudul Haque Qudrati, CEO and ML Engineer at Pristren. Builds AI-powered software for teams and writes about machine learning, LLMs, developer tools, and practical AI applications.
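Use Function Calling to Enforce Structure
When using APIs that support function calling (OpenAI function calling, Anthropic tool use), you can enforce output structure at the API level rather than relying on prompt instructions alone. With strict schema enforcement enabled (for example, OpenAI's structured outputs), the model is constrained to return valid JSON matching your schema or fail entirely, so there is no "almost valid JSON" case to handle. Without strict enforcement, the arguments usually match the schema but are not guaranteed to, so keep a validation step.
With enforcement in place, function calling removes most of the work of parsing and repairing the model's output format. Your main concern becomes the accuracy of the extracted values, not their structure.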
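As an illustration, the job-posting fields could be expressed as a JSON Schema and wrapped in a tool definition. This is a sketch only: the tool name `record_job_posting` is hypothetical, the field list is shortened, and the wrapper shapes reflect the OpenAI Chat Completions and Anthropic Messages `tools` parameters as I understand them, so check the current provider documentation before relying on them. No API call is made here.

```python
import json

# JSON Schema for a subset of the job-posting fields; nullable fields use a type union.
JOB_POSTING_SCHEMA = {
    "type": "object",
    "properties": {
        "job_title": {"type": "string"},
        "company_name": {"type": "string"},
        "salary_min": {"type": ["number", "null"]},
        "salary_max": {"type": ["number", "null"]},
        "remote": {"type": "boolean"},
        "required_skills": {"type": "array", "items": {"type": "string"}},
    },
    # Every field is required so that missing values arrive as null, not as absent keys.
    "required": [
        "job_title", "company_name", "salary_min",
        "salary_max", "remote", "required_skills",
    ],
    "additionalProperties": False,
}

DESCRIPTION = "Record the fields extracted from a job posting."

# Assumed shape for OpenAI's Chat Completions `tools` parameter, with strict mode on.
openai_tool = {
    "type": "function",
    "function": {
        "name": "record_job_posting",  # hypothetical tool name
        "description": DESCRIPTION,
        "strict": True,
        "parameters": JOB_POSTING_SCHEMA,
    },
}

# Assumed shape for Anthropic's Messages API `tools` parameter.
anthropic_tool = {
    "name": "record_job_posting",  # hypothetical tool name
    "description": DESCRIPTION,
    "input_schema": JOB_POSTING_SCHEMA,
}


def parse_tool_arguments(arguments_json: str) -> dict:
    """Decode tool-call arguments that arrive as a JSON string."""
    return json.loads(arguments_json)
```

Keeping one schema and wrapping it per provider means the field definitions live in a single place.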
Handle Missing Fields Explicitly
The instruction "return null if the information is not mentioned" is essential for any field that is not always present. Without it, models tend to fill in plausible values inferred from context, which silently corrupts your data.
Specific missing-field instructions:
Rules for missing information:
- If a field is not explicitly stated in the text, set it to null
- Do NOT infer or guess values. If the salary is not mentioned, salary_min and salary_max must be null
- Do NOT extract information from context clues - only from explicit statements
- If a date is mentioned ambiguously (e.g., "next quarter"), set the date field to null and use a notes field to record the raw text: "notes": "start date: next quarter"
The notes field pattern is particularly useful for preserving information that exists in the text but does not map cleanly to your schema.
Batch Extraction Efficiency
Running one LLM call per document is expensive and slow. For extraction tasks where the schema is fixed, batch multiple documents in a single call:
Extract the following fields from each of the job postings below. Return a JSON array with one object per posting, in the same order as the input.
Fields: [field list]
Posting 1:
[text]
Posting 2:
[text]
Posting 3:
[text]
Practical limits: as a starting point, keep batches to about 10-15 documents per call. Larger batches increase error rates because the model must track more context at once. The best batch size depends on document length: short documents (under 200 words) can go up to 15-20 per call, while long documents do better at 3-5.
Always verify that the output array length matches the input count. Models occasionally merge adjacent records when documents are similar, dropping one.
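The batching and length check described above might look like this in Python. It is a sketch under stated assumptions: `chunk`, `build_batch_prompt`, and `parse_batch` are hypothetical helper names, and the call to the model itself is left out.

```python
import json


def chunk(documents: list[str], size: int) -> list[list[str]]:
    """Split documents into consecutive batches of at most `size` items."""
    return [documents[i:i + size] for i in range(0, len(documents), size)]


def build_batch_prompt(field_list: str, postings: list[str]) -> str:
    """Assemble the batch prompt with numbered postings, in input order."""
    header = (
        "Extract the following fields from each of the job postings below. "
        "Return a JSON array with one object per posting, "
        "in the same order as the input.\n"
        f"Fields: {field_list}\n"
    )
    body = "\n".join(
        f"Posting {number}:\n{text}" for number, text in enumerate(postings, start=1)
    )
    return header + body


def parse_batch(raw: str, expected_count: int) -> list[dict]:
    """Parse the model's reply and reject it if the record count is off."""
    records = json.loads(raw)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array")
    if len(records) != expected_count:
        raise ValueError(f"expected {expected_count} records, got {len(records)}")
    return records
```

When `parse_batch` raises, one reasonable fallback is to retry that batch at a smaller size or one document at a time.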
Nested and Relational Extraction
Some documents contain repeated structures - a contract with multiple parties, an invoice with multiple line items, a resume with multiple jobs. Handle these with array fields in your schema:
Extract all work experience entries from this resume. Return a JSON array where each entry is:
{
  "company": "string",
  "title": "string",
  "start_date": "YYYY-MM or null",
  "end_date": "YYYY-MM or null, use null for current positions",
  "description": "string, 1-2 sentence summary of responsibilities"
}
Return the array in reverse chronological order (most recent first).
Resume:
[text]
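Because the ordering and date format are only requested in the prompt, it can help to check and re-sort the returned array in code. The following is a minimal sketch; `date_errors` and `most_recent_first` are hypothetical helpers, and treating a null end_date as "current" follows the schema above.

```python
import re

# Four-digit year, a hyphen, and a month from 01 to 12.
YYYY_MM = re.compile(r"\d{4}-(0[1-9]|1[0-2])")


def date_errors(entries: list[dict]) -> list[str]:
    """List every start_date or end_date that is neither null nor YYYY-MM."""
    errors = []
    for index, entry in enumerate(entries):
        for key in ("start_date", "end_date"):
            value = entry.get(key)
            if value is None:
                continue
            if not (isinstance(value, str) and YYYY_MM.fullmatch(value)):
                errors.append(f"entry {index}: {key} is not YYYY-MM or null: {value!r}")
    return errors


def most_recent_first(entries: list[dict]) -> list[dict]:
    """Sort current positions (null end_date) first, then by latest end and start date."""
    def sort_key(entry: dict) -> tuple[str, str]:
        # "9999-12" is a sentinel that sorts after every real YYYY-MM string.
        return (entry.get("end_date") or "9999-12", entry.get("start_date") or "")
    return sorted(entries, key=sort_key, reverse=True)
```

YYYY-MM strings sort correctly as plain text, which is why no date parsing is needed for the ordering.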
Accuracy Limits and Ambiguous References
Extraction has hard accuracy ceilings for certain types of information:
Ambiguous references: "The company" in a document with multiple companies mentioned. "The date" when several dates appear. Models make reasonable guesses, but guesses are errors in an extraction pipeline.
Implicit information: Information that requires inference across multiple sentences. "The project started in Q1 and ran for 18 months" requires the model to compute an end date. This works most of the time but fails on complex date arithmetic.
Tabular data: Models extract from tables inconsistently, especially when columns are wide or values span cells. For important tables, consider extracting the raw table text separately and prompting specifically against it.
Approximate accuracy by field type:
Explicitly stated fields: 95-99%
Fields requiring inference or disambiguation: 70-85%
Complex date arithmetic or cross-reference resolution: 60-80%
For pipelines requiring 99%+ accuracy on any field, plan for human review on that specific field rather than relying on the model alone.
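For the implicit-date case above ("started in Q1 and ran for 18 months"), one option is to have the model extract only what is stated (start quarter, year, duration) and do the arithmetic in code. The sketch below assumes a quarter starts in its first month and uses a hypothetical year, since the example sentence does not give one; `add_months` and `span_end` are illustrative names.

```python
# First month of each calendar quarter (assumption: "Q1" means starting in January).
QUARTER_START_MONTH = {1: 1, 2: 4, 3: 7, 4: 10}


def add_months(year: int, month: int, months: int) -> tuple[int, int]:
    """Return the (year, month) that lies `months` months after the given month."""
    index = year * 12 + (month - 1) + months
    return index // 12, index % 12 + 1


def span_end(start_year: int, start_quarter: int, duration_months: int) -> tuple[int, int]:
    """Return the first (year, month) after a span that begins at the quarter start."""
    start_month = QUARTER_START_MONTH[start_quarter]
    return add_months(start_year, start_month, duration_months)


# Hypothetical example: a project that started in Q1 2023 and ran for 18 months.
end_year, end_month = span_end(2023, 1, 18)
```

This keeps the model responsible for reading the text and the code responsible for the calendar math.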
Post-Processing and Validation
Build validation into your extraction pipeline:
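from datetime import datetime


def validate_extraction(result: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the record passed."""
    errors = []

    # Type checks: salary_min must be a number or null.
    # bool is a subclass of int in Python, so exclude it explicitly.
    salary_min = result.get("salary_min")
    if salary_min is not None:
        if isinstance(salary_min, bool) or not isinstance(salary_min, (int, float)):
            errors.append("salary_min must be a number or null")

    # Date format: fromisoformat raises ValueError for a bad string
    # and TypeError for a non-string value.
    if result.get("effective_date"):
        try:
            datetime.fromisoformat(result["effective_date"])
        except (ValueError, TypeError):
            errors.append(f"effective_date is not valid ISO 8601: {result['effective_date']}")

    # Required fields: must be present and non-empty.
    for field in ["job_title", "company_name"]:
        if not result.get(field):
            errors.append(f"Required field missing: {field}")

    return errors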
Flag records with validation errors for human review rather than silently discarding or passing them through.
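Routing on validation results could be sketched like this. The `route_records` function and the stand-in `check_required` validator are hypothetical; any function that takes a record and returns a list of error strings would fit in its place.

```python
from typing import Callable

Validator = Callable[[dict], list[str]]


def check_required(record: dict) -> list[str]:
    """Stand-in validator: report required fields that are missing or empty."""
    return [
        f"Required field missing: {field}"
        for field in ("job_title", "company_name")
        if not record.get(field)
    ]


def route_records(records: list[dict], validate: Validator) -> tuple[list[dict], list[dict]]:
    """Split records into accepted ones and ones queued for human review."""
    accepted, needs_review = [], []
    for record in records:
        errors = validate(record)
        if errors:
            # Keep the record together with its errors so a reviewer sees both.
            needs_review.append({"record": record, "errors": errors})
        else:
            accepted.append(record)
    return accepted, needs_review
```

Nothing is dropped: every input record ends up in exactly one of the two lists.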
Frequently Asked Questions
What is prompting for data extraction?
Prompting for data extraction is the practice of instructing an LLM to pull specific structured information from unstructured text. You define fields (e.g., name, date, amount) and their types, and the model returns a JSON object. This turns free-form text into data you can query, store, or process programmatically.
How does prompting for data extraction work?
You provide a prompt that lists the fields you want (with types and constraints) and the source text. The LLM reads the text and outputs a JSON object matching your schema. For stronger structural guarantees, you can use function calling (OpenAI function calling, Anthropic tool use), which applies your schema at the API level. The model identifies relevant text spans and maps them to your fields, returning null for missing information.
What are the best practices for prompting data extraction?
Key best practices: (1) Always specify field names and types. (2) Provide a full JSON schema for complex structures. (3) Use function calling when possible to enforce output format. (4) Explicitly instruct the model to return null for missing fields. (5) Batch multiple documents in one call for efficiency. (6) Validate outputs with post-processing checks. (7) For ambiguous or implicit information, use a notes field to preserve raw text.
How much does prompting for data extraction cost?
Cost depends on the LLM provider and volume. For OpenAI's GPT-4o, a single extraction of a short document (500 tokens) costs ~$0.002. For GPT-4o-mini, it's ~$0.00015. Batch extraction reduces per-document cost. At scale, expect $0.001–$0.01 per document for most use cases. Function calling adds no extra cost beyond token usage.
Is prompting for data extraction worth it in 2026?
Yes, for many use cases. Accuracy for explicit fields is 95–99%, and cost is low. It's ideal for automating data entry, processing invoices, extracting job postings, or parsing contracts. However, for fields requiring inference or disambiguation, accuracy drops to 70–85%, so human review may still be needed. For 99%+ accuracy on critical fields, combine LLM extraction with validation rules and manual oversight.
How do I handle missing fields in extraction?
Explicitly instruct the model to return null for any field not mentioned in the text. Use phrases like 'If not mentioned, set to null' and 'Do not infer or guess values.' In your schema, mark optional fields as 'string or null' or 'number or null'. This prevents the model from fabricating data and ensures downstream code can handle missing values gracefully.
What is the difference between JSON schema and function calling for extraction?
JSON schema in the prompt guides the model's output format but does not enforce it: the model might still return malformed JSON. Function calling (e.g., OpenAI function calling or Anthropic tool use) applies the schema at the API level, and with strict schema enforcement enabled the returned arguments must be valid JSON matching the parameters or the call fails. Function calling is more reliable for production pipelines, while JSON schema in prompts is simpler for prototyping.