Prompting for Data Extraction: Structured Output From Unstructured Text

A complete guide to extracting structured data from text with LLMs — field definitions, JSON schemas, function calling for guaranteed structure, missing field handling, batch efficiency, and accuracy limits.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 18, 2026

9 min read

// tags

#data-extraction #prompt-engineering #json-schema #function-calling

Data extraction is one of the highest-value LLM use cases — turning unstructured text into queryable, processable records. It is also one of the easiest to get wrong when the prompt is underspecified. The goal of this guide is to make your extraction prompts precise enough that the output can feed directly into a database or pipeline without manual correction.

The Core Pattern: Specify Field Names and Types

The minimum viable extraction prompt names every field you want and its expected type:

Extract the following fields from the job posting below. Return JSON only.

Fields:
- job_title (string)
- company_name (string)
- location (string, city and country)
- salary_min (number, USD, null if not mentioned)
- salary_max (number, USD, null if not mentioned)
- remote (boolean)
- required_years_experience (number, null if not mentioned)
- required_skills (array of strings)

Job posting:
[text here]

The type annotations serve two purposes: they tell the model what format to use (number, not "five years"), and they set your parsing expectations so you can write deterministic downstream code.
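Keeping the field list in code means the prompt and the parser cannot drift apart. The following is a minimal sketch of that idea, not a prescribed implementation: the `Field` class and `build_extraction_prompt` function are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass
class Field:
    name: str
    type_hint: str          # free-text annotation shown to the model, e.g. "number, USD"
    nullable: bool = False  # adds the "null if not mentioned" instruction


def build_extraction_prompt(fields: list[Field], document: str, doc_label: str = "Job posting") -> str:
    """Render a field list and a document into an extraction prompt."""
    lines = [
        f"Extract the following fields from the {doc_label.lower()} below. Return JSON only.",
        "",
        "Fields:",
    ]
    for f in fields:
        suffix = ", null if not mentioned" if f.nullable else ""
        lines.append(f"- {f.name} ({f.type_hint}{suffix})")
    lines += ["", f"{doc_label}:", document]
    return "\n".join(lines)


JOB_FIELDS = [
    Field("job_title", "string"),
    Field("salary_min", "number, USD", nullable=True),
    Field("remote", "boolean"),
]

prompt = build_extraction_prompt(JOB_FIELDS, "Senior Engineer at Acme, remote.")
print(prompt)
```

The same `JOB_FIELDS` list can later drive validation, so adding a field is a one-line change.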

Provide a JSON Schema for Complex Extractions

For more complex structures, show the model the exact shape of the output rather than a list of field descriptions. An annotated JSON template like the one below is not formal JSON Schema, but it removes ambiguity about nesting, required vs optional fields, and enumerated values:

Extract contract information from the following text. Return a JSON object matching this exact schema:

{
  "parties": [
    {
      "name": "string",
      "role": "buyer" | "seller" | "contractor" | "client",
      "jurisdiction": "string or null"
    }
  ],
  "effective_date": "ISO 8601 date string or null",
  "termination_date": "ISO 8601 date string or null",
  "payment_terms": {
    "amount": "number or null",
    "currency": "string or null",
    "schedule": "string or null"
  },
  "governing_law": "string or null"
}

Contract text:
[text here]

The explicit "string or null" notation in the schema tells the model that missing fields should be null, not omitted from the JSON. This matters because omitted fields raise key errors in downstream code that indexes them directly, while null values can be checked and handled explicitly.
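Models still omit a key now and then, so it can help to normalize the output before anything indexes into it. This is a minimal sketch under that assumption; `normalize_record` and `CONTRACT_KEYS` are hypothetical names, and the raw string stands in for a model response.

```python
import json

# Top-level keys from the contract template above.
CONTRACT_KEYS = ["parties", "effective_date", "termination_date", "payment_terms", "governing_law"]


def normalize_record(raw_json: str, expected_keys: list[str]) -> tuple[dict, list[str]]:
    """Parse model output, fill omitted keys with None, and report which were omitted."""
    record = json.loads(raw_json)
    omitted = [key for key in expected_keys if key not in record]
    for key in omitted:
        record[key] = None
    return record, omitted


# Stand-in for a response where the model dropped two keys instead of returning null.
raw = '{"parties": [], "effective_date": "2024-01-15", "governing_law": null}'
record, omitted = normalize_record(raw, CONTRACT_KEYS)
print(omitted)
```

Logging the `omitted` list per record shows whether the prompt's null instruction is being followed or needs strengthening.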

Function Calling for Guaranteed Structure

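When using APIs that support function calling (OpenAI function calling, Anthropic tool use), you can push structure enforcement to the API level rather than relying on prompt instructions alone. How strong the guarantee is depends on the provider and mode: where strict schema enforcement is available and enabled, the output is constrained to valid JSON matching your schema or the call fails, so there is no "almost valid JSON" case to handle. Without strict enforcement, the schema acts as a strong hint, and malformed or non-conforming arguments are rare but still possible.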

Example with OpenAI function calling:

{
  "name": "extract_job_posting",
  "description": "Extract structured data from a job posting",
  "parameters": {
    "type": "object",
    "properties": {
      "job_title": {"type": "string"},
      "company_name": {"type": "string"},
      "salary_min": {"type": ["number", "null"]},
      "salary_max": {"type": ["number", "null"]},
      "remote": {"type": "boolean"},
      "required_skills": {
        "type": "array",
        "items": {"type": "string"}
      }
    },
    "required": ["job_title", "company_name", "remote", "required_skills"]
  }
}

With function calling, most format problems disappear: you decode the returned arguments and run a light schema check instead of scraping JSON out of free text. Your main concern becomes the accuracy of the extracted values, not their structure.
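In practice the arguments of a tool call usually arrive as a JSON-encoded string, so a small decode-and-check step is cheap insurance. The sketch below makes no API call. It assumes the wrapper format used by OpenAI's Chat Completions `tools` parameter, uses an abbreviated copy of the definition above, and introduces `parse_arguments` as a hypothetical helper.

```python
import json

# Abbreviated version of the function definition shown above.
EXTRACT_JOB_POSTING = {
    "name": "extract_job_posting",
    "description": "Extract structured data from a job posting",
    "parameters": {
        "type": "object",
        "properties": {
            "job_title": {"type": "string"},
            "salary_min": {"type": ["number", "null"]},
            "remote": {"type": "boolean"},
        },
        "required": ["job_title", "remote"],
    },
}

# Assumption: the provider expects each definition wrapped like this.
tools = [{"type": "function", "function": EXTRACT_JOB_POSTING}]


def parse_arguments(arguments_json: str, definition: dict) -> dict:
    """Decode tool-call arguments and confirm that the required keys are present."""
    args = json.loads(arguments_json)  # raises json.JSONDecodeError on malformed JSON
    missing = [key for key in definition["parameters"]["required"] if key not in args]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return args


# Hypothetical arguments string, standing in for what a tool call would carry.
args = parse_arguments(
    '{"job_title": "Data Engineer", "salary_min": null, "remote": true}',
    EXTRACT_JOB_POSTING,
)
print(args)
```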

Handle Missing Fields Explicitly

The instruction "return null if the information is not mentioned" is essential for any field that is not always present. Without it, models tend to fill in plausible values inferred from context, which silently corrupts your data.

Specific missing-field instructions:

Rules for missing information:
- If a field is not explicitly stated in the text, set it to null
- Do NOT infer or guess values. If the salary is not mentioned, salary_min and salary_max must be null
- Do NOT extract information from context clues — only from explicit statements
- If a date is mentioned ambiguously (e.g., "next quarter"), set the date field to null and use a notes field to record the raw text: "notes": "start date: next quarter"

The notes field pattern is particularly useful for preserving information that exists in the text but does not map cleanly to your schema.
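The same rule can be enforced after the fact, as a safety net for cases where the model writes the ambiguous text straight into a date field. This is a minimal sketch with a hypothetical helper, `null_out_bad_date`, which treats anything `date.fromisoformat` rejects as ambiguous.

```python
from datetime import date


def null_out_bad_date(record: dict, field: str) -> dict:
    """If record[field] is not a parseable ISO date, null it and keep the raw text in notes."""
    value = record.get(field)
    if value is None:
        return record
    try:
        date.fromisoformat(value)
    except (TypeError, ValueError):  # TypeError covers non-string values
        note = f"{field}: {value}"
        existing = record.get("notes")
        record["notes"] = f"{existing}; {note}" if existing else note
        record[field] = None
    return record


record = null_out_bad_date({"start_date": "next quarter", "notes": None}, "start_date")
print(record)
```

The raw text survives in `notes`, so a reviewer can resolve "next quarter" against the document date later.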

Batch Extraction Efficiency

Running one LLM call per document is expensive and slow. For extraction tasks where the schema is fixed, batch multiple documents in a single call:

Extract the following fields from each of the job postings below. Return a JSON array with one object per posting, in the same order as the input.

Fields: [field list]

Posting 1:
[text]

Posting 2:
[text]

Posting 3:
[text]

Practical limits: batch size should scale inversely with document length. As a starting point, 10-15 medium-length documents per call is a reasonable upper bound. For short documents (under 200 words), 15-20 per call can work well. For long documents, use 3-5. Larger batches increase error rates because the model must track more context simultaneously, so tune the size against measured accuracy on your own data.

Always verify that the output array length matches the input count. Models occasionally merge adjacent records when documents are similar, dropping one.
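The batching and the length check can be sketched as follows. The helper names (`chunk`, `build_batch_prompt`, `parse_batch`) are hypothetical, and the model call itself is left out.

```python
import json


def chunk(items: list[str], size: int) -> list[list[str]]:
    """Split items into consecutive batches of at most `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_batch_prompt(postings: list[str], field_list: str) -> str:
    """Render one prompt covering every posting in the batch, numbered from 1."""
    header = (
        "Extract the following fields from each of the job postings below. "
        "Return a JSON array with one object per posting, in the same order as the input.\n\n"
        f"Fields: {field_list}\n"
    )
    body = "\n".join(f"\nPosting {n}:\n{text}" for n, text in enumerate(postings, start=1))
    return header + body


def parse_batch(raw_json: str, expected_count: int) -> list[dict]:
    """Decode the model's array and reject it if records were merged or dropped."""
    records = json.loads(raw_json)
    if not isinstance(records, list) or len(records) != expected_count:
        got = len(records) if isinstance(records, list) else "a non-array value"
        raise ValueError(f"expected {expected_count} records, got {got}")
    return records


batches = chunk(["posting a", "posting b", "posting c", "posting d", "posting e"], 2)
print(build_batch_prompt(batches[0], "job_title (string), remote (boolean)"))
```

When `parse_batch` raises, one fallback is to retry that batch one document at a time, which also shows which records were merged.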

Nested and Relational Extraction

Some documents contain repeated structures — a contract with multiple parties, an invoice with multiple line items, a resume with multiple jobs. Handle these with array fields in your schema:

Extract all work experience entries from this resume. Return a JSON array where each entry is:

{
  "company": "string",
  "title": "string",
  "start_date": "YYYY-MM or null",
  "end_date": "YYYY-MM or null, use null for current positions",
  "description": "string, 1-2 sentence summary of responsibilities"
}

Return the array in reverse chronological order (most recent first).

Resume:
[text]
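Array outputs are worth checking per entry and as a whole. The following sketch assumes the entry shape from the prompt above and uses a hypothetical helper, `check_entries`, to verify the YYYY-MM format and the requested ordering.

```python
import re

YEAR_MONTH = re.compile(r"\d{4}-(0[1-9]|1[0-2])")


def check_entries(entries: list[dict]) -> list[str]:
    """Return a list of problems found in work-experience entries (empty if none)."""
    problems = []
    for i, entry in enumerate(entries):
        for key in ("start_date", "end_date"):
            value = entry.get(key)
            if value is None:
                continue
            if not (isinstance(value, str) and YEAR_MONTH.fullmatch(value)):
                problems.append(f"entry {i}: {key} is not YYYY-MM or null: {value!r}")
    # YYYY-MM strings sort chronologically, so plain string comparison is enough.
    starts = [e["start_date"] for e in entries if isinstance(e.get("start_date"), str)]
    if starts != sorted(starts, reverse=True):
        problems.append("entries are not in reverse chronological order by start_date")
    return problems


entries = [
    {"company": "Acme", "title": "Lead", "start_date": "2021-03", "end_date": None},
    {"company": "Initech", "title": "Engineer", "start_date": "2018-06", "end_date": "2021-02"},
]
print(check_entries(entries))
```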

Accuracy Limits and Ambiguous References

Extraction has hard accuracy ceilings for certain types of information:

Ambiguous references: Phrases like "the company" in a document that mentions several companies, or "the date" when several dates appear, have no single correct reading. Models make reasonable guesses, but guesses are errors in an extraction pipeline.

Implicit information: Information that requires inference across multiple sentences. "The project started in Q1 and ran for 18 months" requires the model to compute an end date. This works most of the time but fails on complex date arithmetic.

Tabular data: Models extract from tables inconsistently, especially when columns are wide or values span cells. For important tables, consider extracting the raw table text separately and prompting specifically against it.

Realistic accuracy expectations (rough ranges, not benchmarks; measure on a labeled sample of your own documents):

- Explicitly stated, clearly labeled fields: 95-99% accuracy
- Fields requiring minor interpretation: 85-95%
- Fields requiring inference or disambiguation: 70-85%
- Complex date arithmetic or cross-reference resolution: 60-80%

For pipelines requiring 99%+ accuracy on any field, plan for human review on that specific field rather than relying on the model alone.
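Deciding which fields need review requires per-field numbers rather than one overall score. This is a minimal sketch of exact-match accuracy against a hand-labeled sample; `field_accuracy` is a hypothetical helper and the records are made up for illustration.

```python
from collections import defaultdict


def field_accuracy(predictions: list[dict], gold: list[dict]) -> dict[str, float]:
    """Exact-match accuracy per field over paired predicted and gold records."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must be the same length")
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for pred, expected in zip(predictions, gold):
        for field, expected_value in expected.items():
            total[field] += 1
            # A key missing from the prediction is treated like null.
            if pred.get(field) == expected_value:
                correct[field] += 1
    return {field: correct[field] / total[field] for field in total}


gold = [
    {"job_title": "Data Engineer", "salary_min": None},
    {"job_title": "ML Engineer", "salary_min": 120000},
]
predictions = [
    {"job_title": "Data Engineer", "salary_min": 90000},  # inferred a salary that was not stated
    {"job_title": "ML Engineer", "salary_min": 120000},
]
print(field_accuracy(predictions, gold))
```

Exact match is strict for free-text fields, so a looser comparison may suit fields such as descriptions.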

Post-Processing and Validation

Build validation into your extraction pipeline:

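from datetime import datetime

def validate_extraction(result: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the record passed."""
    errors = []

    # Type checks: bool is a subclass of int in Python, so reject it explicitly
    salary_min = result.get("salary_min")
    if salary_min is not None:
        if isinstance(salary_min, bool) or not isinstance(salary_min, (int, float)):
            errors.append("salary_min must be a number or null")

    # Date format: fromisoformat raises TypeError for non-strings and ValueError
    # for unparseable strings (before Python 3.11 it accepts only a subset of ISO 8601)
    effective_date = result.get("effective_date")
    if effective_date is not None:
        try:
            datetime.fromisoformat(effective_date)
        except (TypeError, ValueError):
            errors.append(f"effective_date is not valid ISO 8601: {effective_date!r}")

    # Required fields: must be present and non-empty
    for field in ["job_title", "company_name"]:
        if not result.get(field):
            errors.append(f"Required field missing: {field}")

    return errors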

Flag records with validation errors for human review rather than silently discarding or passing them through.
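One way to wire that in is a routing step that takes any validator returning a list of error strings. This is a sketch, not a fixed design: `route_records` is a hypothetical helper, and `require_title` is a stand-in validator used only to keep the example self-contained.

```python
from typing import Callable


def route_records(
    records: list[dict],
    validate: Callable[[dict], list[str]],
) -> tuple[list[dict], list[dict]]:
    """Split records into (accepted, needs_review); review items carry their errors."""
    accepted, needs_review = [], []
    for record in records:
        errors = validate(record)
        if errors:
            needs_review.append({"record": record, "errors": errors})
        else:
            accepted.append(record)
    return accepted, needs_review


def require_title(record: dict) -> list[str]:
    """Stand-in validator: flags records without a job_title."""
    return [] if record.get("job_title") else ["Required field missing: job_title"]


accepted, needs_review = route_records(
    [{"job_title": "Data Engineer"}, {"job_title": None}],
    require_title,
)
print(len(accepted), len(needs_review))
```

Storing the errors alongside the record tells the reviewer what to look at first.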

Keep Reading

Structured Output Prompting Guide — deep dive into JSON schema enforcement and function calling
Prompt Testing Methodology Guide — building a golden dataset to measure extraction accuracy
Chain-of-Thought Prompting with Examples — when to ask the model to reason before extracting
