DSPy: Automatic Prompt Optimization for Complex LLM Pipelines

DSPy optimizes LLM prompts automatically using your data. Here is when it helps, when it does not, and a complete setup guide for a real use case.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 17, 2026

9 min read

// tags

#dspy #prompt-optimization #llm #ai-frameworks

DSPy (Declarative Self-improving Python) is a framework for building LLM-powered programs that optimizes prompts and few-shot examples automatically, rather than requiring you to hand-write them. Instead of crafting a prompt like "You are a helpful assistant that extracts named entities from text. Here are some examples:...", you define input and output signatures in Python and let DSPy find optimal prompts and examples using your data and a metric. DSPy is most valuable for complex multi-step LLM pipelines where prompt quality significantly impacts output quality and where you have labeled data to optimize against. For simple single-step applications (one LLM call with a straightforward task), the overhead of DSPy rarely justifies itself.

The Core Idea

Standard LLM development has a frustrating property: the best prompt for a task is highly sensitive to the specific model, task formulation, and examples you choose. Prompts that work well for GPT-4o can underperform on Claude or Mistral. Prompts that work on a development set sometimes degrade on production traffic. Hand-tuning prompts is time-consuming, and the result rarely transfers across models.

DSPy's approach: treat prompts as hyperparameters to optimize rather than code to write. You specify:

The signature (input fields and output fields with descriptions)
The metric (how to evaluate if an output is good)
The training data (labeled examples of good inputs and outputs)

DSPy's optimizer then searches for prompts and few-shot examples that maximize the metric on your training data.
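The "prompts as hyperparameters" idea can be sketched without DSPy at all. The snippet below is a toy illustration, not DSPy's implementation: `fake_model`, `exact_match`, and `pick_best_instruction` are hypothetical names, and the "model" is a stub so the example runs offline. It scores each candidate instruction on labeled data and keeps the best one.

```python
# Toy illustration only: fake_model is a hypothetical stand-in for an LLM call.
def fake_model(instruction, text):
    words = [w.strip(".,") for w in text.split()]
    if "entities" in instruction.lower():
        # Pretend the model follows the instruction: keep capitalized words.
        return [w for w in words if w[:1].isupper()]
    return words


def exact_match(expected, predicted):
    """Metric: 1.0 when the predicted set equals the labeled set, else 0.0."""
    return 1.0 if set(expected) == set(predicted) else 0.0


def pick_best_instruction(candidates, trainset, model, metric):
    """Score every candidate on the training data; return (score, instruction)."""
    scored = []
    for instruction in candidates:
        scores = [metric(labels, model(instruction, text)) for text, labels in trainset]
        scored.append((sum(scores) / len(scores), instruction))
    return max(scored, key=lambda pair: pair[0])


trainset = [
    ("Alice met Bob in Paris.", ["Alice", "Bob", "Paris"]),
    ("Carol flew to Tokyo.", ["Carol", "Tokyo"]),
]
candidates = ["List every word.", "List the named entities."]

best_score, best_instruction = pick_best_instruction(
    candidates, trainset, fake_model, exact_match
)
print(best_score, best_instruction)
```

The three ingredients map directly onto the list above: the function arguments play the role of the signature, `exact_match` is the metric, and `trainset` is the labeled data. A real optimizer searches a much larger space (instructions plus few-shot examples) and calls an actual model.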

Core Concepts

Signatures: Typed input/output declarations

import dspy

class ExtractEntities(dspy.Signature):
    """Extract named entities from text."""
    text: str = dspy.InputField()
    entities: list[str] = dspy.OutputField(desc="List of named entities (people, organizations, locations)")

Modules: LLM calls with signatures

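class EntityExtractor(dspy.Module):
    def __init__(self):
        super().__init__()  # initialize dspy.Module before adding sub-modules
        # dspy.Predict turns the signature into a single LLM call
        self.extractor = dspy.Predict(ExtractEntities)

    def forward(self, text):
        # Returns a prediction object with an `entities` field
        return self.extractor(text=text)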

Optimizers: Find the best prompts

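dspy.BootstrapFewShot: Builds few-shot demonstrations by running the program on training examples and keeping the ones that pass your metric (simplest, fastest)
dspy.BootstrapFewShotWithRandomSearch: Repeats that bootstrapping over several random candidate sets of demonstrations and keeps the best-scoring program (better quality, slower)
dspy.MIPROv2: Jointly optimizes instructions and few-shot demonstrations using Bayesian optimization (best quality, slowest)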
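The bootstrapping idea behind the first two optimizers can be sketched in a few lines. This is a simplified illustration under stated assumptions, not DSPy's actual code: `bootstrap_demos` and `stub_teacher` are hypothetical names, and the teacher is a stub rather than a real LLM. A teacher program is run on each training example, and its output is kept as a demonstration only when the metric accepts it.

```python
def bootstrap_demos(teacher, trainset, metric, max_demos=4):
    """Collect demonstrations from teacher outputs that pass the metric."""
    demos = []
    for text, labels in trainset:
        predicted = teacher(text)
        if metric(labels, predicted):
            # Only verified outputs become few-shot examples.
            demos.append({"text": text, "entities": predicted})
        if len(demos) >= max_demos:
            break
    return demos


# Hypothetical teacher: returns capitalized words (a real one would call an LLM).
def stub_teacher(text):
    return [w.strip(".,") for w in text.split() if w[:1].isupper()]


def sets_match(expected, predicted):
    return set(expected) == set(predicted)


trainset = [
    ("Alice met Bob in Paris.", ["Alice", "Bob", "Paris"]),
    ("The meeting is in Berlin.", ["Berlin"]),  # teacher wrongly adds "The"
    ("Carol flew to Tokyo.", ["Carol", "Tokyo"]),
]

demos = bootstrap_demos(stub_teacher, trainset, sets_match)
print([d["text"] for d in demos])
```

The second training example is rejected because the stub teacher's output does not match its label, so only verified examples end up in the prompt. A random-search variant would repeat this over shuffled training sets and keep whichever demonstration set scores best on held-out data.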

Metrics: Evaluation functions

def entity_accuracy(example, prediction, trace=None):
    expected = set(example.entities)
    predicted = set(prediction.entities)
    precision = len(expected & predicted) / len(predicted) if predicted else 0
    recall = len(expected & predicted) / len(expected) if expected else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
    return f1
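To make the arithmetic concrete, here is the same metric applied to one hand-built case. The example and prediction objects are plain `SimpleNamespace` stand-ins (an assumption made so the snippet runs without DSPy or an LLM); the metric only needs an `.entities` attribute on each.

```python
from types import SimpleNamespace


def entity_accuracy(example, prediction, trace=None):
    expected = set(example.entities)
    predicted = set(prediction.entities)
    precision = len(expected & predicted) / len(predicted) if predicted else 0
    recall = len(expected & predicted) / len(expected) if expected else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
    return f1


# Two of three predicted entities are correct, and two of three labels are found.
example = SimpleNamespace(entities=["Apple", "Tim Cook", "San Francisco"])
prediction = SimpleNamespace(entities=["Apple", "Tim Cook", "Cupertino"])

score = entity_accuracy(example, prediction)  # precision = recall = 2/3
empty_score = entity_accuracy(example, SimpleNamespace(entities=[]))
print(round(score, 3), empty_score)
```

Because the comparison uses exact string matching, a prediction of "tim cook" would count as a miss against the label "Tim Cook". Normalizing case or whitespace before comparing is a common adjustment if that is too strict for your task.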

Complete Example: Optimizing an Entity Extractor

import dspy
from dspy.teleprompt import BootstrapFewShot

# Configure the LLM
lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)

# Define signature
class ExtractEntities(dspy.Signature):
    """Extract named entities from text."""
    text: str = dspy.InputField()
    entities: list[str] = dspy.OutputField()

# Define module
class EntityExtractor(dspy.Module):
    def __init__(self):
        super().__init__()  # initialize dspy.Module before adding sub-modules
        self.extractor = dspy.Predict(ExtractEntities)

    def forward(self, text):
        return self.extractor(text=text)

# Training data (labeled examples)
trainset = [
    dspy.Example(
        text="Apple CEO Tim Cook announced the new iPhone at WWDC in San Francisco.",
        entities=["Apple", "Tim Cook", "iPhone", "WWDC", "San Francisco"]
    ).with_inputs("text"),
    # ... more examples
]

# Define metric
def entity_f1(example, pred, trace=None):
    expected = set(example.entities)
    # Treat a malformed (non-list) prediction as empty instead of raising
    predicted = set(pred.entities) if isinstance(pred.entities, list) else set()
    # Guard both sides so neither precision nor recall divides by zero
    if not expected or not predicted:
        return 0
    overlap = len(expected & predicted)
    precision = overlap / len(predicted)
    recall = overlap / len(expected)
    return 2 * precision * recall / (precision + recall) if overlap else 0

# Optimize
optimizer = BootstrapFewShot(metric=entity_f1, max_labeled_demos=4)
extractor = EntityExtractor()
optimized = optimizer.compile(extractor, trainset=trainset)

# Use the optimized module
result = optimized(text="Tesla's Elon Musk unveiled new models in Austin, Texas.")
print(result.entities)
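Before trusting an optimized program, it is worth scoring it on examples the optimizer never saw. The sketch below shows the idea with a hypothetical helper, `average_metric`, and a stub program so it runs offline; DSPy ships its own evaluation utility (`dspy.Evaluate`), which would normally be used instead. The metric follows the same `(example, pred, trace=None)` convention as above.

```python
from types import SimpleNamespace


def average_metric(program, devset, metric):
    """Mean metric score of `program` over held-out examples."""
    if not devset:
        raise ValueError("devset must not be empty")
    scores = [metric(example, program(text=example.text)) for example in devset]
    return sum(scores) / len(scores)


# Hypothetical stand-in for a compiled module: returns capitalized words.
def stub_program(text):
    entities = [w.strip(".,") for w in text.split() if w[:1].isupper()]
    return SimpleNamespace(entities=entities)


def exact_set_match(example, pred, trace=None):
    return float(set(example.entities) == set(pred.entities))


devset = [
    SimpleNamespace(text="Carol flew to Tokyo.", entities=["Carol", "Tokyo"]),
    SimpleNamespace(text="The meeting is in Berlin.", entities=["Berlin"]),
]

dev_score = average_metric(stub_program, devset, exact_set_match)
print(dev_score)
```

Running the same helper on both the unoptimized and the optimized module gives a direct before/after comparison. If the score improves on the training set but not on the held-out set, the selected demonstrations are probably overfitting.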

When DSPy Helps

Multi-step pipelines where each step depends on the previous one. A RAG system with query rewriting, retrieval, answer generation, and citation verification has four LLM steps. Prompt quality compounds across steps: errors in step 1 propagate through the pipeline. DSPy can jointly optimize all four prompts.
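A quick back-of-the-envelope calculation shows why compounding matters. The sketch assumes each step succeeds independently and that the final answer is correct only if every step is; real pipelines rarely satisfy this exactly, and the accuracy figures are illustrative, not measured.

```python
import math


def pipeline_success_rate(step_accuracies):
    """End-to-end success rate, assuming independent steps that must all succeed."""
    return math.prod(step_accuracies)


# Four steps: query rewriting, retrieval, answer generation, citation verification.
baseline = pipeline_success_rate([0.90] * 4)
improved = pipeline_success_rate([0.95] * 4)
print(f"baseline: {baseline:.4f}, improved: {improved:.4f}")
```

Under these assumptions, four steps at 90% each yield about 66% end to end, while lifting every step to 95% yields about 81%. Small per-step gains add up, which is the case for optimizing the whole pipeline rather than one prompt in isolation.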

When you are switching between LLM providers. A pipeline tuned for GPT-4o can degrade noticeably on a smaller model such as Mistral 7B. Re-running DSPy optimization against the new LLM typically takes minutes to hours and produces prompts tuned for that model, whereas manual prompt adaptation can take days.

When you have labeled data. DSPy requires training examples with ground truth labels to optimize against. If you have 50-500 labeled examples for your task, DSPy can use them. If you do not have labels, you need a different approach.

When DSPy Does Not Help

Simple single-step applications. If your application calls an LLM once to do a straightforward task, DSPy's overhead (learning the framework, setting up optimization runs, managing labeled datasets) rarely produces better results than a well-written manual prompt.

When you have no labeled data. DSPy requires labeled examples. If you do not have them, you either need to create them (expensive) or use a different optimization approach.

Real-time, latency-sensitive applications that must adapt on the fly. DSPy optimization runs take minutes to hours, so optimization has to happen offline. The optimized program is fast to run, but DSPy cannot re-optimize prompts within the latency budget of a live request.

When the task changes frequently. DSPy optimization produces prompts optimized for a fixed task. If your task definition changes often, re-running optimization frequently is impractical.

Keep Reading

LangChain vs LlamaIndex Comparison — Other frameworks for building LLM pipelines
Open Source LLM Benchmarks 2026 — The models DSPy can optimize prompts for
How Large Language Models Work — The underlying mechanics that explain why prompts matter
