What is PromptFoo?

PromptFoo is an open source tool for evaluating and comparing LLM prompts and models. You define test cases in a YAML config file, run a single command, and get a side-by-side comparison report showing how each prompt-model combination performs. It supports multiple assertion types, CI/CD integration, and built-in red teaming.

How does PromptFoo work?

PromptFoo works by reading a YAML configuration file that specifies the models (providers), prompt variants, and test cases with assertions. When you run `promptfoo eval`, it executes each test case against every prompt-model combination in parallel, then outputs a grid of results showing pass/fail status, cost, and latency. You can view results in a web UI or export as JSON.

What are the best practices for using PromptFoo?

Best practices include: start with 5-10 critical test cases, use a mix of exact and LLM-based assertions, run evals on every prompt change via CI, compare multiple models for cost vs. quality, and version your config file. Also, leverage the built-in red teaming for adversarial coverage.

How much does PromptFoo cost?

The PromptFoo CLI is free and open source under the MIT license. You only pay for the API calls to the LLM providers you configure (e.g., OpenAI, Anthropic), which keeps it cost-effective for teams of any size. The company behind the project also sells an enterprise offering, so check its pricing page if you need hosted or team features.

What assertion types does PromptFoo support?

PromptFoo supports exact assertions (contains, not-contains, equals, regex), code assertions (javascript, python), and LLM-based assertions (llm-rubric, similar, answer-relevance). This mix allows you to validate both structured outputs and nuanced quality criteria.

Can PromptFoo be integrated into CI/CD pipelines?

Yes, PromptFoo outputs JSON results that can be parsed in CI scripts. You can set pass/fail thresholds and block merges if the pass rate drops below a certain percentage. The tool is commonly used in GitHub Actions, GitLab CI, and other CI systems.

PromptFoo: Best Open Source LLM Prompt Evaluation Tool (2026)

PromptFoo is the most practical open source tool for testing LLM prompts and comparing models because it requires almost no code to get started and produces comparison reports that make tradeoffs immediately visible. You define your prompts and test cases in a YAML file, run a single command, and get a web-based report showing how each prompt-model combination performed. For teams that change prompts frequently and need to know whether each change is an improvement, PromptFoo is the fastest way to build that process.

What PromptFoo Does

PromptFoo runs your test cases against multiple prompts and models simultaneously, then produces a side-by-side comparison report. The core workflow is:

1. Write test cases (inputs and expected outputs or assertions) in a YAML config file.
2. Define multiple prompt variants and multiple models to test against.
3. Run `promptfoo eval`, which runs every test case against every prompt-model combination in parallel.
4. View the results in a web UI or as JSON output.

The result is a grid: rows are test cases, columns are prompt-model combinations, and each cell shows the actual output and whether it passed your assertions. This makes it easy to see not just which variant wins overall but where specific variants fail on specific cases.
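To make the grid concrete, here is a minimal Python sketch of how its size follows from the config. It is an illustration only, not PromptFoo's implementation: the `build_grid` helper and the test and prompt labels are hypothetical, and the provider IDs are borrowed from the config shown later in this article.

```python
from itertools import product

def build_grid(tests, prompts, providers):
    """Return one row per test case, each holding one column per prompt-model pair."""
    columns = list(product(prompts, providers))
    return {test: columns for test in tests}

# Hypothetical labels standing in for real test cases and prompt variants.
tests = ["return policy", "password reset", "off-topic question"]
prompts = ["short prompt", "detailed prompt"]
providers = ["openai:gpt-4o-mini", "anthropic:claude-3-5-haiku-20241022"]

grid = build_grid(tests, prompts, providers)
total_cells = sum(len(cols) for cols in grid.values())
print(f"{len(grid)} rows x {len(grid[tests[0]])} columns = {total_cells} outputs to grade")
```

Because every added prompt or model multiplies the number of cells, a small test set keeps both the report and the API bill manageable.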

Installation

npm install -g promptfoo
# or use without installing
npx promptfoo@latest

Initialize a new config in your project:

npx promptfoo init

This creates a promptfooconfig.yaml file.

Basic Configuration

A minimal config file that tests two prompt variants against two models:

# promptfooconfig.yaml
providers:
  - openai:gpt-4o-mini
  - anthropic:claude-3-5-haiku-20241022

prompts:
  - "Answer the following customer support question concisely: {{question}}"
  - |
    You are a helpful customer support agent for Acme Corp.
    Answer questions accurately and concisely.
    If you do not know the answer, say so.

    Question: {{question}}

tests:
  - vars:
      question: "What is your return policy?"
    assert:
      - type: contains
        value: "30 days"
      - type: llm-rubric
        value: "The response should mention a time period for returns"

  - vars:
      question: "How do I reset my password?"
    assert:
      - type: contains
        value: "settings"
      - type: javascript
        value: "output.length < 300"

  - vars:
      question: "What is the airspeed velocity of an unladen swallow?"
    assert:
      - type: llm-rubric
        value: "The response should acknowledge this is not a support question and redirect politely"

Run the eval:

npx promptfoo eval
npx promptfoo view  # opens web UI
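
Each test case's vars fill the `{{question}}` placeholder in every prompt before the request is sent. PromptFoo uses Nunjucks templates for this; the Python sketch below only mimics plain variable substitution to show the idea, and the `render` helper is hypothetical.

```python
import re

def render(template: str, variables: dict) -> str:
    """Replace each {{name}} placeholder with the matching value from `variables`."""
    def substitute(match: re.Match) -> str:
        return str(variables[match.group(1)])
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)

# First prompt and first test case from the config above.
template = "Answer the following customer support question concisely: {{question}}"
rendered = render(template, {"question": "What is your return policy?"})
print(rendered)
```

The rendered string is what each provider receives, so a typo in a variable name shows up as a prompt problem rather than a model problem.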

Assertion Types

PromptFoo supports multiple assertion types that cover different scoring needs:

Exact assertions:

contains — checks if output contains a substring
not-contains — checks output does not contain a substring
equals — exact match
regex — matches a regex pattern

Code assertions:

javascript — runs a JS function on the output, returns true/false
python — runs a Python function on the output

LLM-based assertions:

llm-rubric — uses a judge model to evaluate output against a natural language rubric
similar — semantic similarity to expected output using embeddings
answer-relevance — checks if output answers the question

The combination of exact and LLM-based assertions covers most evaluation needs. Use exact assertions for structured outputs (JSON fields, code syntax) and LLM-based assertions for quality judgments.
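
As an example of a code assertion, the sketch below combines the two checks from the password reset test (mentions "settings", under 300 characters) into one Python function. It assumes PromptFoo's documented convention of loading a Python file through a `file://` value and calling a `get_assert(output, context)` function that returns a pass/score/reason dict; confirm the exact signature against the docs for your version. The file name `assert_reply.py` is hypothetical.

```python
# assert_reply.py (hypothetical file name)
from typing import Any, Dict

MAX_CHARS = 300  # same limit as the javascript assertion in the config above

def get_assert(output: str, context: Dict[str, Any]) -> Dict[str, Any]:
    """Pass when the reply is short and points the user to settings."""
    short_enough = len(output) < MAX_CHARS
    mentions_settings = "settings" in output.lower()
    passed = short_enough and mentions_settings
    return {
        "pass": passed,
        "score": 1.0 if passed else 0.0,
        "reason": f"short_enough={short_enough}, mentions_settings={mentions_settings}",
    }

if __name__ == "__main__":
    # Local smoke test with a made-up model reply.
    print(get_assert("Open Settings, then choose Reset password.", {"vars": {}}))
```

Under that assumption, the test case would reference it with `type: python` and `value: file://assert_reply.py`. Returning a reason string makes failures easier to read in the report than a bare true/false.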

Integrating Into CI/CD

PromptFoo outputs JSON results, which makes it easy to integrate into a CI pipeline:

# .github/workflows/prompt-eval.yml
name: Prompt Evaluation

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'promptfooconfig.yaml'

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run PromptFoo eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          # promptfoo exits non-zero when any test fails; "|| true" defers the
          # pass/fail decision to the threshold check in the next step.
          npx promptfoo@latest eval --output results.json || true
      - name: Check pass rate
        run: |
          # Assumes the counters live under .results.stats; confirm against the
          # results.json your promptfoo version writes.
          PASS_RATE=$(jq '.results.stats | .successes / (.successes + .failures)' results.json)
          if (( $(echo "$PASS_RATE < 0.85" | bc -l) )); then
            echo "Pass rate $PASS_RATE below threshold 0.85"
            exit 1
          fi

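This fails the check on pull requests that modify prompts and drop the pass rate below 85%. To actually block the merge, mark the job as a required status check in your branch protection rules.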
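
If you would rather do the threshold check in a script than in shell, the same logic can be sketched in Python. This is a sketch under one assumption: that the results file keeps its `successes` and `failures` counters under `results.stats`. Adjust the path if your file differs. The `pass_rate` helper and the inline sample are hypothetical.

```python
import json

THRESHOLD = 0.85  # same threshold as the workflow above

def pass_rate(results: dict) -> float:
    """Compute successes / (successes + failures) from a parsed results document."""
    stats = results["results"]["stats"]  # assumed location of the counters
    graded = stats["successes"] + stats["failures"]
    if graded == 0:
        raise ValueError("no graded test cases found")
    return stats["successes"] / graded

# Inline sample standing in for json.load(open("results.json")).
sample = json.loads('{"results": {"stats": {"successes": 18, "failures": 2}}}')
rate = pass_rate(sample)
print(f"pass rate {rate:.2f}:", "ok" if rate >= THRESHOLD else "below threshold")
```

In CI you would load the real file and call `sys.exit(1)` when the rate is below the threshold. Raising on zero graded tests avoids silently passing an eval that never ran.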

Comparing Models for Cost vs. Quality

One of PromptFoo's most useful features is showing cost per test case alongside quality metrics. When you run evals against both GPT-4o and GPT-4o-mini, the report shows:

Pass rate for each model
Average cost per test case for each model
Latency for each model

This lets you make data-driven decisions about whether the quality improvement from a more expensive model justifies the cost. If GPT-4o passes 94% of test cases and GPT-4o-mini passes 91%, but GPT-4o-mini is 15x cheaper, GPT-4o-mini is usually the better choice, unless the three-point gap falls on test cases you cannot afford to fail.
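
One way to put the tradeoff on a single scale is cost per passing test. The Python sketch below uses the pass rates from the example above and illustrative cost units with a 15x ratio; the numbers are not measured prices, and the `ModelResult` class is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ModelResult:
    name: str
    pass_rate: float      # fraction of test cases passed
    cost_per_test: float  # average cost per test case, in arbitrary units

    @property
    def cost_per_pass(self) -> float:
        """Average spend needed to obtain one passing output."""
        return self.cost_per_test / self.pass_rate

# Pass rates from the example above; costs are illustrative units with a 15x ratio.
large = ModelResult("gpt-4o", pass_rate=0.94, cost_per_test=15.0)
small = ModelResult("gpt-4o-mini", pass_rate=0.91, cost_per_test=1.0)

for model in (large, small):
    print(f"{model.name}: {model.cost_per_pass:.2f} units per passing test")

ratio = large.cost_per_pass / small.cost_per_pass
print(f"gpt-4o costs about {ratio:.1f}x more per passing test")
```

The slightly higher pass rate barely moves the ratio, which is why the cheaper model wins unless the failing cases are the ones that matter most.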

Red Teaming With PromptFoo

PromptFoo has a built-in red team generator:

npx promptfoo redteam init
npx promptfoo redteam run

This automatically generates adversarial test cases covering prompt injection, jailbreaks, and policy violations, then runs them against your configured prompts and models. It is a fast way to get broad adversarial coverage without manually writing hundreds of test cases.

Best Practices for PromptFoo

To get the most out of PromptFoo, follow these best practices:

Start with a small set of critical test cases — 5-10 high-quality tests are better than 100 weak ones.
Use a mix of assertion types — combine exact checks with LLM-based rubrics for nuanced evaluation.
Run evals on every prompt change — integrate into CI to catch regressions early.
Compare multiple models — use the cost/quality report to choose the right model for your use case.
Version your config — treat promptfooconfig.yaml like code; review changes in pull requests.

Pricing and Alternatives

The PromptFoo CLI is free and open source (MIT license). You only pay for the API calls to the LLM providers you configure (e.g., OpenAI, Anthropic). The company behind the project also sells an enterprise offering for teams that need hosted or team features. For teams that want a managed platform or a different approach, alternatives include:

LangSmith — LangChain's evaluation platform, with a free tier and paid plans starting at $99/month.
Weights & Biases Prompts — integrated with W&B, free for individuals, team plans start at $50/user/month.
Hugging Face Evaluate — open source library for NLP metrics, but less focused on prompt comparison.

PromptFoo remains the best choice for teams that want full control, no vendor lock-in, and a lightweight setup.

Is PromptFoo Worth It in 2026?

Absolutely. PromptFoo has matured into a stable, well-documented tool with an active community. It is worth it because:

Zero cost — no licensing fees, only pay for LLM API usage.
Fast iteration — YAML-based config means you can add tests in seconds.
CI integration — catch regressions before they reach production.
Red teaming — built-in adversarial testing saves hours of manual work.

If you are building any LLM-powered application, PromptFoo is the fastest way to ensure your prompts are reliable and your model choices are data-driven.

Keep Reading

Building an LLM Eval From Zero — The underlying approach that PromptFoo implements.
Evals for Production LLM Apps — How PromptFoo fits into the broader production eval system.
LLM Red Teaming Guide — How to combine manual red teaming with PromptFoo's automated approach.

Mahmudul Haque Qudrati
