PromptFoo: The Best Open Source Tool for LLM Prompt Evaluation

PromptFoo lets you define test cases in YAML, run them against multiple models and prompt variants in parallel, and get comparison reports in minutes. Here is a complete setup guide.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 17, 2026

8 min read



PromptFoo is the most practical open source tool for testing LLM prompts and comparing models because it requires almost no code to get started and produces comparison reports that make tradeoffs immediately visible. You define your prompts and test cases in a YAML file, run a single command, and get a web-based report showing how each prompt-model combination performed. For teams that change prompts frequently and need to know whether each change is an improvement, PromptFoo is the fastest way to build that process.

What PromptFoo Does

PromptFoo runs your test cases against multiple prompts and models simultaneously, then produces a side-by-side comparison report. The core workflow is:

1. Write test cases (inputs and expected outputs or assertions) in a YAML config file
2. Define multiple prompt variants and multiple models to test against
3. Run promptfoo eval, which runs every test case against every prompt-model combination in parallel
4. View the results in a web UI or JSON output

The result is a grid: rows are test cases, columns are prompt-model combinations, and each cell shows the actual output and whether it passed your assertions. This makes it easy to see not just which variant wins overall but where specific variants fail on specific cases.
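
The grid can be pictured with a small sketch. This is only an illustration of the cross-product idea, not PromptFoo's implementation: the names fake_model, render, and build_grid are hypothetical, the model call is a canned stub, and the substring check stands in for a real assertion.

```python
from itertools import product

# Hypothetical stand-ins; none of these names come from PromptFoo itself.
prompts = [
    "Answer concisely: {{question}}",
    "You are a support agent. Question: {{question}}",
]
providers = ["model-a", "model-b"]
tests = [
    {"vars": {"question": "What is your return policy?"}, "contains": "30 days"},
    {"vars": {"question": "How do I reset my password?"}, "contains": "settings"},
]


def fake_model(provider: str, rendered_prompt: str) -> str:
    """Placeholder for a real API call; returns a canned answer."""
    if "return policy" in rendered_prompt:
        return "You can return items within 30 days."
    return "Please contact support."


def render(template: str, variables: dict) -> str:
    """Fill {{name}} placeholders using plain string replacement."""
    for name, value in variables.items():
        template = template.replace("{{" + name + "}}", value)
    return template


def build_grid() -> dict:
    """Return {(test_index, prompt_index, provider): passed} for every combination."""
    grid = {}
    for (t_idx, test), (p_idx, prompt), provider in product(
        enumerate(tests), enumerate(prompts), providers
    ):
        output = fake_model(provider, render(prompt, test["vars"]))
        grid[(t_idx, p_idx, provider)] = test["contains"] in output
    return grid


grid = build_grid()
print(len(grid))  # one cell per test x prompt x provider combination
```

Each key identifies one cell of the report, so a failing cell points at a specific test case, prompt variant, and model rather than at an aggregate score.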

Installation

npm install -g promptfoo
# or use without installing
npx promptfoo@latest

Initialize a new config in your project:

npx promptfoo init

This creates a promptfooconfig.yaml file.

Basic Configuration

A minimal config file that tests two prompt variants against two models:

# promptfooconfig.yaml
providers:
  - openai:gpt-4o-mini
  - anthropic:messages:claude-3-5-haiku-20241022

prompts:
  - "Answer the following customer support question concisely: {{question}}"
  - |
    You are a helpful customer support agent for Acme Corp.
    Answer questions accurately and concisely.
    If you do not know the answer, say so.

    Question: {{question}}

tests:
  - vars:
      question: "What is your return policy?"
    assert:
      - type: contains
        value: "30 days"
      - type: llm-rubric
        value: "The response should mention a time period for returns"

  - vars:
      question: "How do I reset my password?"
    assert:
      - type: contains
        value: "settings"
      - type: javascript
        value: "output.length < 300"

  - vars:
      question: "What is the airspeed velocity of an unladen swallow?"
    assert:
      - type: llm-rubric
        value: "The response should acknowledge this is not a support question and redirect politely"

Run the eval:

npx promptfoo eval
npx promptfoo view  # opens web UI

Assertion Types

PromptFoo supports multiple assertion types that cover different scoring needs:

Exact assertions:

contains — checks if output contains a substring
not-contains — checks output does not contain a substring
equals — exact match
regex — matches a regex pattern

Code assertions:

javascript — runs a JS function on the output, returns true/false
python — runs a Python function on the output

LLM-based assertions:

llm-rubric — uses a judge model to evaluate output against a natural language rubric
similar — semantic similarity to expected output using embeddings
answer-relevance — checks if output answers the question

The combination of exact and LLM-based assertions covers most evaluation needs. Use exact assertions for structured outputs (JSON fields, code syntax) and LLM-based assertions for quality judgments.
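
A code assertion can combine several checks in one place. The following is a minimal sketch of a Python assertion file, assuming the get_assert(output, context) entry point and the pass/score/reason result keys described in PromptFoo's Python assertion documentation; verify both against the version you run. The 300-character limit mirrors the javascript check in the config above, and the "days" pattern is an illustrative choice.

```python
import re
from typing import Any, Dict

MAX_CHARS = 300  # mirrors the javascript check `output.length < 300`


def get_assert(output: str, context: Dict[str, Any]) -> Dict[str, Any]:
    """Pass when the answer is short and mentions a time period in days.

    Assumption: PromptFoo calls this function with the model output and a
    context dict, and accepts a dict with pass/score/reason as the result.
    """
    short_enough = len(output) < MAX_CHARS
    mentions_days = re.search(r"\b\d+\s+days?\b", output) is not None
    passed = short_enough and mentions_days
    return {
        "pass": passed,
        "score": 1.0 if passed else 0.0,
        "reason": f"short_enough={short_enough}, mentions_days={mentions_days}",
    }
```

Under the same assumption, a test would reference the file from an assertion of type python with a file:// path as its value. Returning a reason string makes a failing cell in the report easier to diagnose than a bare false.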

Integrating Into CI/CD

PromptFoo outputs JSON results, which makes it easy to integrate into a CI pipeline:

# .github/workflows/prompt-eval.yml
name: Prompt Evaluation

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'promptfooconfig.yaml'

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
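      - uses: actions/checkout@v4
      - name: Run PromptFoo eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          # promptfoo eval exits non-zero when any test fails; tolerate that
          # here so the threshold check below decides whether the job fails.
          npx promptfoo@latest eval --output results.json || true
      - name: Check pass rate
        run: |
          # Stats are read from .results.stats in the JSON output; confirm the
          # layout against the promptfoo version you run.
          PASS_RATE=$(jq '.results.stats | .successes / (.successes + .failures + (.errors // 0))' results.json)
          if (( $(echo "$PASS_RATE < 0.85" | bc -l) )); then
            echo "Pass rate $PASS_RATE below threshold 0.85"
            exit 1
          fi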

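This fails the check on any pull request that modifies prompts and leaves the pass rate below 85%. With the check marked as required in the repository's branch protection rules, a failing run blocks the merge.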
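
The threshold logic in the final step can be sketched independently of the JSON layout. This is a hedged illustration, not part of PromptFoo: pass_rate and gate are hypothetical helper names, and the counts are assumed to have been read from the results file already.

```python
def pass_rate(successes: int, failures: int, errors: int = 0) -> float:
    """Fraction of test cases that passed, counting errors as non-passing."""
    total = successes + failures + errors
    if total == 0:
        raise ValueError("no test results to score")
    return successes / total


def gate(successes: int, failures: int, errors: int = 0, threshold: float = 0.85) -> int:
    """Return a process exit code: 0 lets the check pass, 1 fails it."""
    rate = pass_rate(successes, failures, errors)
    return 0 if rate >= threshold else 1


print(gate(successes=18, failures=2))  # 90% pass rate, at or above the threshold
print(gate(successes=16, failures=4))  # 80% pass rate, below the threshold
```

Counting errored cases in the denominator is a deliberate choice here: a provider outage then lowers the pass rate instead of silently shrinking the test set.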

Comparing Models for Cost vs. Quality

One of PromptFoo's most useful features is showing cost per test case alongside quality metrics. When you run evals against both GPT-4o and GPT-4o-mini, the report shows:

Pass rate for each model
Average cost per test case for each model
Latency for each model

This lets you make data-driven decisions about whether the quality improvement from a more expensive model justifies the cost. If GPT-4o passes 94% of test cases and GPT-4o-mini passes 91%, but GPT-4o-mini is 15x cheaper, the right choice for most applications is GPT-4o-mini.
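
The tradeoff in that example can be worked through as cost per passing test case. The sketch below uses the pass rates from the paragraph above and relative unit costs of 15 and 1 to represent the 15x price gap; those units are placeholders, not real prices.

```python
def cost_per_pass(cost_per_case: float, pass_rate: float) -> float:
    """Average spend needed to obtain one passing test case."""
    return cost_per_case / pass_rate


# Relative costs only: the larger model is assumed to cost 15 units per case
# and the smaller model 1 unit, matching the 15x gap in the example.
large = cost_per_pass(cost_per_case=15.0, pass_rate=0.94)
small = cost_per_pass(cost_per_case=1.0, pass_rate=0.91)

# How much more each passing case costs on the larger model.
print(round(large / small, 1))  # 14.5

# Extra failing cases per 1,000 requests if you choose the smaller model.
extra_failures = round((0.94 - 0.91) * 1000)
print(extra_failures)  # 30
```

Framed this way, the decision becomes whether avoiding roughly 30 additional failures per 1,000 requests is worth paying about 14.5 times more per passing case, which depends on what a failure costs in your application.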

Red Teaming With PromptFoo

PromptFoo has a built-in red team generator:

npx promptfoo redteam init
npx promptfoo redteam run

This automatically generates adversarial test cases covering prompt injection, jailbreaks, and policy violations, then runs them against your configured prompts and models. It is a fast way to get broad adversarial coverage without manually writing hundreds of test cases.

Keep Reading

Building an LLM Eval From Zero — The underlying approach that PromptFoo implements.
Evals for Production LLM Apps — How PromptFoo fits into the broader production eval system.
LLM Red Teaming Guide — How to combine manual red teaming with PromptFoo's automated approach.

