PromptFoo is the most practical open source tool for testing LLM prompts and comparing models because it requires almost no code to get started and produces comparison reports that make tradeoffs immediately visible. You define your prompts and test cases in a YAML file, run a single command, and get a web-based report showing how each prompt-model combination performed. For teams that change prompts frequently and need to know whether each change is an improvement, PromptFoo is the fastest way to build that process.
What PromptFoo Does
PromptFoo runs your test cases against multiple prompts and models simultaneously, then produces a side-by-side comparison report. The core workflow is:
- Write test cases (inputs and expected outputs or assertions) in a YAML config file
- Define multiple prompt variants and multiple models to test against
- Run promptfoo eval, which runs every test case against every prompt-model combination in parallel
- View the results in a web UI or JSON output
The result is a grid: rows are test cases, columns are prompt-model combinations, and each cell shows the actual output and whether it passed your assertions. This makes it easy to see not just which variant wins overall but where specific variants fail on specific cases.
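As a rough illustration of that grid (not PromptFoo's actual implementation), the sketch below expands prompt, provider, and test lists into one cell per combination. The function name `build_grid` and the short labels are assumptions made for this example.

```python
from itertools import product


def build_grid(prompts, providers, tests):
    """Return {test: {(prompt, provider): None}} with one empty cell per combination."""
    columns = list(product(prompts, providers))
    return {test: {column: None for column in columns} for test in tests}


# Hypothetical labels standing in for real prompts, models, and test inputs.
prompts = ["terse", "persona"]
providers = ["openai:gpt-4o-mini", "anthropic:claude-3-5-haiku-20241022"]
tests = ["return policy", "password reset", "unladen swallow"]

grid = build_grid(prompts, providers, tests)
column_count = len(next(iter(grid.values())))
cell_count = sum(len(row) for row in grid.values())
print(f"{len(grid)} rows x {column_count} columns = {cell_count} cells")
```

Two prompts, two models, and three test cases already produce twelve model calls, so the number of calls (and the cost of a run) grows multiplicatively as you add variants.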
Installation
npm install -g promptfoo
# or use without installing
npx promptfoo@latest
Initialize a new config in your project:
npx promptfoo init
This creates a promptfooconfig.yaml file.
Basic Configuration
A minimal config file that tests two prompt variants against two models:
# promptfooconfig.yaml
providers:
  - openai:gpt-4o-mini
  - anthropic:claude-3-5-haiku-20241022
prompts:
  - "Answer the following customer support question concisely: {{question}}"
  - |
    You are a helpful customer support agent for Acme Corp.
    Answer questions accurately and concisely.
    If you do not know the answer, say so.
    Question: {{question}}
tests:
  - vars:
      question: "What is your return policy?"
    assert:
      - type: contains
        value: "30 days"
      - type: llm-rubric
        value: "The response should mention a time period for returns"
  - vars:
      question: "How do I reset my password?"
    assert:
      - type: contains
        value: "settings"
      - type: javascript
        value: "output.length < 300"
  - vars:
      question: "What is the airspeed velocity of an unladen swallow?"
    assert:
      - type: llm-rubric
        value: "The response should acknowledge this is not a support question and redirect politely"
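To make the role of {{question}} concrete, here is a simplified stand-in for the variable substitution step, assuming plain placeholder replacement only. PromptFoo's real templating is Nunjucks-based and supports much more; the `render` helper below is hypothetical and exists only for illustration.

```python
import re


def render(template: str, variables: dict) -> str:
    """Replace each {{name}} placeholder with the matching value from variables."""

    def substitute(match: re.Match) -> str:
        # Raises KeyError if the test case does not define the variable.
        return str(variables[match.group(1)])

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


template = "Answer the following customer support question concisely: {{question}}"
rendered = render(template, {"question": "What is your return policy?"})
print(rendered)
```

Each entry under tests supplies one set of vars, so every prompt variant is rendered once per test case before being sent to each provider.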
Run the eval:
npx promptfoo eval
npx promptfoo view # opens web UI
Assertion Types
PromptFoo supports multiple assertion types that cover different scoring needs:
Exact assertions:
- contains: checks if output contains a substring
- not-contains: checks that output does not contain a substring
- equals: exact match
- regex: matches a regex pattern
Code assertions:
- javascript: runs a JS function on the output, returns true/false
- python: runs a Python function on the output
LLM-based assertions:
- llm-rubric: uses a judge model to evaluate output against a natural language rubric
- similar: semantic similarity to expected output using embeddings
- answer-relevance: checks if output answers the question
The combination of exact and LLM-based assertions covers most evaluation needs. Use exact assertions for structured outputs (JSON fields, code syntax) and LLM-based assertions for quality judgments.
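For checks that do not fit a one-line expression, a python assertion can live in its own file. The sketch below assumes the file-based convention of a `get_assert(output, context)` function that returns a dict with pass, score, and reason; confirm that convention against the PromptFoo docs for your version. The required "answer" field and the 300-character limit are illustrative choices, not anything PromptFoo mandates.

```python
import json
from typing import Any, Dict


def get_assert(output: str, context: Dict[str, Any]) -> Dict[str, Any]:
    """Pass only if output is a JSON object with a non-empty 'answer' under 300 chars."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError as err:
        return {"pass": False, "score": 0.0, "reason": f"not valid JSON: {err.msg}"}

    answer = data.get("answer") if isinstance(data, dict) else None
    if not isinstance(answer, str) or not answer.strip():
        return {"pass": False, "score": 0.0, "reason": "missing 'answer' string"}
    if len(answer) >= 300:
        return {"pass": False, "score": 0.5, "reason": "answer is 300 characters or longer"}
    return {"pass": True, "score": 1.0, "reason": "structured answer within length limit"}
```

Assuming the file is saved as assert_answer.py (a hypothetical name), it would be referenced from a test with type: python and value: file://assert_answer.py.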
Integrating Into CI/CD
PromptFoo outputs JSON results, which makes it easy to integrate into a CI pipeline:
# .github/workflows/prompt-eval.yml
name: Prompt Evaluation
on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'promptfooconfig.yaml'
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run PromptFoo eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          npx promptfoo@latest eval --output results.json
      - name: Check pass rate
        run: |
          # Assumes stats are nested under .results.stats; adjust the path if your PromptFoo version differs
          PASS_RATE=$(jq '.results.stats | .successes / (.successes + .failures)' results.json)
          if (( $(echo "$PASS_RATE < 0.85" | bc -l) )); then
            echo "Pass rate $PASS_RATE below threshold 0.85"
            exit 1
          fi
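The job fails whenever a pull request that modifies prompts drops the pass rate below 85%. If the check is marked as required in the repository's branch protection settings, that failure blocks the merge.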
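The same gate can be written in Python if you would rather not depend on jq and bc on the runner. This is a minimal sketch: the location of the stats object in results.json is an assumption (the code looks both at the top level and under "results"), and the helper names are made up for the example.

```python
THRESHOLD = 0.85  # minimum acceptable pass rate, matching the workflow step


def pass_rate(report: dict) -> float:
    """Compute successes / (successes + failures) from a PromptFoo-style report."""
    # Assumption: stats sit either at the top level or under "results",
    # depending on the version that wrote the file.
    stats = report.get("stats") or report.get("results", {}).get("stats", {})
    successes = stats.get("successes", 0)
    failures = stats.get("failures", 0)
    total = successes + failures
    return successes / total if total else 0.0


def exit_code(report: dict, threshold: float = THRESHOLD) -> int:
    """Return 0 when the pass rate meets the threshold, 1 otherwise."""
    return 0 if pass_rate(report) >= threshold else 1


# Hypothetical report contents; in CI you would json.load() results.json instead.
sample = {"results": {"stats": {"successes": 18, "failures": 2}}}
print(f"pass rate: {pass_rate(sample):.0%}, exit code: {exit_code(sample)}")
```

In a workflow step, the script would load results.json, call `exit_code`, and pass the result to `sys.exit` so the job fails below the threshold.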
Comparing Models for Cost vs. Quality
One of PromptFoo's most useful features is showing cost per test case alongside quality metrics. When you run evals against both GPT-4o and GPT-4o-mini, the report shows:
- Pass rate for each model
- Average cost per test case for each model
- Latency for each model
This lets you make data-driven decisions about whether the quality improvement from a more expensive model justifies the cost. If GPT-4o passes 94% of test cases and GPT-4o-mini passes 91%, but GPT-4o-mini is 15x cheaper, GPT-4o-mini is the better choice for most applications, unless the cases it fails are the ones your product cannot afford to get wrong.
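The tradeoff in that example can be worked through with a few lines of arithmetic. In the sketch below, only the pass rates and the 15x price ratio come from the text; the unit cost of 1.0 is a placeholder, and `cost_per_pass` is a hypothetical helper.

```python
def cost_per_pass(pass_rate: float, cost_per_case: float) -> float:
    """Average spend needed to obtain one passing test case."""
    return cost_per_case / pass_rate


# Placeholder unit costs: only the 15x ratio is taken from the example.
mini_cost = 1.0
large_cost = 15.0 * mini_cost

large = cost_per_pass(0.94, large_cost)
mini = cost_per_pass(0.91, mini_cost)
ratio = large / mini

# Extra spend per additional passing case when upgrading to the larger model.
marginal = (large_cost - mini_cost) / (0.94 - 0.91)

print(f"cost per passing case: {ratio:.1f}x higher for the larger model")
print(f"extra cost per additional pass: {marginal:.0f} cheaper-model calls")
```

Under these assumptions the larger model costs about 14.5 times as much per passing case, and each additional pass it buys costs roughly as much as several hundred calls to the cheaper model.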
Red Teaming With PromptFoo
PromptFoo has a built-in red team generator:
npx promptfoo redteam init
npx promptfoo redteam run
This automatically generates adversarial test cases covering prompt injection, jailbreaks, and policy violations, then runs them against your configured prompts and models. It is a fast way to get broad adversarial coverage without manually writing hundreds of test cases.