PromptFoo: The Best Open Source Tool for LLM Prompt Evaluation
PromptFoo lets you define test cases in YAML, run them against multiple models and prompt variants in parallel, and get comparison reports in minutes. Here is a complete setup guide with real-world examples.
PromptFoo is the most practical open source tool for testing LLM prompts and comparing models because it requires almost no code to get started and produces comparison reports that make tradeoffs immediately visible. You define your prompts and test cases in a YAML file, run a single command, and get a web-based report showing how each prompt-model combination performed. For teams that change prompts frequently and need to know whether each change is an improvement, PromptFoo is the fastest way to build that process.
What PromptFoo Does
PromptFoo runs your test cases against multiple prompts and models simultaneously, then produces a side-by-side comparison report. The core workflow is:
Write test cases (inputs and expected outputs or assertions) in a YAML config file
Define multiple prompt variants and multiple models to test against
Run promptfoo eval — it runs every test case against every prompt-model combination in parallel
View the results in a web UI or JSON output
The result is a grid: rows are test cases, columns are prompt-model combinations, and each cell shows the actual output and whether it passed your assertions. This makes it easy to see not just which variant wins overall but where specific variants fail on specific cases.
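To make the grid concrete, here is a minimal Python sketch of how the number of runs grows. It does not call PromptFoo at all: the `build_grid` helper and the short labels are hypothetical, and they only illustrate that every test case is paired with every prompt-model combination.

```python
from itertools import product

def build_grid(prompts, providers, tests):
    """Map each test case (row) to every prompt-model pair (column)."""
    columns = list(product(prompts, providers))  # one column per prompt-model pair
    return {test: columns for test in tests}     # one row per test case

# Illustrative labels only; real prompts and tests live in the YAML config.
prompts = ["short prompt", "detailed prompt"]
providers = ["openai:gpt-4o-mini", "anthropic:claude-3-5-haiku-20241022"]
tests = ["return policy", "password reset", "off-topic question"]

grid = build_grid(prompts, providers, tests)
total_runs = sum(len(columns) for columns in grid.values())
print(total_runs)  # 3 tests x 2 prompts x 2 providers = 12
```

Because the run count is the product of all three lists, adding one more prompt variant to this example adds six model calls, which is worth keeping in mind when estimating API spend.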
Installation
npm install -g promptfoo
# or use without installing
npx promptfoo@latest
Initialize a new config in your project:
npx promptfoo init
This creates a promptfooconfig.yaml file.
Written by Mahmudul Haque Qudrati, CEO and ML Engineer at Pristren.
A minimal config file that tests two prompt variants against two models:
# promptfooconfig.yaml
providers:
  - openai:gpt-4o-mini
  - anthropic:claude-3-5-haiku-20241022
prompts:
  - "Answer the following customer support question concisely: {{question}}"
  - |
    You are a helpful customer support agent for Acme Corp.
    Answer questions accurately and concisely.
    If you do not know the answer, say so.
    Question: {{question}}
tests:
  - vars:
      question: "What is your return policy?"
    assert:
      - type: contains
        value: "30 days"
      - type: llm-rubric
        value: "The response should mention a time period for returns"
  - vars:
      question: "How do I reset my password?"
    assert:
      - type: contains
        value: "settings"
      - type: javascript
        value: "output.length < 300"
  - vars:
      question: "What is the airspeed velocity of an unladen swallow?"
    assert:
      - type: llm-rubric
        value: "The response should acknowledge this is not a support question and redirect politely"
Run the eval:
npx promptfoo eval
npx promptfoo view # opens web UI
Assertion Types
PromptFoo supports multiple assertion types that cover different scoring needs:
Exact assertions:
contains — checks if output contains a substring
not-contains — checks output does not contain a substring
equals — exact match
regex — matches a regex pattern
Code assertions:
javascript — runs a JS function on the output, returns true/false
python — runs a Python function on the output
LLM-based assertions:
llm-rubric — uses a judge model to evaluate output against a natural language rubric
similar — semantic similarity to expected output using embeddings
answer-relevance — checks if output answers the question
The combination of exact and LLM-based assertions covers most evaluation needs. Use exact assertions for structured outputs (JSON fields, code syntax) and LLM-based assertions for quality judgments.
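For checks that outgrow a one-line expression, a python assertion can live in its own file. The sketch below assumes the contract described in PromptFoo's documentation, where a file referenced from the config (for example with a `file://` value) exposes a `get_assert(output, context)` function that may return a boolean, a numeric score, or a dictionary with `pass`, `score`, and `reason`. Treat that contract, the file name, and the specific checks as assumptions to verify against the current docs.

```python
# assert_support_reply.py (hypothetical file name)
from typing import Any, Dict

MAX_CHARS = 300  # same limit as the javascript assertion in the example config

def get_assert(output: str, context: Dict[str, Any]) -> Dict[str, Any]:
    """Pass when the reply is short and mentions 'settings'.

    ASSUMPTION: PromptFoo calls this function with the model output and a
    context object, and accepts a dict with 'pass', 'score', and 'reason'.
    """
    short_enough = len(output) < MAX_CHARS
    mentions_settings = "settings" in output.lower()  # case-insensitive on purpose
    passed = short_enough and mentions_settings
    return {
        "pass": passed,
        "score": 1.0 if passed else 0.0,
        "reason": f"short_enough={short_enough}, mentions_settings={mentions_settings}",
    }
```

Returning a reason string alongside the verdict makes failures easier to read in the report than a bare true/false.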
Integrating Into CI/CD
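PromptFoo outputs JSON results, which makes it easy to integrate into a CI pipeline. A CI step can run the eval on pull requests that modify prompts, compute the pass rate from the results, and fail the build when the rate drops below a threshold. With a threshold of 85%, this blocks merges on pull requests that modify prompts and reduce the pass rate below 85%.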
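A pass-rate gate of this kind could be sketched in Python as follows. This is not the original pipeline definition. It assumes the eval results were written to a JSON file and that the file stores counts under `results.stats.successes` and `results.stats.failures`; both the key layout and the `run` helper are assumptions, so adjust them to match the file your PromptFoo version produces.

```python
import json

THRESHOLD = 0.85  # minimum acceptable pass rate, matching the 85% in the text

def pass_rate(report: dict) -> float:
    """Share of passing runs in an eval report.

    ASSUMPTION: counts live under results.stats.successes / failures.
    """
    stats = report["results"]["stats"]
    total = stats["successes"] + stats["failures"]
    return stats["successes"] / total if total else 0.0

def run(path: str, threshold: float = THRESHOLD) -> int:
    """Return a process exit code: 0 allows the merge, 1 blocks it."""
    with open(path, encoding="utf-8") as f:
        report = json.load(f)
    rate = pass_rate(report)
    print(f"pass rate: {rate:.1%} (threshold {threshold:.0%})")
    return 0 if rate >= threshold else 1
```

A CI job would call something like `sys.exit(run("results.json"))` after the eval step, so a non-zero exit code fails the check and blocks the merge.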
Comparing Models for Cost vs. Quality
One of PromptFoo's most useful features is showing cost per test case alongside quality metrics. When you run evals against both GPT-4o and GPT-4o-mini, the report shows:
Pass rate for each model
Average cost per test case for each model
Latency for each model
This lets you make data-driven decisions about whether the quality improvement from a more expensive model justifies the cost. For example, if GPT-4o passes 94% of test cases and GPT-4o-mini passes 91% at one fifteenth of the cost, GPT-4o-mini is the right choice for most applications, unless the three-point gap falls on cases you cannot afford to fail.
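The tradeoff can be checked with a little arithmetic. The sketch below uses the hypothetical figures from the example (94% versus 91% pass rate, a 15x price gap) and a made-up `cost_per_pass` helper. It expresses cost in relative units rather than real prices.

```python
def cost_per_pass(cost_per_case: float, pass_rate: float) -> float:
    """Expected spend to obtain one passing result."""
    return cost_per_case / pass_rate

# Relative units: the cheaper model costs 1 per case, the larger one 15.
large = cost_per_pass(15.0, 0.94)
small = cost_per_pass(1.0, 0.91)
ratio = large / small

print(f"{ratio:.1f}x")  # 14.5x
```

Even after charging the cheaper model for its extra failures, each passing result still costs roughly 14.5 times less, so the three-point quality gap has to matter a great deal to justify the larger model.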
Red Teaming With PromptFoo
PromptFoo has a built-in red team generator:
npx promptfoo redteam init
npx promptfoo redteam run
This automatically generates adversarial test cases covering prompt injection, jailbreaks, and policy violations, then runs them against your configured prompts and models. It is a fast way to get broad adversarial coverage without manually writing hundreds of test cases.
Best Practices for PromptFoo
To get the most out of PromptFoo, follow these best practices:
Start with a small set of critical test cases — 5-10 high-quality tests are better than 100 weak ones.
Use a mix of assertion types — combine exact checks with LLM-based rubrics for nuanced evaluation.
Run evals on every prompt change — integrate into CI to catch regressions early.
Compare multiple models — use the cost/quality report to choose the right model for your use case.
Version your config — treat promptfooconfig.yaml like code; review changes in pull requests.
Pricing and Alternatives
The PromptFoo command-line tool is free and open source (MIT license), with no usage limit. You only pay for the API calls to the LLM providers you configure (e.g., OpenAI, Anthropic). For teams that need a managed solution, alternatives include:
LangSmith — LangChain's evaluation platform, with a free tier and paid plans starting at $99/month.
Weights & Biases Prompts — integrated with W&B, free for individuals, team plans start at $50/user/month.
Hugging Face Evaluate — open source library for NLP metrics, but less focused on prompt comparison.
PromptFoo remains the best choice for teams that want full control, no vendor lock-in, and a lightweight setup.
Is PromptFoo Worth It in 2026?
Absolutely. PromptFoo has matured into a stable, well-documented tool with an active community. It is worth it because:
Zero cost — no licensing fees, only pay for LLM API usage.
Fast iteration — YAML-based config means you can add tests in seconds.
CI integration — catch regressions before they reach production.
Red teaming — built-in adversarial testing saves hours of manual work.
If you are building any LLM-powered application, PromptFoo is the fastest way to ensure your prompts are reliable and your model choices are data-driven.
Frequently Asked Questions
What is PromptFoo?
PromptFoo is an open source tool for evaluating and comparing LLM prompts and models. You define test cases in a YAML config file, run a single command, and get a side-by-side comparison report showing how each prompt-model combination performs. It supports multiple assertion types, CI/CD integration, and built-in red teaming.
How does PromptFoo work?
PromptFoo works by reading a YAML configuration file that specifies the models (providers), prompt variants, and test cases with assertions. When you run `promptfoo eval`, it executes each test case against every prompt-model combination in parallel, then outputs a grid of results showing pass/fail status, cost, and latency. You can view results in a web UI or export as JSON.
What are the best practices for using PromptFoo?
Best practices include: start with 5-10 critical test cases, use a mix of exact and LLM-based assertions, run evals on every prompt change via CI, compare multiple models for cost vs. quality, and version your config file. Also, leverage the built-in red teaming for adversarial coverage.
How much does PromptFoo cost?
The PromptFoo command-line tool is free and open source under the MIT license, with no usage limit. You only pay for the API calls to the LLM providers you configure (e.g., OpenAI, Anthropic). This makes it cost-effective for teams of any size.
Is PromptFoo worth it in 2026?
Yes, PromptFoo is worth it in 2026. It has matured into a stable, well-documented tool with an active community. It offers zero licensing cost, fast iteration via YAML config, CI integration, and built-in red teaming. For any team building LLM applications, it provides a data-driven way to ensure prompt reliability and optimal model selection.
What assertion types does PromptFoo support?
PromptFoo supports exact assertions (contains, not-contains, equals, regex), code assertions (javascript, python), and LLM-based assertions (llm-rubric, similar, answer-relevance). This mix allows you to validate both structured outputs and nuanced quality criteria.
Can PromptFoo be integrated into CI/CD pipelines?
Yes, PromptFoo outputs JSON results that can be parsed in CI scripts. You can set pass/fail thresholds and block merges if the pass rate drops below a certain percentage. The tool is commonly used in GitHub Actions, GitLab CI, and other CI systems.