Testing LLMs Like Software
When you ship a software change, tests catch regressions. When you change a prompt or swap a model, there is typically nothing to catch quality regressions. DeepEval fixes this by turning LLM quality checks into pytest tests that run in CI exactly like unit tests.
Installation
pip install deepeval
deepeval login # Optional: connect to Confident AI dashboard
Writing Your First LLM Test
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, HallucinationMetric
def test_rag_answer_quality():
test_case = LLMTestCase(
input="What causes hallucinations in LLMs?",
actual_output="LLMs hallucinate because they predict likely tokens without a fact-checking mechanism.",
retrieval_context=[
"Hallucinations in LLMs occur due to the probabilistic nature of token prediction...",
"Unlike databases, LLMs have no retrieval fallback when they lack information...",
],
)
relevancy_metric = AnswerRelevancyMetric(threshold=0.8, model="gpt-4o-mini")
hallucination_metric = HallucinationMetric(threshold=0.5, model="gpt-4o-mini")
assert_test(test_case, [relevancy_metric, hallucination_metric])
Run with:
deepeval test run test_llm.py
This integrates seamlessly with pytest — you get pass/fail output and the full pytest ecosystem (fixtures, parametrize, markers).
G-Eval: Custom LLM-as-Judge Metric
G-Eval lets you define a custom evaluation criterion in natural language:
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams
conciseness = GEval(
name="Conciseness",
criteria="The response answers the question without unnecessary padding or repetition.",
evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
threshold=0.7,
)
G-Eval uses chain-of-thought prompting to produce a calibrated 0–1 score for any custom criterion.
Built-In Safety Metrics
from deepeval.metrics import BiasMetric, ToxicityMetric
bias = BiasMetric(threshold=0.5)
toxicity = ToxicityMetric(threshold=0.5)
These detect race/gender/political bias and toxic language in model outputs — essential for customer-facing deployments.
Red-Teaming Attacks
deepeval red-team --model gpt-4o-mini --attacks prompt-injection,jailbreak,pii-leakage
DeepEval generates adversarial inputs, runs them against your model, and reports which attacks succeeded. Use this before launching a new model version.
GitHub Actions CI Example
name: LLM Quality Gate
on: [push, pull_request]
jobs:
evaluate:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: pip install deepeval
- run: deepeval test run tests/llm/
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
If any metric drops below its threshold, the CI step fails and the PR is blocked — the same discipline that prevents software regressions applied to LLM quality.
Confident AI Dashboard
Sync results to Confident AI cloud for historical metric trending, regression alerts, and team-level dashboards without running your own infrastructure.