DeepEval: Write Unit Tests for LLMs Like You Write Tests for Code

DeepEval integrates with pytest to give LLM responses the same test coverage discipline as regular code - hallucination checks, bias detection, and CI-gated quality gates.

Mahmudul Haque Qudrati

CEO & ML Engineer

April 5, 2026

7 min read

// tags

#deepeval#llm-testing#pytest#hallucination#ci/cd

FIG. ART-25

7 min read

“

DeepEval: Write Unit Tests for LLMs Like You Write Tests for Code

// reading plan

sections

370

words

min read

// Prompt Engineering

Prompt Versioning and Evaluation in CI/CD Pipelines: A Practical Guide

Treating prompts as code: how to track prompt changes, version them in git, and run automated regression tests on code changes.

10 min read

// AI Evaluation

SWE-Bench: The Gold Standard for Evaluating LLM Software Engineering

Writing Your First LLM Test

import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, HallucinationMetric

def test_rag_answer_quality():
    test_case = LLMTestCase(
        input="What causes hallucinations in LLMs?",
        actual_output="LLMs hallucinate because they predict likely tokens without a fact-checking mechanism.",
        retrieval_context=[
            "Hallucinations in LLMs occur due to the probabilistic nature of token prediction...",
            "Unlike databases, LLMs have no retrieval fallback when they lack information...",
        ],
    )

    relevancy_metric = AnswerRelevancyMetric(threshold=0.8, model="gpt-4o-mini")
    hallucination_metric = HallucinationMetric(threshold=0.5, model="gpt-4o-mini")

    assert_test(test_case, [relevancy_metric, hallucination_metric])

Run with:

deepeval test run test_llm.py

This integrates seamlessly with pytest - you get pass/fail output and the full pytest ecosystem (fixtures, parametrize, markers).

G-Eval: Custom LLM-as-Judge Metric

G-Eval lets you define a custom evaluation criterion in natural language:

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

conciseness = GEval(
    name="Conciseness",
    criteria="The response answers the question without unnecessary padding or repetition.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,
)

G-Eval uses chain-of-thought prompting to produce a calibrated 0 - 1 score for any custom criterion.

Built-In Safety Metrics

from deepeval.metrics import BiasMetric, ToxicityMetric

bias = BiasMetric(threshold=0.5)
toxicity = ToxicityMetric(threshold=0.5)

These detect race/gender/political bias and toxic language in model outputs - essential for customer-facing deployments.

Red-Teaming Attacks

deepeval red-team --model gpt-4o-mini --attacks prompt-injection,jailbreak,pii-leakage

DeepEval generates adversarial inputs, runs them against your model, and reports which attacks succeeded. Use this before launching a new model version.

GitHub Actions CI Example

name: LLM Quality Gate
on: [push, pull_request]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install deepeval
      - run: deepeval test run tests/llm/
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

If any metric drops below its threshold, the CI step fails and the PR is blocked - the same discipline that prevents software regressions applied to LLM quality.

Confident AI Dashboard

Sync results to Confident AI cloud for historical metric trending, regression alerts, and team-level dashboards without running your own infrastructure.

DeepEval: Write Unit Tests for LLMs Like You Write Tests for Code

Related Articles

Prompt Versioning and Evaluation in CI/CD Pipelines: A Practical Guide

Testing LLMs Like Software

Installation

Writing Your First LLM Test

G-Eval: Custom LLM-as-Judge Metric

Built-In Safety Metrics

Red-Teaming Attacks

GitHub Actions CI Example

Confident AI Dashboard

The workspace your team
actually needs

AI & ML insights, weekly

Mahmudul Haque Qudrati

SWE-Bench: The Gold Standard for Evaluating LLM Software Engineering

Prompt Engineering for Research: What LLMs Can and Cannot Do Reliably

DeepEval: Write Unit Tests for LLMs Like You Write Tests for Code

Related Articles

Prompt Versioning and Evaluation in CI/CD Pipelines: A Practical Guide

Testing LLMs Like Software

Installation

Writing Your First LLM Test

G-Eval: Custom LLM-as-Judge Metric

Built-In Safety Metrics

Red-Teaming Attacks

GitHub Actions CI Example

Confident AI Dashboard

The workspace your teamactually needs

AI & ML insights, weekly

Mahmudul Haque Qudrati

SWE-Bench: The Gold Standard for Evaluating LLM Software Engineering

Prompt Engineering for Research: What LLMs Can and Cannot Do Reliably

The workspace your team
actually needs