MLflow 2.x for LLMs: Track Prompts, Responses, and Fine-Tune Runs

MLflow 2.x adds native LLM tracing, prompt versioning, and model registry support, bringing the experiment discipline of ML training to LLM application development.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 1, 2026

7 min read

// tags

#mlflow #experiment-tracking #llm #fine-tuning #model-registry

Automatic LLM Tracing

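Recent MLflow 2.x releases (tracing landed in 2.14) can auto-instrument the OpenAI, LangChain, and LlamaIndex libraries, so a single autolog() call is enough to start capturing traces: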

import mlflow
import openai

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("llm-experiments")

# Enable autologging: every OpenAI call made after this line is traced
mlflow.openai.autolog()

client = openai.OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain HNSW indexing"}],
)

Every call appears as a trace in the MLflow UI with input, output, model name, token counts, and latency.

Manual Logging With Runs

import mlflow

with mlflow.start_run(run_name="rag-v2-eval"):
    # Record the configuration under test
    mlflow.log_param("model", "gpt-4o-mini")
    mlflow.log_param("chunk_size", 512)
    mlflow.log_param("retrieval_k", 4)

    # evaluate_rag_pipeline() is a placeholder for your own evaluation code;
    # it is assumed to return a dict mapping metric names to floats
    scores = evaluate_rag_pipeline()

    mlflow.log_metric("faithfulness", scores["faithfulness"])
    mlflow.log_metric("answer_relevancy", scores["answer_relevancy"])
    mlflow.log_metric("latency_p95_ms", scores["p95_latency"])

Compare run metrics in the MLflow UI to see which prompt or chunking strategy performs best.
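
To make runs line up side by side in the UI, one option is to log a separate run per configuration. The sketch below assumes a hypothetical evaluate(params) callable standing in for evaluate_rag_pipeline; the helper names param_grid, pick_best, and log_sweep are illustrative and not part of MLflow. The only MLflow calls used are mlflow.start_run, mlflow.log_params, and mlflow.log_metrics.

```python
from itertools import product
from typing import Callable


def param_grid(chunk_sizes: list[int], retrieval_ks: list[int]) -> list[dict]:
    """Build one parameter dict per (chunk_size, retrieval_k) combination."""
    return [
        {"model": "gpt-4o-mini", "chunk_size": c, "retrieval_k": k}
        for c, k in product(chunk_sizes, retrieval_ks)
    ]


def pick_best(results: list[dict], metric: str = "faithfulness") -> dict:
    """Return the result whose value for `metric` is highest."""
    return max(results, key=lambda r: r["metrics"][metric])


def log_sweep(grid: list[dict], evaluate: Callable[[dict], dict]) -> list[dict]:
    """Evaluate each configuration and log it as its own MLflow run."""
    import mlflow  # imported here so the pure helpers above work without MLflow

    results = []
    for params in grid:
        run_name = f"rag-chunk{params['chunk_size']}-k{params['retrieval_k']}"
        with mlflow.start_run(run_name=run_name):
            metrics = evaluate(params)  # assumed to return {metric_name: float}
            mlflow.log_params(params)
            mlflow.log_metrics(metrics)
        results.append({"params": params, "metrics": metrics})
    return results
```

Calling log_sweep(param_grid([256, 512], [4, 8]), evaluate) would create four runs; pick_best on the returned list gives a quick programmatic answer before opening the UI.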

MLflow Tracing for LLM Chains

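Decorate any function with mlflow.trace to record each call as a span. Calls made from inside a decorated function are recorded as child spans: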

import mlflow

@mlflow.trace(span_type="CHAIN")
def run_rag(question: str) -> str:
    # Parent span: the two calls below are recorded as child spans
    docs = retrieve_docs(question)
    return generate_answer(docs, question)

@mlflow.trace(span_type="RETRIEVER")
def retrieve_docs(query: str) -> list[str]:
    results: list[str] = []  # placeholder: replace with your vector search
    return results

@mlflow.trace(span_type="LLM")
def generate_answer(docs: list[str], question: str) -> str:
    answer = ""  # placeholder: replace with your LLM call
    return answer

Model Registry for Fine-Tuned Models

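Log a fine-tuned checkpoint directory as run artifacts, then register it as a model version:

with mlflow.start_run() as run:
    # log_artifacts (plural) uploads the directory's contents under "model";
    # log_artifact would nest the directory one level deeper
    mlflow.log_artifacts("./fine-tuned-llama-3.1-8b/", artifact_path="model")
    mlflow.register_model(
        model_uri=f"runs:/{run.info.run_id}/model",
        name="llama-3.1-8b-customer-support",
    )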

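Promote versions through Staging → Production using the UI or API. Model stages are deprecated as of MLflow 2.9 in favor of aliases, so treat this call as the legacy path: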

client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage(
    name="llama-3.1-8b-customer-support",
    version=3,
    stage="Production",
)
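
On MLflow versions where stages are deprecated, the same promotion can be expressed with a model alias. This is a minimal sketch, not the article's method: promote_with_alias and the alias name "champion" are assumptions chosen for illustration, while set_registered_model_alias and the models:/name@alias URI form are MLflow's.

```python
def promote_with_alias(client, name: str, version: int, alias: str = "champion") -> str:
    """Point `alias` at `version` and return the URI that resolves to it.

    `client` is expected to be an mlflow.MlflowClient (or anything exposing
    set_registered_model_alias with the same arguments).
    """
    # Reassigns the alias if it already points at another version
    client.set_registered_model_alias(name, alias, str(version))
    return f"models:/{name}@{alias}"


# Usage with a real client (requires a reachable tracking server):
#   import mlflow
#   uri = promote_with_alias(mlflow.MlflowClient(), "llama-3.1-8b-customer-support", 3)
```

Serving code can then load the model by the returned URI, so promoting a new version means moving the alias rather than editing deployment config.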

LangChain Integration

pip install mlflow langchain langchain-openai

import mlflow

mlflow.langchain.autolog()
# All LangChain chain and agent calls made after this point are traced automatically

Full documentation is available at mlflow.org/docs/latest/llms and in the MLflow GitHub repo.
