Weights & Biases for LLM Fine-Tuning: Track Every Run and Compare Results

W&B provides experiment tracking for fine-tuning runs and LLM tracing via Weave, letting you compare models, trace agent calls, and manage models in a production registry.

Mahmudul Haque Qudrati

CEO & ML Engineer

April 22, 2026

8 min read

// tags

#weights-&-biases #mlops #fine-tuning #experiment-tracking #weave

Direct Logging

For custom training loops or logging additional metrics:

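import wandb

# Start a run; config records the hyperparameters so runs can be compared later
wandb.init(
    project="llm-fine-tuning",
    config={
        "model": "meta-llama/Meta-Llama-3-8B",
        "lora_r": 16,
        "dataset_size": 500,
        "learning_rate": 2e-4,
    }
)

# train_loader and train_step come from your own training code;
# train_step is expected to return the loss for one batch
for step, batch in enumerate(train_loader):
    loss = train_step(batch)
    wandb.log({"train/loss": loss, "train/step": step})

# Mark the run as complete and flush any pending data
wandb.finish()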
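
For the "additional metrics" case, one option is to derive them from the loss before logging. The sketch below is illustrative only: the helper name build_metrics, the extra metric keys, and the smoothing factor are assumptions, not part of the W&B API. It computes perplexity as exp(loss) and an exponential moving average of the loss, and returns a plain dict that could be passed to wandb.log.

```python
import math
from typing import Optional


def build_metrics(loss: float, step: int, prev_ema: Optional[float] = None,
                  alpha: float = 0.1) -> dict:
    """Return metrics derived from one training step's loss.

    Hypothetical helper: the keys and the smoothing factor are illustrative.
    """
    # An exponential moving average smooths noisy per-batch losses;
    # the first step has no history, so the EMA starts at the raw loss.
    ema = loss if prev_ema is None else alpha * loss + (1 - alpha) * prev_ema
    return {
        "train/loss": loss,
        "train/loss_ema": ema,
        # Perplexity is exp of the mean cross-entropy loss (in nats).
        "train/perplexity": math.exp(loss),
        "train/step": step,
    }


# Example with made-up loss values; in a real loop each dict
# would be passed to wandb.log(metrics).
ema = None
history = []
for step, loss in enumerate([2.0, 1.0, 0.0]):
    metrics = build_metrics(loss, step, prev_ema=ema)
    ema = metrics["train/loss_ema"]
    history.append(metrics)
```

Keeping the derivation in a small pure function means it can be unit-tested without a W&B account, and the training loop only has to call wandb.log on the result.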

W&B Weave for LLM Tracing

Weave is W&B's LLM observability layer. Importing weave and calling weave.init() is enough to instrument OpenAI and Anthropic API calls automatically:

import weave
from openai import OpenAI

weave.init("my-llm-app")  # that's it: OpenAI calls are now traced

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}]
)

Every call is logged with model, input, output, latency, token counts, and cost. For agent workflows, you see the full call tree showing how sub-calls compose.
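
To make the call tree concrete, here is a minimal sketch of nested functions decorated with weave.op. The three function names and the stubbed retrieval step are hypothetical, and no model is actually called. The try/except fallback is only there so the sketch also runs where Weave is not installed; it is not something Weave requires.

```python
try:
    import weave
    op = weave.op
except ImportError:
    # Fallback so the sketch still runs without Weave installed:
    # a decorator factory that returns the function unchanged.
    def op():
        def decorator(fn):
            return fn
        return decorator


@op()
def retrieve_context(question: str) -> str:
    # Stand-in for a retrieval step (hypothetical; no real lookup happens).
    return f"notes about: {question}"


@op()
def build_prompt(question: str, context: str) -> str:
    return f"Context: {context}\nQuestion: {question}"


@op()
def answer(question: str) -> str:
    # Ops called from inside another op show up as child calls
    # in the trace when tracing is active.
    context = retrieve_context(question)
    prompt = build_prompt(question, context)
    # A real agent would send `prompt` to a model here; the sketch returns it.
    return prompt


# To record traces, call weave.init("my-llm-app") before invoking answer().
result = answer("What is LoRA?")
```

With tracing enabled, answer would appear as the parent call, with retrieve_context and build_prompt nested beneath it.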

Comparing Fine-Tuned vs Base Model

Weave's Evaluation feature lets you run both models on the same dataset and compare outputs side-by-side with automatic metrics:

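import asyncio
import weave

weave.init("my-llm-app")

# Argument names of each predict op must match the dataset's column names
@weave.op()
def base_model_predict(prompt: str) -> str:
    # call the base model and return the output
    ...

@weave.op()
def finetuned_model_predict(prompt: str) -> str:
    # call the fine-tuned model and return the output
    ...

# test_cases is a list of dicts with a "prompt" key; the scorers are defined elsewhere
evaluation = weave.Evaluation(
    dataset=test_cases,
    scorers=[accuracy_scorer, format_scorer],
)
asyncio.run(evaluation.evaluate(base_model_predict))
asyncio.run(evaluation.evaluate(finetuned_model_predict))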
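
The snippet above relies on test_cases, accuracy_scorer, and format_scorer without defining them. One possible shape is sketched below, under several assumptions: the task is a made-up ticket classifier that answers with a JSON object, the dataset columns are named "prompt" and "expected", and the scorers receive the model's return value in a parameter named output (the convention in recent Weave releases; check the version you use). The functions are plain Python here; for Weave they would typically also be decorated with @weave.op().

```python
import json

# Hypothetical evaluation rows: "prompt" feeds the predict function,
# "expected" is the reference label used by the scorers.
test_cases = [
    {"prompt": "Classify: 'My card was charged twice'", "expected": "billing"},
    {"prompt": "Classify: 'The app crashes on login'", "expected": "technical"},
]


def accuracy_scorer(expected: str, output: str) -> dict:
    """Exact-match check on the 'category' field of a JSON answer."""
    try:
        predicted = json.loads(output).get("category")
    except (json.JSONDecodeError, AttributeError):
        # Not JSON, or JSON that is not an object: count as incorrect.
        predicted = None
    return {"correct": predicted == expected}


def format_scorer(output: str) -> dict:
    """Check that the answer is a JSON object with a 'category' key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return {"valid_format": False}
    return {"valid_format": isinstance(parsed, dict) and "category" in parsed}
```

Separating correctness from format makes it easier to see whether a fine-tuned model improved because it learned the task or only because it learned the output schema.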

Model Registry

After fine-tuning, register your model artifact for production:

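import wandb

run = wandb.init(project="llm-fine-tuning", job_type="register-model")
artifact = wandb.Artifact("llama3-customer-support", type="model")
artifact.add_dir("./outputs/final_model")
run.log_artifact(artifact)
artifact.wait()  # block until the upload has finished

# Link to registry for deployment tracking
artifact.link("my-org/model-registry/llama3-customer-support", aliases=["v2", "production"])
run.finish()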
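
On the consuming side, a deployment job can resolve the "production" alias instead of hard-coding a version. The following is a sketch under assumptions: the helper names are hypothetical, the reference format reuses the registry path from the example above with an ":alias" suffix, and the "inference" job type is just an illustrative label. The download function needs W&B credentials and network access, so it is defined but not called here.

```python
def registry_ref(entity: str, collection: str, alias: str = "production") -> str:
    """Build a reference like 'my-org/model-registry/<collection>:<alias>'."""
    return f"{entity}/model-registry/{collection}:{alias}"


def download_registered_model(ref: str, project: str = "llm-fine-tuning") -> str:
    """Download the model files behind `ref` and return the local directory."""
    import wandb  # imported lazily so registry_ref works without wandb installed

    # Using the artifact inside a run records which job consumed which version.
    with wandb.init(project=project, job_type="inference") as run:
        artifact = run.use_artifact(ref, type="model")
        return artifact.download()


ref = registry_ref("my-org", "llama3-customer-support")
# download_registered_model(ref) would fetch the files into a local directory.
```

Because the alias is resolved at download time, promoting a new version only requires moving the "production" alias in the registry.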
