Why Experiment Tracking Matters for Fine-Tuning
Fine-tuning is iterative. You run with lr=2e-4, then try 5e-4. You train on 100 examples, then 500. Without tracking, you lose the record of what you tried and why something worked. Weights & Biases (W&B) records the hyperparameters, metrics, and artifacts of each run so you can compare runs later.
Integrating with HuggingFace Trainer
The simplest integration requires one line:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./outputs",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-4,
    report_to="wandb",  # this one line
    run_name="llama3-qlora-v2",
)
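With this setting, the Trainer's W&B callback logs training loss, the learning rate schedule, and system metrics (GPU utilization, memory) with no other code changes, provided the wandb package is installed and you are logged in. Validation loss appears once evaluation is enabled in TrainingArguments, and gradient norms are included in recent transformers releases.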
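The Trainer integration is also steered by a few environment variables, which is easy to miss because they do not appear in TrainingArguments. The sketch below assumes the `WANDB_PROJECT` and `WANDB_LOG_MODEL` variables read by the transformers W&B callback; the helper name `configure_wandb_env` is hypothetical and only wraps two assignments.

```python
import os


def configure_wandb_env(project: str, log_model: str = "false") -> None:
    """Set environment variables that the Trainer's W&B callback reads.

    Hypothetical helper for illustration. Call it before constructing
    the Trainer so the callback sees the values.
    """
    allowed = {"false", "end", "checkpoint"}
    if log_model not in allowed:
        raise ValueError(f"log_model must be one of {sorted(allowed)}")
    # Project the run is filed under (the callback falls back to a
    # default project name when this is unset).
    os.environ["WANDB_PROJECT"] = project
    # "end" uploads the final model as an artifact, "checkpoint"
    # uploads every saved checkpoint, "false" uploads nothing.
    os.environ["WANDB_LOG_MODEL"] = log_model


configure_wandb_env("llm-fine-tuning", log_model="end")
```

Setting these in the shell before launching the script has the same effect; doing it in code keeps the choice next to the training configuration.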
Direct Logging
For custom training loops or logging additional metrics:
import wandb

wandb.init(
    project="llm-fine-tuning",
    config={
        "model": "meta-llama/Meta-Llama-3-8B",
        "lora_r": 16,
        "dataset_size": 500,
        "learning_rate": 2e-4,
    },
)
for step, batch in enumerate(train_loader):
    # train_step is your own function; return a plain float (e.g. loss.item())
    loss = train_step(batch)
    wandb.log({"train/loss": loss}, step=step)
wandb.finish()
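Logging every step of a custom loop can produce noisy curves and many small log calls. One option, sketched here under stated assumptions, is to average the loss over a fixed window and emit one point per window. `IntervalLogger` is a hypothetical helper, not part of wandb; the logging function is injected so the class has no dependency of its own, and it is assumed to accept a metrics dict plus a `step` keyword, as `wandb.log` does.

```python
from typing import Callable, List


class IntervalLogger:
    """Accumulate scalar losses and emit their mean every `every` updates."""

    def __init__(self, log_fn: Callable[..., None], every: int = 10) -> None:
        if every < 1:
            raise ValueError("every must be at least 1")
        self.log_fn = log_fn
        self.every = every
        self._buffer: List[float] = []

    def update(self, loss: float, step: int) -> None:
        self._buffer.append(float(loss))
        if len(self._buffer) == self.every:
            mean_loss = sum(self._buffer) / len(self._buffer)
            # The point is logged at the step that closes the window.
            self.log_fn({"train/loss": mean_loss}, step=step)
            self._buffer.clear()


# Stand-in for wandb.log so the sketch runs without a W&B account.
records = []
logger = IntervalLogger(lambda data, step: records.append((step, data)), every=2)
for step, loss in enumerate([1.0, 3.0, 5.0]):
    logger.update(loss, step)
```

In a real loop you would pass `wandb.log` as `log_fn` after calling `wandb.init`. Losses left in an incomplete final window are not emitted by this sketch.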
W&B Weave for LLM Tracing
Weave is W&B's LLM observability layer. It automatically instruments OpenAI and Anthropic API calls when you add two lines:
import weave
from openai import OpenAI

weave.init("my-llm-app")  # that's it: OpenAI calls are now traced
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)
Every call is logged with model, input, output, latency, token counts, and cost. For agent workflows, you see the full call tree showing how sub-calls compose.
Comparing Fine-Tuned vs Base Model
Weave's Evaluation feature lets you run both models on the same dataset and compare their outputs side by side, scored by the scorer functions you supply:
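import asyncio
import weave

weave.init("my-llm-app")

@weave.op()
def base_model_predict(prompt: str) -> str:
    # call the base model and return its output
    ...

@weave.op()
def finetuned_model_predict(prompt: str) -> str:
    # call the fine-tuned model and return its output
    ...

# test_cases, accuracy_scorer, and format_scorer are your own dataset and scorers
evaluation = weave.Evaluation(
    dataset=test_cases,
    scorers=[accuracy_scorer, format_scorer],
)
asyncio.run(evaluation.evaluate(base_model_predict))
asyncio.run(evaluation.evaluate(finetuned_model_predict))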
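The evaluation snippet leaves `test_cases`, `accuracy_scorer`, and `format_scorer` undefined. A minimal sketch of what they might look like follows. It assumes Weave matches dataset columns and the prediction to scorer arguments by name, with the prediction arriving as `output` (older Weave releases used `model_output`). The rows and scoring rules are illustrative, not taken from any real dataset. The functions are plain Python to keep the sketch dependency-free; in practice you would likely decorate them with `@weave.op()`.

```python
from typing import Dict, List

# Hypothetical dataset: "prompt" feeds the prediction function,
# "expected" feeds the scorers.
test_cases: List[Dict[str, str]] = [
    {"prompt": "Reply with the word OK.", "expected": "OK"},
    {"prompt": "What is 2 + 2? Answer with a number.", "expected": "4"},
]


def accuracy_scorer(expected: str, output: str) -> Dict[str, bool]:
    """Exact match after trimming whitespace and ignoring case."""
    return {"correct": output.strip().lower() == expected.strip().lower()}


def format_scorer(output: str) -> Dict[str, bool]:
    """Check that the answer is a single non-empty line."""
    text = output.strip()
    return {"single_line": bool(text) and "\n" not in text}
```

Returning a dict from each scorer lets one scorer report several named metrics, which then show up as separate columns when the two evaluation runs are compared.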
Model Registry
After fine-tuning, register your model artifact for production:
import wandb

run = wandb.init(project="llm-fine-tuning", job_type="register-model")
artifact = wandb.Artifact("llama3-customer-support", type="model")
artifact.add_dir("./outputs/final_model")
run.log_artifact(artifact)
# Link to registry for deployment tracking
run.link_artifact(artifact, "my-org/model-registry/llama3-customer-support", aliases=["v2", "production"])
run.finish()