Fine-tuning a large language model sounds intimidating but has become genuinely accessible with modern tooling. You do not need a cluster of H100s or a team of researchers. A single A100 or even a consumer RTX 4090 can fine-tune a 7B parameter model in hours using LoRA. This guide is a practical walkthrough of the entire process.
When to Fine-Tune vs When Not To
Fine-tuning is appropriate when you need consistent behavior the base model does not reliably provide through prompting alone. Specific use cases where fine-tuning wins:
- A specific output format the model needs to follow consistently (JSON with a particular schema, a templated report format)
- A particular tone or persona that should be maintained across all responses
- Domain-specific knowledge that is not well-represented in the base model's training data (internal company terminology, niche technical domains)
- Reducing latency and cost by replacing a large model with a fine-tuned smaller one
Fine-tuning is not the right choice for tasks that can be solved with good system prompts and few-shot examples. Try prompting first. Fine-tuning adds engineering overhead and requires ongoing maintenance. Use it when the task genuinely warrants it.
Dataset Preparation
A common dataset format for instruction fine-tuning is JSONL (one JSON object per line). Each example has a system message, a user message, and an assistant response:
{"messages": [{"role": "system", "content": "You are a precise JSON extractor."}, {"role": "user", "content": "Extract the company name and founding year from: Stripe was founded in 2010 by Patrick and John Collison."}, {"role": "assistant", "content": "{\"company\": \"Stripe\", \"founding_year\": 2010}"}]}
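A malformed line can quietly break a training run, so it may help to check each line before training. The following is a minimal sketch, assuming the chat-style record layout shown above; the function name `validate_line` and the specific checks are illustrative choices, not part of any library. Unescaped quotes inside the assistant content are a typical cause of the "invalid JSON" result.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}

def validate_line(line):
    """Return a list of problems found in one JSONL line (empty list = OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["top-level value is not an object"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i} has unexpected role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str) or not msg["content"].strip():
            problems.append(f"message {i} has empty or non-string content")
    # A training example should end with the response the model is meant to learn.
    if isinstance(messages[-1], dict) and messages[-1].get("role") != "assistant":
        problems.append("last message is not from the assistant")
    return problems
```

Running this over every line of the file and printing the line numbers with a non-empty result gives a quick pre-flight check.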
Quality over quantity. 100 carefully crafted, high-quality examples will often outperform 10,000 noisy, inconsistent ones. The model is already capable of general language understanding; you are teaching it a specific behavior, and noisy training data blurs that signal.
Guidelines for constructing examples:
Consistency — every example should follow the same format. If some examples use markdown headers and others use plain text for the same type of content, the model will learn inconsistency.
Coverage — cover the range of inputs the model will see in production. Include edge cases, ambiguous inputs, and the examples you care most about getting right.
Negative examples — include examples where the correct response is a refusal or an "I don't know." Without these, the model will hallucinate confidently rather than expressing appropriate uncertainty.
Golden examples — your 10 most important examples should be flawless. These set the quality ceiling. If any golden example is ambiguous or incorrect, fix it before anything else.
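Chat-style records are usually flattened into a single string before training, and the training configuration later in this guide reads a field named text. The sketch below shows the idea under a loud assumption: the `<|role|>` markers are a made-up placeholder format, not any model's real chat template, and both function names are hypothetical.

```python
def render_chat(messages):
    """Flatten a list of {"role", "content"} dicts into one training string.

    The "<|role|>" markers are a placeholder format invented for this sketch,
    not the chat template of any real model.
    """
    parts = [f"<|{m['role']}|>\n{m['content']}" for m in messages]
    return "\n".join(parts)

def add_text_field(record):
    """Return a copy of a chat record with a rendered "text" field added."""
    return {**record, "text": render_chat(record["messages"])}

example = {
    "messages": [
        {"role": "user", "content": "hi"},
        {"role": "assistant", "content": "hello"},
    ]
}
print(add_text_field(example)["text"])
```

With a real Hugging Face tokenizer, the placeholder renderer would normally be replaced by `tokenizer.apply_chat_template(record["messages"], tokenize=False)`, so that the text matches the template the base model was trained with.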
LoRA: Fine-Tuning Without Full Weight Updates
Full fine-tuning updates all model weights, which requires storing gradients and optimizer state for billions of parameters. For most teams that is prohibitively expensive. LoRA (Low-Rank Adaptation) is a parameter-efficient alternative.
Instead of updating the full weight matrix W, LoRA learns two small matrices A and B and uses W + AB as the effective weight. The product AB has the same shape as W, but its rank is at most r (the adapter rank), which is chosen to be far smaller than the dimensions of W. During training, only A and B are updated; the original weights remain frozen. During inference, AB is added to the original weights.
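To make the W + AB idea concrete, here is a toy sketch in plain Python with a 3x3 weight and a rank-1 adapter. All values are made up for illustration, and the scaling factor is left out for simplicity. It shows that keeping the adapter separate and merging it into the weights give the same output.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def matadd(X, Y):
    """Add two matrices of the same shape element by element."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

# Frozen 3x3 base weight (toy values).
W = [[1, 0, 2],
     [0, 1, 0],
     [3, 0, 1]]

# Rank-1 adapter: A is 3x1 and B is 1x3, so AB is 3x3 but has rank 1.
A = [[1], [2], [0]]
B = [[0, 1, 1]]

x = [[1], [2], [3]]  # input as a column vector

# Adapter kept separate from the frozen weight: W x + A (B x)
separate = matadd(matmul(W, x), matmul(A, matmul(B, x)))

# Adapter merged into the weight: (W + AB) x
merged = matmul(matadd(W, matmul(A, B)), x)

print(separate == merged)  # True
```

Here A and B hold 6 values between them while W holds 9; at realistic sizes the gap is far larger.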
Key LoRA hyperparameters:
Rank (r) — the rank of the adapter matrices. Higher rank = more capacity = more parameters to train. Typical values: 4, 8, 16, 32, 64. Start with 16 for most tasks. Use higher rank (32-64) if the task is complex or if you have many training examples. Use lower rank (4-8) if the task is simple or training memory is tight. Rank has little effect on inference speed, and none once the adapter is merged into the base weights.
Alpha — a scaling factor, typically set equal to r or to 2r. Alpha/r is the effective scaling applied to the LoRA update. Common practice: set alpha = 2 * r and leave it there.
Dropout — dropout applied to the LoRA layers. 0.05 to 0.1 is typical. Helps prevent overfitting on small datasets.
Target modules — which weight matrices to apply LoRA to. For transformer models: at minimum, the query and value projection matrices. Adding key, output, and MLP layers increases quality but also parameter count. Most training frameworks have sensible defaults.
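The effect of rank and alpha can be worked out with simple arithmetic. The sketch below assumes a single square projection matrix with a hidden size of 4096, which is an illustrative figure rather than a property of any particular model; the function names are hypothetical.

```python
def lora_param_count(d_out, d_in, r):
    """Trainable values in one adapter: A is d_out x r, B is r x d_in."""
    return r * (d_out + d_in)

def lora_scale(alpha, r):
    """Effective scaling applied to the LoRA update."""
    return alpha / r

d = 4096  # hypothetical hidden size
full = d * d
for r in (4, 16, 64):
    n = lora_param_count(d, d, r)
    share = 100 * n / full
    print(f"r={r:>2}: {n:>9,} trainable values ({share:.2f}% of the full matrix)")

print("scale for alpha=32, r=16:", lora_scale(32, 16))
```

Even at r=64 the adapter for this matrix is only a few percent of the full weight, which is why LoRA fits on hardware that full fine-tuning does not. Keeping alpha at 2 * r keeps the scale at 2 regardless of the rank you pick.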
Using Unsloth for Faster Training
Unsloth is a library that speeds up LoRA fine-tuning. Its maintainers report roughly 2x faster training and about 60% less memory use than standard implementations, achieved through custom GPU kernels and attention optimizations.
Installation: pip install unsloth. Loading a model with Unsloth:
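from unsloth import FastLanguageModel

# Load a 4-bit quantized Llama 3 8B checkpoint along with its tokenizer.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters to the attention projections; the base weights stay frozen.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_alpha=32,  # alpha = 2 * r
    lora_dropout=0.05,
)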
Unsloth supports Llama 3, Mistral, Phi, Gemma, and other architectures. The 4-bit quantization (load_in_4bit=True) reduces memory by another 2-4x, making 7B and 13B models trainable on a single consumer GPU.
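A rough back-of-the-envelope estimate shows why 4-bit loading matters. This sketch counts the weights only and uses round parameter counts as assumptions; it ignores quantization metadata, activations, gradients, optimizer state, and the adapters themselves, so real usage is higher and the overall saving is smaller than the 4x seen on the weights alone.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

# Round parameter counts used purely for illustration.
for name, n_params in (("7B", 7e9), ("8B", 8e9), ("13B", 13e9)):
    fp16 = weight_memory_gb(n_params, 16)
    int4 = weight_memory_gb(n_params, 4)
    print(f"{name}: about {fp16:.1f} GB in 16-bit, about {int4:.1f} GB in 4-bit")
```

On a 24 GB consumer card, the 16-bit weights of a 13B model would not fit at all, while the 4-bit weights leave room for activations and adapter training.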
Training Configuration
Once the model and dataset are prepared, train with the SFTTrainer from Hugging Face's TRL library:
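from trl import SFTTrainer
from transformers import TrainingArguments

# `dataset` is assumed to be a Hugging Face Dataset with a "text" column.
# Argument names follow older TRL releases; newer releases may expect
# dataset_text_field and max_seq_length on an SFTConfig instead.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # effective batch size: 2 * 4 = 8
        warmup_steps=10,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=True,  # on GPUs with bfloat16 support, bf16=True is a common alternative
        logging_steps=10,
        output_dir="outputs",
        optim="adamw_8bit",
    ),
)
trainer.train()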
A typical learning rate for LoRA fine-tuning is 1e-4 to 3e-4. This is higher than typical full fine-tuning rates because you are only updating a small number of parameters. Warmup prevents large early updates. Three epochs is a reasonable starting point for most datasets of 100-1,000 examples; reduce it if the model starts memorizing the training data.
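It helps to translate the configuration above into an actual number of optimizer steps, because fixed settings such as warmup_steps=10 mean very different things on 100 examples than on 1,000. The sketch below is an approximation that assumes a single GPU and rounds up partial batches; the exact step count can differ slightly between trainer versions. The function name is hypothetical.

```python
import math

def training_schedule(n_examples, per_device_batch, grad_accum, epochs, warmup_steps):
    """Approximate step counts for a single-GPU run."""
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch": effective_batch,
        "total_steps": total_steps,
        "warmup_fraction": warmup_steps / total_steps,
    }

for n in (100, 1000):
    s = training_schedule(n, per_device_batch=2, grad_accum=4, epochs=3, warmup_steps=10)
    print(f"{n} examples: {s['total_steps']} steps, "
          f"warmup covers {100 * s['warmup_fraction']:.0f}% of training")
```

On the smallest datasets, 10 warmup steps can take up a sizable share of the whole run, so it may be worth lowering the warmup or checking that logging_steps is small enough to see the loss curve at all.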
Evaluating Against the Base Model
Evaluation is the step most fine-tuning tutorials skip. Do not skip it.
Construct a held-out evaluation set (20-50 examples that were not in training). For each example, generate a response from both the base model and the fine-tuned model. Compare:
- Task-specific metric — if you are fine-tuning for JSON extraction, parse both outputs and check if the fine-tuned model produces valid JSON with correct fields more often.
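- Human evaluation — have a person (or a strong LLM such as GPT-4 acting as a judge) rate the fine-tuned vs base model responses side by side. LLM-as-judge with a structured rubric is much faster than human evaluation and often agrees with it well, but spot-check a sample of its ratings by hand before trusting the scores.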
- Instruction following rate — what percentage of responses follow the format specified in the system prompt exactly?
A fine-tuned model should win clearly on the task you trained for. If it does not, either the dataset is insufficient or the task requires a different approach.
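For the JSON extraction case, the task-specific comparison can be as simple as the sketch below. The model outputs and expected values here are made up for illustration; in practice they would come from running both models on the held-out set. The function name is hypothetical.

```python
import json

def json_field_accuracy(outputs, expected):
    """Return (valid_rate, exact_rate) for raw model outputs against expected dicts."""
    valid = exact = 0
    for raw, want in zip(outputs, expected):
        try:
            got = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not parseable, counts as neither valid nor exact
        valid += 1
        if got == want:
            exact += 1
    n = len(expected)
    return valid / n, exact / n

# Made-up held-out examples and model outputs.
expected = [
    {"company": "Stripe", "founding_year": 2010},
    {"company": "Acme", "founding_year": 1999},
]
base_outputs = [
    "The company is Stripe, founded in 2010.",
    '{"company": "Acme", "founding_year": "1999"}',
]
tuned_outputs = [
    '{"company": "Stripe", "founding_year": 2010}',
    '{"company": "Acme", "founding_year": 1999}',
]

print("base :", json_field_accuracy(base_outputs, expected))   # (0.5, 0.0)
print("tuned:", json_field_accuracy(tuned_outputs, expected))  # (1.0, 1.0)
```

Reporting the valid rate and the exact-match rate separately shows whether a failure is a formatting problem or a content problem.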
Common Failure Modes
Catastrophic forgetting — the fine-tuned model loses general capabilities it had before fine-tuning. It can no longer follow instructions outside the training domain. This happens when training is too aggressive (too many epochs, too high a learning rate, or a dataset that is too narrow). LoRA significantly reduces catastrophic forgetting compared to full fine-tuning because the base weights are frozen. If you see this, reduce epochs and learning rate.
Overfitting on small datasets — the model memorizes the training examples rather than learning the general pattern. Signs: training loss is very low, but performance on new examples is poor and the model reproduces training examples verbatim. Fix: reduce rank, add dropout, reduce epochs, or add more data.
Training data leaking into responses — the model starts reproducing exact text from your training examples in responses to unrelated prompts. This indicates severe overfitting. The model has memorized rather than generalized. Reduce the number of training epochs dramatically (often 1 epoch is enough for small datasets).
Mode collapse — all responses are very similar, lacking variety. The model has learned to produce the "average" training example. Introduce more variety in your training data, particularly in phrasing and structure.
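Two of these failure modes can be screened for with simple text statistics. The sketch below is one possible approach, not a standard metric: the first function measures the longest run of words an output shares with a training example (a memorization signal), and the second measures how many word n-grams across a batch of outputs are unique (a low value hints at mode collapse). Function names and thresholds are left to the reader and are assumptions of this sketch.

```python
def longest_shared_run(output, training_text):
    """Length in words of the longest word sequence found in both texts."""
    a, b = output.split(), training_text.split()
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended at the previous word pair.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best

def distinct_ratio(outputs, n=2):
    """Share of word n-grams across all outputs that are unique (1.0 = no repeats)."""
    grams = []
    for text in outputs:
        words = text.split()
        grams.extend(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return len(set(grams)) / len(grams) if grams else 0.0

print(longest_shared_run("the cat sat on the mat", "a cat sat on a rug"))  # 3
print(distinct_ratio(["a b c", "a b c"]))  # 0.5
```

Long shared runs on prompts unrelated to the training example, or a distinct ratio that drops sharply compared with the base model, are both reasons to look more closely at epochs and data variety.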