LLM Fine-Tuning in Practice: A Developer's Complete Walkthrough

Dataset preparation, LoRA hyperparameters, Unsloth for faster training, evaluation against the base model, and avoiding catastrophic forgetting and overfitting.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 18, 2026

10 min read



Fine-tuning a large language model sounds intimidating but has become genuinely accessible with modern tooling. You do not need a cluster of H100s or a team of researchers. A single A100 or even a consumer RTX 4090 can fine-tune a 7B parameter model in hours using LoRA. This guide is a practical walkthrough of the entire process.

When to Fine-Tune vs When Not To

Fine-tuning is appropriate when you need consistent behavior the base model does not reliably provide through prompting alone. Specific use cases where fine-tuning wins:

A specific output format the model needs to follow consistently (JSON with a particular schema, a templated report format)
A particular tone or persona that should be maintained across all responses
Domain-specific knowledge that is not well-represented in the base model's training data (internal company terminology, niche technical domains)
Reducing latency and cost by replacing a large model with a fine-tuned smaller one

Fine-tuning is not the right choice for tasks that can be solved with good system prompts and few-shot examples. Try prompting first. Fine-tuning adds engineering overhead and requires ongoing maintenance. Use it when the task genuinely warrants it.

Dataset Preparation

A common dataset format for instruction fine-tuning is JSONL (one JSON object per line). In the chat-style layout used here, each example has a system message, a user message, and an assistant response. Because the assistant response is itself a JSON string, its inner quotes must be escaped:

{"messages": [{"role": "system", "content": "You are a precise JSON extractor."}, {"role": "user", "content": "Extract the company name and founding year from: Stripe was founded in 2010 by Patrick and John Collison."}, {"role": "assistant", "content": "{\"company\": \"Stripe\", \"founding_year\": 2010}"}]}
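Writing these lines by hand is error-prone, because the assistant content is a JSON string nested inside another JSON object. A minimal sketch, assuming the examples are built in Python; the helper names make_example and write_jsonl are illustrative and not part of any library:

```python
import json


def make_example(system, user, answer):
    """Build one chat-format training example.

    `answer` is a dict. json.dumps turns it into the assistant's string
    content, and the outer json.dumps call escapes its quotes correctly.
    """
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
            {"role": "assistant", "content": json.dumps(answer)},
        ]
    }


def write_jsonl(examples, path):
    """Write one JSON object per line (the JSONL layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for ex in examples:
            f.write(json.dumps(ex, ensure_ascii=False) + "\n")


example = make_example(
    "You are a precise JSON extractor.",
    "Extract the company name and founding year from: "
    "Stripe was founded in 2010 by Patrick and John Collison.",
    {"company": "Stripe", "founding_year": 2010},
)

# One serialized training line, with the nested quotes escaped for us
line = json.dumps(example, ensure_ascii=False)
print(line)
```

Calling write_jsonl([example], "train.jsonl") would then produce a file in the format described above; the file name is only a placeholder.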

Quality over quantity. 100 carefully crafted, high-quality examples will often outperform 10,000 noisy, inconsistent ones. The model is already capable of general language understanding. You are teaching it a specific behavior, and noisy training data blurs that signal.

Guidelines for constructing examples:

Consistency — every example should follow the same format. If some examples use markdown headers and others use plain text for the same type of content, the model will learn inconsistency.

Coverage — cover the range of inputs the model will see in production. Include edge cases, ambiguous inputs, and the examples you care most about getting right.

Negative examples — include examples where the correct response is a refusal or an "I don't know." Without these, the model will hallucinate confidently rather than expressing appropriate uncertainty.

Golden examples — your 10 most important examples should be flawless. These set the quality ceiling. If any golden example is ambiguous or incorrect, fix it before anything else.
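Some of these guidelines can be checked mechanically before training. The sketch below is one possible pre-flight check, assuming the chat-style JSONL layout with a "messages" list; the function name and the specific rules (the last message must come from the assistant, no duplicate user prompts) are assumptions chosen for illustration, not a standard:

```python
import json

ALLOWED_ROLES = ("system", "user", "assistant")


def validate_jsonl_lines(lines):
    """Return (line_number, problem) tuples; an empty list means no issues found."""
    problems = []
    seen_user_prompts = set()
    for n, raw in enumerate(lines, start=1):
        raw = raw.strip()
        if not raw:
            continue  # ignore blank lines
        try:
            record = json.loads(raw)
        except json.JSONDecodeError as err:
            problems.append((n, f"invalid JSON: {err.msg}"))
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append((n, "missing or empty 'messages' list"))
            continue
        if any(not isinstance(m, dict) or m.get("role") not in ALLOWED_ROLES
               for m in messages):
            problems.append((n, "unexpected message or role"))
            continue
        if any(not isinstance(m.get("content"), str) or not m["content"].strip()
               for m in messages):
            problems.append((n, "empty or non-string content"))
            continue
        if messages[-1]["role"] != "assistant":
            problems.append((n, "last message is not from the assistant"))
        # Exact duplicate prompts add no coverage and can encourage memorization
        user_text = " ".join(m["content"] for m in messages if m["role"] == "user")
        if user_text in seen_user_prompts:
            problems.append((n, "duplicate user prompt"))
        seen_user_prompts.add(user_text)
    return problems
```

A typical use would be to pass an open file handle, for example validate_jsonl_lines(open("train.jsonl", encoding="utf-8")), and fix every reported line before training. Checks like consistent tone or correct answers still need a human read.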

LoRA: Fine-Tuning Without Full Weight Updates

Full fine-tuning updates all model weights, which means holding gradients and optimizer state for billions of parameters in GPU memory. For most teams that is prohibitively expensive. LoRA (Low-Rank Adaptation) is a parameter-efficient alternative.

Instead of updating the full weight matrix W, LoRA learns two small matrices A and B and uses W + AB as the effective weight. If W is d x k, then A is d x r and B is r x k, with the rank r much smaller than d and k, so AB has far fewer free parameters than W. During training, only A and B are updated and the original weights stay frozen. At inference, AB is either applied alongside W or merged into W once, in which case the adapter adds no extra latency.
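The equivalence between keeping the adapter separate and merging it into the weights can be checked on a toy example. This is a minimal pure-Python sketch with made-up 3 x 3 values and rank 1, using a row-vector convention (output = x W); real implementations use tensor libraries and also apply a scaling factor, which is omitted here:

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y):
    """Element-wise sum of two same-shaped matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


# Frozen base weight W (3 x 3) and a rank-1 adapter: A is 3 x 1, B is 1 x 3
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0],
     [3.0, 0.0, 1.0]]
A = [[1.0], [0.0], [2.0]]
B = [[0.5, 0.0, 1.0]]

x = [[1.0, 2.0, 3.0]]  # one input as a 1 x 3 row vector

# Option 1: merge the adapter into the weights once, then do a single matmul
merged = add(W, matmul(A, B))
y_merged = matmul(x, merged)

# Option 2: keep W frozen and add the adapter path at run time
y_separate = add(matmul(x, W), matmul(matmul(x, A), B))

print(y_merged)    # [[13.5, 2.0, 12.0]]
print(y_separate)  # [[13.5, 2.0, 12.0]]
```

Both paths give the same output, which is why a merged LoRA model runs at the same speed as the base model.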

Key LoRA hyperparameters:

Rank (r): the rank of the adapter matrices. Higher rank = more capacity = more parameters to train. Typical values: 4, 8, 16, 32, 64. Start with 16 for most tasks. Use higher rank (32-64) if the task is complex or if you have many training examples. Use lower rank (4-8) for simple tasks or very small datasets, where it trains fewer parameters, produces a smaller adapter file, and is less prone to overfitting. Once the adapter is merged into the base weights, rank has no effect on inference speed.

Alpha: a scaling factor. The LoRA update is multiplied by alpha/r, so alpha is typically set equal to r or to 2r. Common practice: set alpha = 2 * r and leave it there.

Dropout: dropout applied to the LoRA layers. 0.05 to 0.1 is typical. Helps prevent overfitting on small datasets.

Target modules: which weight matrices to apply LoRA to. For transformer models: at minimum, the query and value projection matrices. Adding key, output, and MLP layers increases quality but also parameter count. Most training frameworks have sensible defaults.
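To get a feel for how rank drives the trainable parameter count, the arithmetic can be sketched for a single weight matrix. The 4096 x 4096 shape is an assumption (a typical attention projection size in a 7B-class model), and the function name is illustrative:

```python
def lora_trainable_params(d_in, d_out, r):
    """Parameters in one adapter pair: A is d_in x r and B is r x d_out."""
    return r * (d_in + d_out)


d_in = d_out = 4096     # assumed projection size
full = d_in * d_out     # parameters in the frozen matrix W

for r in (4, 16, 64):
    lora = lora_trainable_params(d_in, d_out, r)
    print(f"r={r:>2}: {lora:>9,} trainable vs {full:,} frozen ({lora / full:.2%})")

# Effective scaling applied to the update when alpha = 2 * r
r, alpha = 16, 32
scaling = alpha / r
print(f"alpha/r = {scaling}")
```

At r = 16 the adapter for this one matrix has 131,072 parameters, under 1% of the 16,777,216 in W. The total for a model depends on how many target modules are adapted.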

Using Unsloth for Faster Training

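Unsloth is a library that speeds up LoRA fine-tuning. Its authors report roughly 2x faster training and about 60% less memory than standard implementations, achieved through custom GPU kernels and attention optimizations. Actual gains depend on the model, sequence length, and hardware.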

Installation: pip install unsloth. Loading a model with Unsloth:

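from unsloth import FastLanguageModel

# Load a pre-quantized 4-bit checkpoint together with its tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,  # longest sequence (in tokens) used during training
    load_in_4bit=True,    # keep the frozen base weights in 4-bit
)

# Attach LoRA adapters; only the adapter weights are trained
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # adapter rank
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # attention projections
    lora_alpha=32,  # alpha = 2 * r
    lora_dropout=0.05,
)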

Unsloth supports Llama 3, Mistral, Phi, Gemma, and other architectures. The 4-bit quantization (load_in_4bit=True) shrinks the memory needed for the frozen base weights by up to roughly 4x compared with 16-bit, making 7B and 13B models trainable on a single consumer GPU.
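A back-of-the-envelope estimate shows why quantization matters on a 24 GB card. This sketch counts only the base weights. It ignores activations, adapter gradients, optimizer state, and the small overhead of quantization constants, so real usage is higher:

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for the model weights alone, in gigabytes."""
    return n_params * bits_per_param / 8 / 1e9


for name, n_params in (("7B", 7_000_000_000), ("13B", 13_000_000_000)):
    fp16 = weight_memory_gb(n_params, 16)
    int4 = weight_memory_gb(n_params, 4)
    print(f"{name}: {fp16:.1f} GB at 16-bit, {int4:.1f} GB at 4-bit")
```

A 7B model drops from about 14 GB of weights at 16-bit to about 3.5 GB at 4-bit, which leaves room for the rest of the training state.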

Training Configuration

Once the model and dataset are prepared, train with the SFTTrainer from the Hugging Face TRL library:

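import torch
from trl import SFTTrainer
from transformers import TrainingArguments

# Prefer bf16 where the GPU supports it (e.g. A100, RTX 4090); otherwise use fp16
use_bf16 = torch.cuda.is_bf16_supported()

# `dataset` must expose a "text" column holding each fully formatted example.
# Some TRL releases expect dataset_text_field and max_seq_length on SFTConfig
# instead; check the version you have installed.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # effective batch size: 2 * 4 = 8
        warmup_steps=10,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=not use_bf16,
        bf16=use_bf16,
        logging_steps=10,
        output_dir="outputs",
        optim="adamw_8bit",
    ),
)
trainer.train()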

Learning rate for LoRA fine-tuning: 1e-4 to 3e-4. This is higher than typical full fine-tuning rates because you are only updating a small number of parameters. With a per-device batch size of 2 and 4 gradient accumulation steps, the effective batch size is 8 per GPU. Warmup prevents large early updates. 3 epochs is a reasonable default for most datasets of 100-1,000 examples.
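The trainer configuration reads a single "text" column, while the dataset format from the Dataset Preparation section stores a "messages" list, so a conversion step is needed in between. A minimal sketch, assuming the tokenizer is a Hugging Face tokenizer with a chat template configured (base checkpoints do not always ship one); the helper name to_text is illustrative:

```python
def to_text(example, tokenizer):
    """Render one {"messages": [...]} record into a single training string.

    Relies on the tokenizer's own chat template, so the special tokens
    match what the model will see at inference time.
    """
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return {"text": text}
```

With a Hugging Face datasets.Dataset, this could be applied as dataset = dataset.map(lambda ex: to_text(ex, tokenizer)) before the dataset is passed to the trainer.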

Evaluating Against the Base Model

Evaluation is the step most fine-tuning tutorials skip. Do not skip it.

Construct a held-out evaluation set (20-50 examples that were not in training). For each example, generate a response from both the base model and the fine-tuned model. Compare:

Task-specific metric: if you are fine-tuning for JSON extraction, parse both outputs and check whether the fine-tuned model produces valid JSON with correct fields more often.
Human evaluation: have a person rate the fine-tuned and base model responses side by side, or use a strong LLM such as GPT-4 as a judge. LLM-as-judge with a structured rubric is much faster than human evaluation and often tracks it reasonably well, but spot-check a sample of its verdicts by hand.
Instruction-following rate: what percentage of responses follow the format specified in the system prompt exactly?

A fine-tuned model should win clearly on the task you trained for. If it does not, either the dataset is insufficient or the task requires a different approach.
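For the JSON extraction case, the task-specific comparison can be sketched in a few lines. The outputs below are made-up placeholder strings, not real model generations, and the function names are illustrative. In practice the two output lists would come from running both models over the held-out set:

```python
import json


def score_json_output(raw, expected):
    """Return (is_valid_json, fields_match) for one model response."""
    try:
        parsed = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return False, False
    return True, parsed == expected


def compare_models(eval_set, base_outputs, tuned_outputs):
    """Aggregate valid-JSON and exact-match rates for both models."""
    report = {}
    for name, outputs in (("base", base_outputs), ("fine_tuned", tuned_outputs)):
        scores = [score_json_output(out, ex["expected"])
                  for out, ex in zip(outputs, eval_set)]
        n = len(scores) or 1  # avoid dividing by zero on an empty eval set
        report[name] = {
            "valid_json_rate": sum(valid for valid, _ in scores) / n,
            "exact_match_rate": sum(match for _, match in scores) / n,
        }
    return report


# Placeholder data for illustration only
eval_set = [
    {"expected": {"company": "Stripe", "founding_year": 2010}},
    {"expected": {"company": "Acme", "founding_year": 1999}},
]
base_outputs = [
    'Sure! Here is the JSON: {"company": "Stripe", "founding_year": 2010}',
    '{"company": "Acme"}',
]
tuned_outputs = [
    '{"company": "Stripe", "founding_year": 2010}',
    '{"company": "Acme", "founding_year": 1999}',
]

report = compare_models(eval_set, base_outputs, tuned_outputs)
print(report)
```

Tracking the valid-JSON rate separately from the exact-match rate helps distinguish format failures from extraction errors.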

Common Failure Modes

Catastrophic forgetting — the fine-tuned model loses general capabilities it had before fine-tuning. It can no longer follow instructions outside the training domain. This happens when training is too aggressive (too many epochs, too high a learning rate, or a dataset that is too narrow). LoRA significantly reduces catastrophic forgetting compared to full fine-tuning because the base weights are frozen. If you see this, reduce epochs and learning rate.

Overfitting on small datasets — the model memorizes the training examples rather than learning the general pattern. Signs: training loss is very low, but performance on new examples is poor and the model reproduces training examples verbatim. Fix: reduce rank, add dropout, reduce epochs, or add more data.
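One way to act on these signs is to track loss on a held-out set after each epoch and keep the checkpoint from the epoch where it was lowest. The sketch below assumes you already have a list of per-epoch evaluation losses; the function name and the patience rule are illustrative choices:

```python
def best_epoch(eval_losses, patience=1):
    """Return the 1-based epoch with the lowest eval loss.

    Scanning stops once the loss has failed to improve for `patience`
    consecutive epochs, mimicking a simple early-stopping rule.
    """
    best_idx, best_loss, bad_epochs = 0, float("inf"), 0
    for i, loss in enumerate(eval_losses):
        if loss < best_loss:
            best_idx, best_loss, bad_epochs = i, loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_idx + 1


# Made-up losses: eval loss starts rising after epoch 2, a sign of overfitting
print(best_epoch([1.20, 0.90, 0.95, 1.10]))  # 2
```

The Hugging Face transformers library ships an EarlyStoppingCallback that applies a similar rule during training, provided evaluation is enabled in the training arguments.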

Training data leaking into responses — the model starts reproducing exact text from your training examples in responses to unrelated prompts. This indicates severe overfitting. The model has memorized rather than generalized. Reduce the number of training epochs dramatically (often 1 epoch is enough for small datasets).
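Leakage of this kind can be estimated by checking how much of a response overlaps verbatim with the training set. This is a rough sketch using word n-grams and whitespace tokenization. The function names and the default n of 8 are assumptions, and short or formulaic outputs will need a different threshold:

```python
def ngrams(text, n):
    """Set of lowercase word n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def leaked_fraction(response, training_texts, n=8):
    """Fraction of the response's n-grams that appear verbatim in training data."""
    response_grams = ngrams(response, n)
    if not response_grams:
        return 0.0  # response is shorter than n words
    train_grams = set()
    for text in training_texts:
        train_grams |= ngrams(text, n)
    return len(response_grams & train_grams) / len(response_grams)


training_texts = ["the quick brown fox jumps over the lazy dog"]
print(leaked_fraction("the quick brown fox jumps high", training_texts, n=3))  # 0.75
```

A high fraction on prompts unrelated to the training data suggests memorization. Some overlap is expected when the task itself requires a fixed template.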

Mode collapse — all responses are very similar, lacking variety. The model has learned to produce the "average" training example. Introduce more variety in your training data, particularly in phrasing and structure.
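A crude check for this is to generate responses to a set of different prompts and measure how similar they are to one another. The sketch below uses character-level similarity from the standard library's difflib. The function name is illustrative and there is no universal threshold, so compare the score against the same measurement on the base model:

```python
from difflib import SequenceMatcher
from itertools import combinations


def mean_pairwise_similarity(responses):
    """Average similarity ratio (0.0 to 1.0) over all pairs of responses."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 0.0  # fewer than two responses, nothing to compare
    total = sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs)
    return total / len(pairs)


# Identical responses to different prompts score 1.0
print(mean_pairwise_similarity(["Report ready.", "Report ready.", "Report ready."]))
```

A score that is much higher for the fine-tuned model than for the base model on the same varied prompts points toward collapsed outputs.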

Keep Reading

How Large Language Models Work — the architecture behind the models you are fine-tuning
Neural Network Training Guide — the training fundamentals that apply to fine-tuning as well
RAG Implementation Guide — often a better alternative to fine-tuning for knowledge injection
