QLoRA (Quantized Low-Rank Adaptation) is a fine-tuning technique that makes training large language models accessible on consumer and mid-tier hardware. It combines two ideas: 4-bit quantization (shrinking the base model's weights roughly 4x compared with 16-bit storage) and LoRA (Low-Rank Adaptation, which adds small trainable weight matrices to the frozen base model). The result: you can fine-tune Llama 3 8B on a single RTX 4090 (24GB VRAM) or Llama 3 70B on a single A100 80GB, workloads that would otherwise need several times more VRAM. Fine-tuning is the right approach when you need to change how a model behaves (its style, format, persona, or domain focus), not when you want to inject new factual knowledge (use RAG for that).
What Fine-Tuning Is Actually For
This distinction is important and frequently misunderstood:
Fine-tuning changes behavior, style, and format. It is effective for: teaching a model to always respond in JSON format, adjusting the tone to match your product's voice, improving performance on a specific structured task (SQL generation, code review), and teaching domain-specific vocabulary and conventions.
Fine-tuning does not inject durable factual knowledge. Training a model on a PDF of your company handbook does not reliably make the model "know" that handbook. The model may appear to learn the facts during training but will hallucinate inconsistently at inference time. For factual knowledge retrieval, use RAG. Fine-tune for format and behavior.
How QLoRA Works
Standard fine-tuning updates all model weights. For Llama 3 8B (8 billion parameters at FP16, 2 bytes each), that is 16GB just to store the model. Gradients and optimizer states typically add another 3-7x on top of the weights, so full fine-tuning of an 8B model requires roughly 64-128GB of VRAM before activations are counted.
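The 64-128GB range can be reproduced with simple bytes-per-parameter arithmetic. The sketch below is an estimate under stated assumptions, not a measurement: the two setups (16-bit optimizer states versus FP32 master weights with FP32 Adam states) are illustrative choices, activations are ignored, and the helper name `training_memory_gb` is made up for this example.

```python
def training_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory in decimal GB for a given per-parameter byte budget."""
    return num_params * bytes_per_param / 1e9

NUM_PARAMS = 8e9  # Llama 3 8B, rounded as in the text

# Weights alone at 16-bit precision (2 bytes per parameter).
weights_only = training_memory_gb(NUM_PARAMS, 2)

# Assumed lean setup: 16-bit weights (2) + 16-bit gradients (2)
# + two Adam moment buffers kept in 16-bit (2 + 2).
lean = training_memory_gb(NUM_PARAMS, 2 + 2 + 4)

# Assumed mixed-precision setup: 16-bit weights (2) + 16-bit gradients (2)
# + FP32 master weights (4) + two FP32 Adam moment buffers (4 + 4).
mixed = training_memory_gb(NUM_PARAMS, 2 + 2 + 4 + 8)

print(f"weights only: {weights_only:.0f} GB")
print(f"lean full fine-tune: {lean:.0f} GB")
print(f"mixed-precision full fine-tune: {mixed:.0f} GB")
```

Activation memory comes on top of these figures and grows with batch size and sequence length, so real requirements are usually higher than the estimate.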
QLoRA reduces this through two mechanisms:
4-bit quantization: The frozen base model is loaded in NF4 (4-bit NormalFloat) format, reducing Llama 3 8B from 16GB to ~5GB.
LoRA adapters: Instead of training the full model, LoRA attaches small trainable matrices (rank r, typically 8-64) to the projection layers inside each attention block. These adapters usually hold well under 1% of the full model's parameters when only the attention projections are targeted, and around 2% at rank 64 across all linear layers. Only the adapters are trained, so gradients and optimizer states need very little memory.
Combined: fine-tuning Llama 3 8B with QLoRA requires ~8-10GB VRAM at small batch sizes and moderate sequence lengths. An RTX 3080 (10GB VRAM) or RTX 4070 Ti (12GB VRAM) can handle it.
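To see why the adapters are so cheap, it helps to count their parameters. The sketch below assumes the published Llama 3 8B shapes (hidden size 4096, 32 layers, 8 key/value heads of dimension 128, so the k and v projections output 1024 features) and rank 16 on the four attention projections. The helper `lora_params` is a name made up for this example.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """A LoRA adapter on a d_in -> d_out linear layer adds A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)

# Assumed Llama 3 8B shapes.
HIDDEN = 4096
KV_DIM = 8 * 128  # 8 key/value heads of dimension 128
NUM_LAYERS = 32
RANK = 16

projections = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}

per_layer = sum(lora_params(d_in, d_out, RANK) for d_in, d_out in projections.values())
adapter_params = per_layer * NUM_LAYERS

# Raw size of the frozen weights at 4 bits (0.5 bytes) per parameter.
base_params = 8e9
nf4_weights_gb = base_params * 0.5 / 1e9

print(f"trainable adapter parameters: {adapter_params:,}")
print(f"share of base model: {adapter_params / base_params:.2%}")
print(f"raw 4-bit weight size: {nf4_weights_gb:.1f} GB")
```

With these settings the adapters come to about 13.6 million parameters, roughly 0.17% of the base model. The raw 4-bit weights are about 4GB; quantization metadata and any layers kept in higher precision plausibly account for the rest of the ~5GB figure.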
Dataset Format
For instruction fine-tuning, the standard format is instruction-response pairs in JSONL:
{"instruction": "Classify this customer message as positive, negative, or neutral.", "input": "Your product completely fixed my problem. Amazing.", "output": "positive"}
{"instruction": "Classify this customer message as positive, negative, or neutral.", "input": "Still waiting for the refund I requested 3 weeks ago.", "output": "negative"}
For conversation fine-tuning, use the chat format, with one conversation per line:
{"messages": [{"role": "system", "content": "You are a helpful assistant for a software project management tool."}, {"role": "user", "content": "How do I set up recurring tasks?"}, {"role": "assistant", "content": "To create a recurring task in Zlyqor..."}]}
Dataset size for instruction fine-tuning: 500-5,000 high-quality examples is typically sufficient for behavioral fine-tuning. More is not always better: 100 high-quality examples often outperform 10,000 low-quality ones. Generate your dataset from real examples of the exact behavior you want.
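Because quality matters more than volume, it is worth checking a JSONL file before training on it. The following is a minimal sketch that assumes the instruction/input/output schema shown earlier. The function name `load_instruction_records` and the specific checks are illustrative choices, not part of any library.

```python
import json

REQUIRED_KEYS = {"instruction", "input", "output"}

def load_instruction_records(lines):
    """Parse JSONL lines; return (valid_records, problems) where problems holds (line_no, message)."""
    records, problems = [], []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((line_no, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((line_no, "record is not a JSON object"))
            continue
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            problems.append((line_no, f"missing keys: {sorted(missing)}"))
            continue
        if not str(record["output"]).strip():
            problems.append((line_no, "empty output"))
            continue
        records.append(record)
    return records, problems

sample = [
    '{"instruction": "Classify this customer message.", "input": "Amazing.", "output": "positive"}',
    '{"instruction": "Classify this customer message.", "input": "Still waiting."}',
    "not json",
]
records, problems = load_instruction_records(sample)
print(len(records), "valid records")
for line_no, message in problems:
    print(f"line {line_no}: {message}")
```

For a real file, pass an open file handle instead of the `sample` list, for example `load_instruction_records(open("training_data.jsonl", encoding="utf-8"))`.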
Tools: Unsloth vs Axolotl
Unsloth (github.com/unslothai/unsloth, 25k+ stars) is the fastest QLoRA training library. It ships hand-optimized GPU kernels (written in Triton) for LoRA training that, according to the project, achieve 2-3x faster training and 50-60% less memory usage compared to standard HuggingFace PEFT training. Best for: getting started quickly, single-GPU training, Llama and Mistral models.
Axolotl (github.com/axolotl-ai-cloud/axolotl, 8k+ stars) is more flexible and configurable. It supports more model families and training strategies, and its multi-GPU support is stronger. Best for: production fine-tuning pipelines, complex training configurations, teams that need reproducible runs.
For a first fine-tuning project, use Unsloth. For production pipelines, evaluate Axolotl.
Complete Fine-Tuning Guide with Unsloth
Installation:
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps "trl<0.9.0" peft accelerate bitsandbytes
Training script:
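import torch
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
# Load the base model with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Meta-Llama-3-8B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)
# Add LoRA adapters to the attention projections
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank (higher = more capacity, more memory)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
)
# Load your dataset (instruction/input/output records, as in Dataset Format)
dataset = load_dataset("json", data_files="training_data.jsonl", split="train")
# SFTTrainer reads a single string column, so build a "text" field per record.
# This Alpaca-style template is one possible choice, not a requirement.
def to_text(example):
    text = (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Input:\n{example['input']}\n\n"
        f"### Response:\n{example['output']}"
    )
    # Append EOS so the model learns where a response ends
    return {"text": text + tokenizer.eos_token}
dataset = dataset.map(to_text)
# Use bf16 where the GPU supports it (Ampere and newer), otherwise fp16
use_bf16 = torch.cuda.is_bf16_supported()
# Train
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=not use_bf16,
        bf16=use_bf16,
        output_dir="./outputs",
    ),
)
trainer.train()
# Save the LoRA adapters (not the full model) along with the tokenizer
model.save_pretrained("./my-fine-tuned-model")
tokenizer.save_pretrained("./my-fine-tuned-model")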
Training time: approximately 1-3 hours for 1,000 examples on an A10G GPU, depending mainly on sequence length and the number of epochs.
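A quick way to reason about run length is to count optimizer steps from the batch settings in the training script. The helper below is an illustrative sketch (the name `optimizer_steps` is made up here). It rounds partial batches up, which may differ slightly from how a given trainer version handles a dataset that does not divide evenly.

```python
import math

def optimizer_steps(num_examples, per_device_batch_size, grad_accum_steps, epochs, num_devices=1):
    """Return (effective_batch_size, total_optimizer_steps) for a training run."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs

# Settings from the training script: batch size 2, accumulation 4, 3 epochs, 1,000 examples.
effective_batch, total_steps = optimizer_steps(1000, 2, 4, 3)
print(f"effective batch size: {effective_batch}")
print(f"total optimizer steps: {total_steps}")
```

With the script's settings this gives an effective batch size of 8 and 375 optimizer steps. Timing a handful of steps and multiplying by that count gives a hardware-specific estimate of total training time.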
After Fine-Tuning: Merging and Serving
LoRA adapters are small files (~20-100MB) that load on top of the base model. For deployment, you can either serve the base model + adapters separately (smaller storage, slight latency overhead) or merge them into a single model.
Merging with Unsloth:
model.save_pretrained_merged("./merged-model", tokenizer, save_method="merged_16bit")
The merged model can be served with vLLM, Ollama, or any other inference server.
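Merging works because a LoRA adapter is just an additive low-rank update: the merged weight is W + (lora_alpha / r) * B * A. The toy sketch below uses plain Python lists and made-up 3x3 numbers to show that serving the adapter separately and serving the merged weight give the same output. It illustrates the arithmetic only and is not how Unsloth implements merging internally.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two same-shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(a, s):
    """Multiply every element of a matrix by a scalar."""
    return [[x * s for x in row] for row in a]

# Toy sizes: a 3x3 frozen weight W with a rank-1 adapter (B is 3x1, A is 1x3).
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]
B = [[1.0], [2.0], [0.0]]
A = [[0.5, 0.0, 1.0]]
r, lora_alpha = 1, 2
scaling = lora_alpha / r

x = [[1.0], [2.0], [3.0]]  # input as a column vector

# Separate serving: base output plus the scaled adapter path.
y_separate = add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))

# Merged serving: fold the adapter into the weight once, then do a single matmul.
W_merged = add(W, scale(matmul(B, A), scaling))
y_merged = matmul(W_merged, x)

print(y_separate)
print(y_merged)
```

The separate path costs two extra small matrix multiplications per adapted layer, which is the source of the slight latency overhead. The merged path removes that cost but fixes the adapter into the weights.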