Fine-Tuning an LLM with QLoRA on a Single GPU

QLoRA makes fine-tuning an 8B model practical on a single consumer GPU, and a 70B model on a single 80GB card. Here is the complete setup guide for fine-tuning Llama 3 with Unsloth.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 17, 2026

10 min read

// tags

#qlora #fine-tuning #llama #unsloth

QLoRA (Quantized Low-Rank Adaptation) is a fine-tuning technique that makes training large language models practical on consumer and mid-tier hardware. It combines two ideas: 4-bit quantization, which shrinks the base model's weights to roughly a quarter of their FP16 size, and LoRA (Low-Rank Adaptation), which adds small trainable weight matrices to the frozen base model. The result: you can fine-tune Llama 3 8B on a single RTX 4090 (24GB VRAM) or Llama 3 70B on a single A100 80GB, workloads that would otherwise require several times more VRAM. Fine-tuning is the right approach when you need to change how a model behaves (its style, format, persona, or domain focus), not when you want to inject new factual knowledge (use RAG for that).

What Fine-Tuning Is Actually For

The distinction between changing behavior and adding knowledge is important and frequently misunderstood:

Fine-tuning changes behavior, style, and format. It is effective for: teaching a model to always respond in JSON format, adjusting the tone to match your product's voice, improving performance on a specific structured task (SQL generation, code review), and teaching domain-specific vocabulary and conventions.

Fine-tuning does not inject durable factual knowledge. Training a model on a PDF of your company handbook does not reliably make the model "know" that handbook. The model may appear to learn the facts during training but will hallucinate inconsistently at inference time. For factual knowledge retrieval, use RAG. Fine-tune for format and behavior.

How QLoRA Works

Standard fine-tuning updates all model weights. For Llama 3 8B (8 billion parameters at FP16), that is 16GB just to store the model. Gradients and Adam optimizer states add roughly 3-7x more on top, depending on the precision they are kept in, so full fine-tuning of an 8B model requires 64-128GB of VRAM before activations are counted.
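The arithmetic behind the 16GB and 64-128GB figures can be sketched as below. The per-parameter byte counts are assumptions for two common Adam setups (they are not stated in this article), activation memory is ignored, and 1GB is taken as 1e9 bytes.

```python
def memory_gb(n_params, bytes_per_param):
    """Memory in GB (1 GB = 1e9 bytes) for a given per-parameter byte budget."""
    return n_params * bytes_per_param / 1e9

N_PARAMS = 8e9  # Llama 3 8B, rounded as in the text

# Bytes per parameter. These breakdowns are illustrative assumptions.
LOW = {"fp16_weights": 2, "fp16_gradients": 2, "adam_moments_fp16": 4}
HIGH = {
    "fp16_weights": 2,
    "fp16_gradients": 2,
    "fp32_master_weights": 4,
    "adam_moments_fp32": 8,
}

weights_gb = memory_gb(N_PARAMS, 2)               # FP16 weights only
low_gb = memory_gb(N_PARAMS, sum(LOW.values()))   # 8 bytes per parameter
high_gb = memory_gb(N_PARAMS, sum(HIGH.values())) # 16 bytes per parameter

print(f"weights only: {weights_gb:.0f} GB")               # weights only: 16 GB
print(f"full fine-tuning: {low_gb:.0f}-{high_gb:.0f} GB")  # full fine-tuning: 64-128 GB
```

Real runs add activation memory on top, which grows with batch size and sequence length.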

QLoRA reduces this through two mechanisms:

4-bit quantization: The frozen base model is loaded in NF4 (4-bit NormalFloat) format, reducing Llama 3 8B from 16GB to ~5GB.

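LoRA adapters: Instead of training the full model, LoRA inserts small trainable matrices (rank r, typically 8-64) alongside the attention projection weights in each layer. These adapters hold only a small fraction of the full model's parameters: roughly 0.1% at low rank on the attention projections, rising to about 2% at high rank on all linear layers. Only the adapters are trained, so gradients and optimizer states need very little memory.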

Combined: fine-tuning Llama 3 8B with QLoRA requires ~8-10GB VRAM at small batch sizes and moderate sequence lengths. An RTX 3080 (10GB VRAM) or RTX 4070 Ti (12GB VRAM) can handle it.
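To see how small the LoRA adapters are, here is a rough count of trainable parameters for rank 16 on the four attention projections. The layer dimensions are assumptions taken from the published Llama 3 8B configuration (hidden size 4096, 32 layers, 8 key/value heads of size 128), not from this article.

```python
def lora_params(d_in, d_out, r):
    """LoRA adds A (r x d_in) and B (d_out x r) beside a frozen d_out x d_in weight."""
    return r * (d_in + d_out)

# Assumed Llama 3 8B dimensions (from the public model config).
HIDDEN = 4096
KV_DIM = 8 * 128   # grouped-query attention: 8 KV heads of size 128
N_LAYERS = 32
RANK = 16

# (d_in, d_out) for each targeted projection
projections = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}

per_layer = sum(lora_params(d_in, d_out, RANK) for d_in, d_out in projections.values())
total = per_layer * N_LAYERS
share = total / 8e9  # 8B rounded, as in the text

print(f"{total:,} trainable parameters ({share:.2%} of 8B)")
# 13,631,488 trainable parameters (0.17% of 8B)
```

Raising the rank or also targeting the MLP projections increases the count, but it stays a small fraction of the base model.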

Dataset Format

For instruction fine-tuning, a common format is instruction-input-output records in JSONL, one JSON object per line:

{"instruction": "Classify this customer message as positive, negative, or neutral.", "input": "Your product completely fixed my problem. Amazing.", "output": "positive"}
{"instruction": "Classify this customer message as positive, negative, or neutral.", "input": "Still waiting for the refund I requested 3 weeks ago.", "output": "negative"}
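Malformed lines are easier to catch before training than during it. The following is a minimal sketch of a checker for the records above; the function name and the rule that `output` must be non-empty are illustrative choices, not part of any library.

```python
import json

REQUIRED_KEYS = ("instruction", "input", "output")

def parse_instruction_jsonl(text):
    """Parse JSONL text into records, raising ValueError on the first bad line."""
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_no}: invalid JSON ({exc.msg})") from exc
        if not isinstance(record, dict):
            raise ValueError(f"line {line_no}: expected a JSON object")
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing keys {missing}")
        if not str(record["output"]).strip():
            raise ValueError(f"line {line_no}: empty output")
        records.append(record)
    return records

instruction = "Classify this customer message as positive, negative, or neutral."
sample = "\n".join([
    json.dumps({"instruction": instruction,
                "input": "Your product completely fixed my problem. Amazing.",
                "output": "positive"}),
    json.dumps({"instruction": instruction,
                "input": "Still waiting for the refund I requested 3 weeks ago.",
                "output": "negative"}),
])

records = parse_instruction_jsonl(sample)
print(len(records), [r["output"] for r in records])  # 2 ['positive', 'negative']
```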

For conversation fine-tuning, use the chat format, with one list of role-tagged messages per line:

{"messages": [{"role": "system", "content": "You are a helpful assistant for a software project management tool."}, {"role": "user", "content": "How do I set up recurring tasks?"}, {"role": "assistant", "content": "To create a recurring task in Zlyqor..."}]}
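Chat records have more ways to go wrong: unknown roles, a misplaced system prompt, or a conversation that ends without an assistant turn to learn from. Below is a small structural check, written as a sketch; the specific rules are assumptions about what a sensible training record looks like, and your trainer's chat template may enforce others.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}

def validate_chat(record):
    """Return a list of problems in one {"messages": [...]} record (empty if none)."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            continue
        role = msg.get("role")
        if role not in ALLOWED_ROLES:
            problems.append(f"message {i}: unknown role {role!r}")
        elif role == "system" and i != 0:
            problems.append(f"message {i}: system message must come first")
        if not str(msg.get("content") or "").strip():
            problems.append(f"message {i}: empty content")
    last = messages[-1]
    if not isinstance(last, dict) or last.get("role") != "assistant":
        problems.append("last message must come from the assistant")
    return problems

good = {"messages": [
    {"role": "system", "content": "You are a helpful assistant for a software project management tool."},
    {"role": "user", "content": "How do I set up recurring tasks?"},
    {"role": "assistant", "content": "To create a recurring task in Zlyqor..."},
]}
bad = {"messages": [{"role": "user", "content": "How do I set up recurring tasks?"}]}

print(validate_chat(good))  # []
print(validate_chat(bad))   # ['last message must come from the assistant']
```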

Dataset size for instruction fine-tuning: 500-5,000 high-quality examples is typically sufficient for behavioral fine-tuning. More is not always better: 100 high-quality examples often outperform 10,000 low-quality ones. Generate your dataset from real examples of the exact behavior you want.
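One cheap quality pass is to drop near-identical prompts, which otherwise overweight whatever behavior they demonstrate. The sketch below treats two records as duplicates when their instruction and input match after lowercasing and whitespace normalization; that rule is an illustrative assumption and is deliberately conservative.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial variations compare equal."""
    return " ".join(text.lower().split())

def deduplicate(records):
    """Keep the first record for each normalized (instruction, input) pair."""
    seen = set()
    unique = []
    for record in records:
        key = (normalize(record["instruction"]), normalize(record.get("input", "")))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

instruction = "Classify this customer message as positive, negative, or neutral."
records = [
    {"instruction": instruction,
     "input": "Still waiting for the refund I requested 3 weeks ago.",
     "output": "negative"},
    {"instruction": instruction,
     "input": "  still waiting for the refund I requested 3 weeks ago. ",
     "output": "negative"},
    {"instruction": instruction,
     "input": "Your product completely fixed my problem. Amazing.",
     "output": "positive"},
]

unique = deduplicate(records)
print(len(records), "->", len(unique))  # 3 -> 2
```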

Tools: Unsloth vs Axolotl

Unsloth (github.com/unslothai/unsloth, 25k+ stars) is one of the fastest QLoRA training libraries. It implements hand-optimized GPU kernels (written in Triton) for LoRA training, and the project reports 2-3x faster training and 50-60% less memory usage compared to standard Hugging Face PEFT training. Best for: getting started quickly, single-GPU training, Llama and Mistral models.

Axolotl (github.com/axolotl-ai-cloud/axolotl, 8k+ stars) is more flexible and configurable. It supports more model families, more training strategies, and better multi-GPU training. Best for: production fine-tuning pipelines, complex training configurations, teams that need reproducible runs.

For a first fine-tuning project, use Unsloth. For production pipelines, evaluate Axolotl.

Complete Fine-Tuning Guide with Unsloth

Installation:

pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps "trl<0.9.0" peft accelerate bitsandbytes

Training script:

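from unsloth import FastLanguageModel  # import unsloth first so its patches apply
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

MAX_SEQ_LENGTH = 2048

# Load model with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Meta-Llama-3-8B-Instruct",
    max_seq_length=MAX_SEQ_LENGTH,
    load_in_4bit=True,
)

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,           # LoRA rank (higher = more capacity, more memory)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,  # adapter output is scaled by lora_alpha / r
    lora_dropout=0,
    bias="none",
)

# Load your dataset (instruction/input/output records, as in Dataset Format)
dataset = load_dataset("json", data_files="training_data.jsonl", split="train")

# SFTTrainer reads one "text" column, so build it from each record.
# This prompt layout is one illustrative choice; use the same one at inference.
def to_text(example):
    prompt = f"### Instruction:\n{example['instruction']}\n\n"
    if example.get("input"):
        prompt += f"### Input:\n{example['input']}\n\n"
    prompt += f"### Response:\n{example['output']}"
    # Append EOS so the model learns where a response ends
    return {"text": prompt + tokenizer.eos_token}

dataset = dataset.map(to_text)

# Use bf16 where the GPU supports it (Ampere and newer), otherwise fp16
use_bf16 = torch.cuda.is_bf16_supported()

# Train (these keyword arguments match the trl<0.9.0 pin from the install step)
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LENGTH,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # effective batch size: 2 * 4 = 8
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=not use_bf16,
        bf16=use_bf16,
        logging_steps=10,
        output_dir="./outputs",
    ),
)

trainer.train()

# Save the LoRA adapters (not the full model) and the tokenizer
model.save_pretrained("./my-fine-tuned-model")
tokenizer.save_pretrained("./my-fine-tuned-model")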

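Training time: approximately 1-3 hours for 1,000 examples on an A10G GPU with this configuration. The actual figure depends mainly on example length, so time a few steps on your own data before committing to a long run.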
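A quick way to sanity-check a time estimate is to count optimizer steps from the batch settings in the script and multiply by a measured time per step. The sketch below does that; the 12 seconds per step is a purely hypothetical placeholder, and the rounding of a partial final batch may differ slightly from what the trainer does.

```python
import math

def training_steps(n_examples, per_device_batch, grad_accum, epochs, n_gpus=1):
    """Return (effective batch size, total optimizer steps)."""
    effective_batch = per_device_batch * grad_accum * n_gpus
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs

def estimated_hours(total_steps, seconds_per_step):
    """Convert a step count and a measured step time into hours."""
    return total_steps * seconds_per_step / 3600

# Settings from the training script: batch 2, accumulation 4, 3 epochs
effective_batch, total_steps = training_steps(1000, 2, 4, 3)
print(effective_batch, total_steps)  # 8 375

# HYPOTHETICAL step time; replace with a value measured on your hardware
hours = estimated_hours(total_steps, seconds_per_step=12.0)
print(f"{hours:.2f} h")  # 1.25 h
```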

After Fine-Tuning: Merging and Serving

LoRA adapters are small files (~20-100MB) that load on top of the base model. For deployment, you can either serve the base model + adapters separately (smaller storage, slight latency overhead) or merge them into a single model.
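Merging works because a LoRA adapter is just an additive low-rank update: the adapted layer computes W x + (alpha / r) B A x, which equals (W + (alpha / r) B A) x. The toy example below checks that equivalence with plain Python lists; the matrix values and sizes are made up for illustration, and real merging of a 4-bit base model also involves dequantizing the weights first.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(a, s):
    return [[x * s for x in row] for row in a]

# Toy sizes: d_out = 2, d_in = 3, rank r = 1 (illustrative values only)
W = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]       # frozen base weight, d_out x d_in
B = [[0.5], [-1.0]]         # LoRA B, d_out x r
A = [[1.0, 0.0, 2.0]]       # LoRA A, r x d_in
alpha, r = 2.0, 1
scaling = alpha / r

x = [[1.0], [1.0], [1.0]]   # input as a column vector

# Serving base + adapter separately: W x + scaling * B (A x)
separate = add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))

# Serving a merged model: (W + scaling * B A) x
W_merged = add(W, scale(matmul(B, A), scaling))
merged = matmul(W_merged, x)

print(separate)  # [[9.0], [9.0]]
print(merged)    # [[9.0], [9.0]]
```

The two paths give the same output, which is why merging removes the extra adapter computation at inference without changing behavior.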

Merging with Unsloth:

model.save_pretrained_merged("./merged-model", tokenizer, save_method="merged_16bit")

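The merged model is a standard Hugging Face checkpoint, so vLLM and similar inference servers can load it directly. Ollama needs a GGUF file instead; Unsloth can export one with model.save_pretrained_gguf.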

Keep Reading

vLLM Serving Guide — Serving your fine-tuned model in production
Hugging Face Complete Guide — Uploading and managing fine-tuned models
Open Source LLM Benchmarks 2026 — Understanding how fine-tuned models compare to base models

Pristren builds AI-powered software for teams. Zlyqor is our all-in-one workspace — chat, projects, time tracking, AI meeting summaries, and invoicing — in one tool. Try it free.
