QLoRA: Fine-Tune a 65B LLM on a Single 48GB GPU

QLoRA combines 4-bit quantization with LoRA to make fine-tuning 65B-parameter models possible on a single 48 GB GPU. It introduces three techniques: NF4 quantization, double quantization, and paged optimizers.

Mahmudul Haque Qudrati

CEO & ML Engineer

March 10, 2026

9 min read

// tags

#qlora #4-bit #quantization #fine-tuning #nf4


Innovation 2: Double Quantization

The quantization constants (the per-block scale factors used to dequantize 4-bit weights back to bf16) are themselves stored as fp32 values. QLoRA quantizes these constants too, in effect quantizing the quantization. This reduces the memory overhead of quantization metadata from 0.5 bits per parameter to about 0.127 bits per parameter, which saves roughly 3 GB on a 65B model.
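The overhead figures follow from simple arithmetic. The sketch below assumes the block sizes reported in the QLoRA paper (64 weights per first-level block, 256 first-level constants per second-level block, 8-bit second-level quantization); the function and constant names are illustrative only.

```python
# Block sizes assumed from the QLoRA paper; names here are illustrative.
BLOCK_SIZE = 64           # weights sharing one quantization constant
SECOND_LEVEL_BLOCK = 256  # first-level constants sharing one second-level constant


def overhead_bits_per_param(double_quant: bool) -> float:
    """Bits of quantization metadata stored per model parameter."""
    if not double_quant:
        # One fp32 constant per block of 64 weights.
        return 32 / BLOCK_SIZE
    # One 8-bit constant per block of 64 weights, plus one fp32
    # second-level constant per 256 of those 8-bit constants.
    return 8 / BLOCK_SIZE + 32 / (BLOCK_SIZE * SECOND_LEVEL_BLOCK)


before = overhead_bits_per_param(double_quant=False)
after = overhead_bits_per_param(double_quant=True)

# Savings for a nominal 65e9-parameter model, converted from bits to GB.
saved_gb_65b = (before - after) * 65e9 / 8 / 1e9

print(f"without double quantization: {before:.3f} bits/param")
print(f"with double quantization:    {after:.3f} bits/param")
print(f"saved on 65B parameters:     {saved_gb_65b:.2f} GB")
```

The saving per parameter is small, but multiplied across 65 billion parameters it frees about 3 GB, which matters when the whole budget is 48 GB.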

Innovation 3: Paged Optimizers

GPU memory usage spikes during certain operations, most notably when gradient checkpointing processes a mini-batch with long sequences. These spikes can cause out-of-memory errors even when average memory use fits on the card. Paged optimizers use NVIDIA's unified memory to page optimizer states from GPU memory to CPU RAM automatically during a spike, and back to the GPU when they are needed for the optimizer update. This prevents crashes during training, typically without significant throughput loss.

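import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit loading with the QLoRA settings: NF4 data type plus double quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for matmuls
    bnb_4bit_use_double_quant=True,         # double quantization
)

# Llama 2 ships in 7B, 13B, and 70B sizes only; the 65B model is the original LLaMA.
# The repo id below is a community mirror and an assumption: substitute your own checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-65b",
    quantization_config=bnb_config,
    device_map="auto",
)
# Prepare the quantized model for training (e.g. upcasts norm layers)
model = prepare_model_for_kbit_training(model)

# The 4-bit base stays frozen; only the LoRA adapters are trained
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()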
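The configuration above covers NF4 and double quantization but does not select a paged optimizer. With the Hugging Face Trainer, paged optimizers are chosen by name through the `optim` argument of `TrainingArguments`. A minimal sketch follows; the output path, batch size, accumulation steps, and learning rate are illustrative assumptions rather than values from this article, and `bf16=True` assumes a GPU with bfloat16 support.

```python
# Keyword arguments for transformers.TrainingArguments.
# Everything except "optim" is an illustrative assumption.
TRAINING_KWARGS = {
    "output_dir": "qlora-out",           # hypothetical output path
    "optim": "paged_adamw_32bit",        # paged AdamW from bitsandbytes
    "gradient_checkpointing": True,      # trades compute for activation memory
    "bf16": True,                        # matches the bf16 compute dtype
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 16,
    "learning_rate": 1e-4,
}


def build_training_args():
    """Create TrainingArguments; transformers is imported only when called."""
    from transformers import TrainingArguments

    return TrainingArguments(**TRAINING_KWARGS)
```

Passing the result of `build_training_args()` to a `Trainer` together with the PEFT-wrapped model is enough to enable paging; no changes to the training loop are needed.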

The Guanaco Models

The QLoRA paper trained the Guanaco family of models by fine-tuning LLaMA on roughly 9,000 samples from the OASST1 dataset. Guanaco 65B, trained in about 24 hours on a single GPU for $300 in cloud GPU costs, reached 99.3% of ChatGPT's performance on the Vicuna benchmark as judged by GPT-4. The result demonstrated that high-quality instruction tuning does not require a massive compute budget.

GPU Memory Requirements

Model	Base (fp16)	QLoRA (4-bit)	GPU Needed
LLaMA 7B	14 GB	5 GB	RTX 3090
LLaMA 13B	26 GB	10 GB	RTX 4090
LLaMA 65B	130 GB	35 GB	48 GB GPU (e.g. RTX A6000)
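The fp16 column is simply two bytes per parameter, and the weights-only part of the QLoRA column can be estimated the same way. The sketch below assumes nominal parameter counts (7e9, 13e9, 65e9) and about 4.127 bits per parameter for NF4 with double quantization; the helper name is illustrative.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory needed to store the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# 4-bit weights plus double-quantized constants (assumed paper block sizes).
NF4_BITS = 4 + 0.127

# Nominal parameter counts; the real checkpoints differ slightly.
MODELS = [("LLaMA 7B", 7e9), ("LLaMA 13B", 13e9), ("LLaMA 65B", 65e9)]

for name, n_params in MODELS:
    fp16_gb = weight_memory_gb(n_params, 16)
    nf4_gb = weight_memory_gb(n_params, NF4_BITS)
    print(f"{name}: fp16 {fp16_gb:.0f} GB, NF4 weights {nf4_gb:.1f} GB")
```

The NF4 estimates come out below the QLoRA figures in the table. The difference is presumably the LoRA adapters, their gradients and optimizer states, and activations, none of which this weights-only estimate counts.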
