The Goal: 65B Parameters on One GPU
Even with LoRA, fine-tuning a 65B-parameter LLM requires holding the frozen base model in memory, which takes roughly 130GB in fp16 (2 bytes per parameter). The QLoRA paper (arXiv:2305.14314) by Dettmers et al. at the University of Washington addressed this with three innovations that together make 65B fine-tuning fit on a single 48GB GPU such as an A40 or RTX A6000.
Innovation 1: 4-bit NormalFloat (NF4) Quantization
Standard 4-bit integer quantization spaces its 16 levels uniformly across the value range, but neural network weights are not uniformly distributed: they follow a roughly normal (Gaussian) distribution. NF4 is a data type the paper describes as information-theoretically optimal for normally distributed data. It places its quantization levels at equal quantiles of the normal distribution, which puts more levels near zero (where weights cluster) and fewer at the extremes.
The result is that NF4 quantization incurs less rounding error than INT4 for neural network weights. Each weight is stored in 4 bits but dequantized to bf16 for computation. The base model goes from ~130GB (fp16) to ~32.5GB (4-bit) — a 4x compression.
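The quantile idea can be illustrated with a small standard-library sketch. It is a simplification, not the exact NF4 code book: the real data type uses an asymmetric construction so that zero is represented exactly, and the function names here (`normal_quantile_levels`, `quantize_block`, `dequantize_block`) are hypothetical.

```python
from statistics import NormalDist


def normal_quantile_levels(bits=4):
    """Return 2**bits levels at evenly spaced quantiles of N(0, 1), scaled to [-1, 1]."""
    n = 2 ** bits
    std_normal = NormalDist()
    # Midpoint probabilities avoid the infinite quantiles at p=0 and p=1.
    quantiles = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    scale = max(abs(q) for q in quantiles)
    return [q / scale for q in quantiles]


def quantize_block(weights, levels):
    """Absmax-scale one block of weights and map each to its nearest level index."""
    absmax = max(abs(w) for w in weights) or 1.0  # guard against an all-zero block
    codes = [
        min(range(len(levels)), key=lambda k: abs(levels[k] - w / absmax))
        for w in weights
    ]
    return codes, absmax


def dequantize_block(codes, absmax, levels):
    """Recover approximate weights from 4-bit codes and the block's scale factor."""
    return [levels[c] * absmax for c in codes]


levels = normal_quantile_levels()
codes, absmax = quantize_block([0.02, -0.11, 0.40, -0.03], levels)
restored = dequantize_block(codes, absmax, levels)
```

Each block stores only the 4-bit codes plus one scale factor (`absmax`). The spacing between adjacent levels is smallest around zero and largest at the ends of the range.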
Innovation 2: Double Quantization
The quantization constants (the per-block scale factors used to dequantize 4-bit weights back to bf16) are themselves stored as fp32 values. QLoRA quantizes these constants as well, reducing the memory overhead of quantization metadata from about 0.5 bits per parameter to about 0.127 bits per parameter.
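The overhead arithmetic can be checked directly. The sketch below assumes the block sizes reported in the paper: one fp32 constant per 64 weights, and, with double quantization, 8-bit constants that share one fp32 second-level constant per 256 first-level constants. The function names are illustrative only.

```python
WEIGHT_BLOCK = 64   # weights sharing one quantization constant (paper's setting)
CONST_BLOCK = 256   # first-level constants sharing one second-level constant


def single_quant_overhead_bits():
    """Metadata bits per parameter with one fp32 constant per weight block."""
    return 32 / WEIGHT_BLOCK


def double_quant_overhead_bits():
    """Metadata bits per parameter with 8-bit constants plus fp32 second-level constants."""
    return 8 / WEIGHT_BLOCK + 32 / (WEIGHT_BLOCK * CONST_BLOCK)


def saved_gb(n_params):
    """Memory saved by double quantization, in GB (1 GB = 1e9 bytes)."""
    saved_bits = single_quant_overhead_bits() - double_quant_overhead_bits()
    return saved_bits * n_params / 8 / 1e9


single = single_quant_overhead_bits()
double = double_quant_overhead_bits()
saving_65b = saved_gb(65e9)
```

Under these assumptions the overhead drops from 0.5 to roughly 0.127 bits per parameter, which is about 3 GB for a 65B-parameter model.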
Innovation 3: Paged Optimizers
GPU memory usage spikes during certain operations (gradient checkpointing, long sequences). These spikes can cause out-of-memory errors even when average memory use fits. Paged optimizers use NVIDIA's unified memory to automatically page optimizer states from GPU to CPU RAM during spikes, and back when needed. This prevents crashes during training without significant throughput loss.
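With Hugging Face transformers, peft, and bitsandbytes, NF4 and double quantization are enabled when the model is loaded:

```python
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Store weights in 4-bit NF4 with double quantization; compute in bf16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for computation
    bnb_4bit_use_double_quant=True,         # double quantization
)

# Llama 2 has no 65B variant; the 65B model is the original LLaMA.
# "huggyllama/llama-65b" is assumed here as a Hub copy of those weights.
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-65b",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # prepare the quantized model for training

# Attach trainable LoRA adapters; the 4-bit base weights stay frozen.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```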
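The loading code covers NF4 and double quantization but not the paged optimizer. With the Hugging Face `Trainer`, a paged optimizer is usually selected through the `optim` training argument. The sketch below is one plausible configuration rather than the paper's exact setup: the output directory, batch size, and accumulation steps are placeholder assumptions, and `build_training_args` is a hypothetical helper.

```python
# Keyword arguments intended for transformers.TrainingArguments.
qlora_training_kwargs = {
    "output_dir": "qlora-out",           # placeholder path (assumption)
    "optim": "paged_adamw_32bit",        # paged AdamW from bitsandbytes
    "gradient_checkpointing": True,      # trade compute for activation memory
    "per_device_train_batch_size": 1,    # placeholder value (assumption)
    "gradient_accumulation_steps": 16,   # placeholder value (assumption)
    "bf16": True,                        # match the bf16 compute dtype
}


def build_training_args(kwargs=qlora_training_kwargs):
    """Build TrainingArguments; imported lazily so the settings can be inspected alone."""
    from transformers import TrainingArguments

    return TrainingArguments(**kwargs)
```

Passing the result of `build_training_args()` to `Trainer` along with the PEFT-wrapped model lets optimizer states spill to CPU RAM during memory spikes instead of raising an out-of-memory error.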
The Guanaco Models
The QLoRA paper trained the Guanaco family of models by fine-tuning LLaMA on roughly 9,000 samples of the OASST1 dataset for about 24 hours on a single GPU. Guanaco 65B, trained for $300 in cloud GPU costs, reached 99.3% of ChatGPT's performance on the Vicuna benchmark as judged by GPT-4, demonstrating that high-quality instruction tuning does not require a massive compute budget.
GPU Memory Requirements
| Model | Base (fp16) | QLoRA (4-bit) | Example GPU |
|-------|-------------|---------------|-------------|
| LLaMA 7B | 14 GB | 5 GB | RTX 3090 (24 GB) |
| LLaMA 13B | 26 GB | 10 GB | RTX 4090 (24 GB) |
| LLaMA 65B | 130 GB | 35 GB | A40 or RTX A6000 (48 GB) |
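
The fp16 column follows from parameter count times bytes per parameter. As a rough check, the sketch below computes weights-only memory for each model; the helper name `weight_memory_gb` is hypothetical, and 1 GB is taken as 1e9 bytes.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory needed to store the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


MODELS = {"LLaMA 7B": 7e9, "LLaMA 13B": 13e9, "LLaMA 65B": 65e9}

# Map each model to (fp16 GB, 4-bit GB) for the weights only.
estimates = {
    name: (weight_memory_gb(n, 16), weight_memory_gb(n, 4))
    for name, n in MODELS.items()
}

for name, (fp16_gb, four_bit_gb) in estimates.items():
    print(f"{name}: fp16 {fp16_gb:.1f} GB, 4-bit {four_bit_gb:.1f} GB")
```

The 4-bit estimates come out lower than the QLoRA column, presumably because the table also allows for quantization constants, LoRA adapters, activations, and optimizer state.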