The Problem: Full Fine-Tuning Is Prohibitively Expensive
Fine-tuning a 175B parameter model like GPT-3 requires storing the full parameter gradients and optimizer states. With Adam, that is roughly 175B parameters × 4 bytes (fp32) × 3 (params + two Adam states) = over 2TB of GPU memory. Even with mixed precision, you need hundreds of GB across dozens of A100s. This is not accessible for most teams.
The LoRA Insight: Weight Updates Are Low-Rank
The key observation in the LoRA paper (arXiv:2106.09685) by Hu et al. is that the weight update matrix during fine-tuning has a low intrinsic rank. Rather than learning a full d×d update, you can decompose it as:
W = W_0 + BA
where W_0 is the frozen pretrained weight, B is d×r, and A is r×d, with rank r much smaller than d. For a 4096×4096 weight matrix with r=8, that is 4096×4096 = 16.7M parameters reduced to 2×(4096×8) = 65K parameters — a 256x reduction.
Rank and Alpha Hyperparameters
The rank r controls the expressivity of the adaptation. Higher rank captures more complex fine-tuning signals but uses more memory. The alpha hyperparameter scales the LoRA output: the effective update is (alpha/r) × BA. Setting alpha = 2r means the scaling is constant at 2 regardless of rank choice, which simplifies hyperparameter search. Common settings: r=4 or r=8 for most tasks, r=16 or r=64 for complex domain adaptation.
Memory Savings in Practice
| Model | Full Fine-Tune VRAM | LoRA (r=8) VRAM | |-------|--------------------|--------------------| | LLaMA 7B | ~56 GB | ~14 GB | | LLaMA 13B | ~104 GB | ~24 GB | | LLaMA 65B | ~520 GB | ~80 GB |
LoRA is typically applied to the attention weight matrices (Q, K, V, and output projection). Applying it to all linear layers including MLP further improves quality.
Zero Inference Latency
Because W_0 + BA can be pre-merged before serving, LoRA adds exactly zero latency at inference time. This is a critical advantage over adapters and prefix tuning, which add extra computation on every forward pass.
LoRA vs Adapters vs Prefix Tuning
Adapters insert small bottleneck layers between Transformer blocks — they add inference latency. Prefix tuning prepends trainable soft tokens to the input — it reduces the effective context window. LoRA modifies the weight matrices directly with no architectural change and no runtime overhead.
import torch
import torch.nn as nn
class LoRALinear(nn.Module):
def __init__(self, in_features, out_features, rank=8, alpha=16):
super().__init__()
self.weight = nn.Parameter(
torch.randn(out_features, in_features), requires_grad=False
)
self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
self.scale = alpha / rank
def forward(self, x):
base = x @ self.weight.T
lora = x @ self.lora_A.T @ self.lora_B.T
return base + self.scale * lora