PEFT Library Overview
The PEFT (Parameter-Efficient Fine-Tuning) library from Hugging Face implements a range of adapter methods. Four of the most widely used are:
- LoRA: Low-rank adaptation — adds trainable rank decomposition matrices to attention layers
- QLoRA: LoRA on a 4-bit quantized base model — the memory breakthrough
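- Prefix tuning: Prepends learned virtual-token vectors to the attention keys and values at each layer
- IA3: Infused Adapter by Inhibiting and Amplifying Inner Activations; learns scaling vectors, with even fewer trainable parameters than LoRA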
LoRA and QLoRA dominate production use because they're compatible with most architectures and their adapters can be merged back into the base model before deployment, so inference adds no extra latency.
LoRA: How It Works
LoRA decomposes weight updates into low-rank matrices. For a weight matrix W (d × k), instead of updating W directly, LoRA freezes W and learns two matrices: B (d × r) and A (r × k), where r << min(d, k). The effective weight update is ΔW = BA, scaled by lora_alpha / r, and only A and B are trained.
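To make the shapes concrete, here is a minimal NumPy sketch of a LoRA-augmented linear layer. It follows the LoRA paper's convention (B is d × r and A is r × k, so BA has the same shape as W). The dimensions, the seed, and the `lora_forward` helper are illustrative assumptions, not PEFT internals.

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, r = 64, 48, 4          # illustrative sizes, with r << min(d, k)
lora_alpha = 8               # the update is scaled by lora_alpha / r

W = rng.normal(size=(d, k))              # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, k))  # trainable, small random init
B = np.zeros((d, r))                     # trainable, zero init so the update starts at zero


def lora_forward(x, W, A, B, alpha, rank):
    """Compute W x plus the scaled low-rank update (alpha / rank) * B A x."""
    base = W @ x
    update = B @ (A @ x)     # never materializes the full d x k update matrix
    return base + (alpha / rank) * update


x = rng.normal(size=(k,))
y = lora_forward(x, W, A, B, lora_alpha, r)

# With B initialized to zero, the adapted layer reproduces the base layer exactly.
assert np.allclose(y, W @ x)
print(y.shape)  # (64,)
```

Because B starts at zero, training begins from the pretrained model's behavior and the adapter only gradually moves away from it.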
Key hyperparameters:
- r (rank): 8-64 typically. Higher r = more capacity but more memory. Start with r=16.
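- lora_alpha: Scaling factor; the low-rank update is multiplied by lora_alpha / r. A common choice is 2× the rank value (alpha=32 for r=16).
- target_modules: Which layers to apply LoRA to. For LLaMA: ["q_proj", "v_proj"], or all attention projections (["q_proj", "k_proj", "v_proj", "o_proj"])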
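The effect of r and target_modules on adapter size can be estimated by hand. The sketch below assumes LLaMA-7B's dimensions (hidden size 4096, 32 decoder layers, square attention projections); `lora_trainable_params` is a hypothetical helper, not a PEFT function.

```python
def lora_trainable_params(rank, module_shapes, num_layers):
    """Count LoRA parameters: each adapted (d x k) matrix adds rank * (d + k)."""
    per_layer = sum(rank * (d + k) for d, k in module_shapes)
    return per_layer * num_layers


HIDDEN = 4096   # LLaMA-7B hidden size (assumed)
LAYERS = 32     # LLaMA-7B decoder layers (assumed)

# q_proj and v_proj only
qv_only = lora_trainable_params(16, [(HIDDEN, HIDDEN)] * 2, LAYERS)
# q_proj, k_proj, v_proj, o_proj
all_attn = lora_trainable_params(16, [(HIDDEN, HIDDEN)] * 4, LAYERS)

print(f"{qv_only:,}")   # 8,388,608
print(f"{all_attn:,}")  # 16,777,216
```

Doubling r or doubling the number of target modules doubles the adapter size, which is still a fraction of a percent of the roughly 6.7B base parameters.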
Memory Usage: 7B fp16 vs QLoRA
| Configuration | VRAM Required |
|---|---|
| 7B fp16 (full fine-tune) | 28GB+ |
| 7B fp16 (inference only) | 14GB |
| 7B int8 + LoRA | 10GB |
| 7B int4 (QLoRA) | 4-5GB |
| 7B int4 + gradient checkpointing | 3.5-4GB |
QLoRA makes 7B fine-tuning possible on a single 24GB card such as an RTX 4090 or RTX 3090, and even on a 16GB RTX 4080.
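A back-of-the-envelope check on the weight portion of these figures is simple arithmetic: parameters times bits per parameter. The sketch below counts weights only and rounds the model to 7e9 parameters; `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS_7B = 7e9  # rounded parameter count (assumption)

for label, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{label}: {weight_memory_gb(PARAMS_7B, bits):.1f} GB")
# fp16: 14.0 GB
# int8: 7.0 GB
# int4: 3.5 GB
```

The gap between these numbers and the table is the rest of the training footprint: activations, gradients, adapter weights, optimizer state, and framework overhead. These vary with sequence length and batch size, so treat the table as approximate.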
Full QLoRA Training Script
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
from datasets import load_dataset
# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# With r=16 on these four projections, 16,777,216 parameters are trainable
# (about 0.25% of the ~6.7B total); the exact totals printed can vary with the PEFT version.
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")
training_args = TrainingArguments(
    output_dir="./llama2-7b-finetuned",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,  # Effective batch size = 16
    gradient_checkpointing=True,
    optim="paged_adamw_32bit",
    learning_rate=2e-4,
    bf16=True,
    logging_steps=25,
    save_strategy="epoch",
)
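# Passing tokenizer, dataset_text_field, and max_seq_length directly matches older TRL releases;
# newer releases expect these settings in an SFTConfig instead.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=training_args,
    dataset_text_field="text",
    max_seq_length=2048,
)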
trainer.train()
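The batch settings in the script determine how many optimizer updates one epoch performs. The following sketch works through that arithmetic. The helper names are hypothetical, the 10,000-example dataset size is a placeholder rather than the actual size of the dataset used above, and the exact step count reported by the trainer may differ slightly depending on how it rounds the final partial batch.

```python
import math


def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Examples consumed per optimizer update."""
    return per_device_batch * grad_accum_steps * num_devices


def optimizer_steps_per_epoch(num_examples, per_device_batch, grad_accum_steps, num_devices=1):
    """Approximate optimizer updates in one epoch, assuming one example per row (no packing)."""
    batch = effective_batch_size(per_device_batch, grad_accum_steps, num_devices)
    return math.ceil(num_examples / batch)


print(effective_batch_size(2, 8))               # 16
print(optimizer_steps_per_epoch(10_000, 2, 8))  # 625
```

With logging_steps=25, a run of this length would log the loss about 25 times per epoch.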
Merging and Pushing to HuggingFace Hub
After training, merge the LoRA weights back into the base model for standard inference:
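import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
# Reload the base model in fp16 (not 4-bit) so the adapter merges into full-precision weights
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16)
# Point this path at the adapter checkpoint directory the trainer actually saved
model = PeftModel.from_pretrained(base_model, "./llama2-7b-finetuned/checkpoint-final")
merged_model = model.merge_and_unload()
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
merged_model.push_to_hub("your-username/llama2-7b-custom")
tokenizer.push_to_hub("your-username/llama2-7b-custom")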
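Conceptually, merging folds the scaled low-rank update into the base weight once, so the merged model needs no adapter code at inference. The NumPy sketch below illustrates that equivalence on a single random matrix. It is an illustration of the idea under assumed sizes, not a description of how merge_and_unload is implemented.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, lora_alpha = 32, 24, 4, 8   # illustrative sizes

W = rng.normal(size=(d, k))   # base weight
B = rng.normal(size=(d, r))   # stand-ins for trained LoRA matrices
A = rng.normal(size=(r, k))
scale = lora_alpha / r

# Merging: add the scaled low-rank product to the base weight.
W_merged = W + scale * (B @ A)

x = rng.normal(size=(k,))
adapter_path = W @ x + scale * (B @ (A @ x))   # base plus adapter, as during training
merged_path = W_merged @ x                     # single matmul after merging

assert np.allclose(adapter_path, merged_path)
print(W_merged.shape)  # (32, 24)
```

The merged weight has the same shape as the original, which is why the result loads as an ordinary checkpoint.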
Dataset Formats
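Two dataset formats are common for instruction tuning. Depending on the TRL version, SFTTrainer may not consume them as-is, so they are usually mapped first to a single text field (as with dataset_text_field="text" in the script above) or to a list of role/content messages: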
Alpaca format (instruction/input/output fields):
{"instruction": "Summarize this text", "input": "Long text here...", "output": "Short summary."}
ShareGPT format (conversations array):
{"conversations": [{"from": "human", "value": "Hello"}, {"from": "gpt", "value": "Hi there!"}]}
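Records in either format can be normalized with a few lines of plain Python before training. In the sketch below, the "### Instruction / ### Input / ### Response" template and the role mapping are assumptions chosen for illustration; use whatever prompt template and role names your base model and TRL version expect.

```python
def alpaca_to_text(record):
    """Render an Alpaca-style record as one prompt/response string."""
    parts = [f"### Instruction:\n{record['instruction']}"]
    if record.get("input"):  # the input field is optional in Alpaca data
        parts.append(f"### Input:\n{record['input']}")
    parts.append(f"### Response:\n{record['output']}")
    return "\n\n".join(parts)


# Assumed mapping from ShareGPT speaker tags to chat roles
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}


def sharegpt_to_messages(record):
    """Convert ShareGPT-style turns into role/content messages."""
    return [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in record["conversations"]
    ]


alpaca = {"instruction": "Summarize this text", "input": "Long text here...", "output": "Short summary."}
sharegpt = {"conversations": [{"from": "human", "value": "Hello"}, {"from": "gpt", "value": "Hi there!"}]}

print(alpaca_to_text(alpaca))
print(sharegpt_to_messages(sharegpt))
```

With the Hugging Face datasets library, functions like these are typically applied through dataset.map to produce the field the trainer reads.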
The QLoRA paper reports that Guanaco-65B, fine-tuned with QLoRA, reaches 99.3% of ChatGPT's performance level on the Vicuna benchmark (as judged by GPT-4). This is strong evidence that 4-bit quantization with LoRA does not meaningfully degrade fine-tuning quality relative to full-precision training.