What Is Unsloth?
Unsloth is an open-source library that dramatically accelerates LLM fine-tuning by replacing the default PyTorch implementations used under Hugging Face TRL with hand-optimized GPU kernels written in Triton. The core insight: PyTorch's generic attention and LoRA backward passes leave significant performance on the table, so Unsloth rewrites them from scratch.
The reported results are striking: 2-5x faster training than vanilla TRL, up to 70% less VRAM usage (a 7B-parameter model loaded in 4-bit fits on an 8GB GPU), and no measured accuracy degradation across benchmark evaluations. It works with Llama 3, Mistral, Phi-3, Gemma, and most popular open-source architectures.
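To see why 4-bit loading is what makes an 8GB card plausible, here is a back-of-the-envelope estimate of weight memory alone. The helper name `weight_memory_gib` is made up for this illustration, and the figures deliberately ignore quantization constants, activations, LoRA adapters, and optimizer state, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


params_7b = 7e9  # nominal parameter count of a "7B" model

fp16_gib = weight_memory_gib(params_7b, 16)
int4_gib = weight_memory_gib(params_7b, 4)

print(f"16-bit weights: {fp16_gib:.1f} GiB")  # 16-bit weights: 13.0 GiB
print(f"4-bit weights:  {int4_gib:.1f} GiB")  # 4-bit weights:  3.3 GiB
```

At 16 bits the weights alone exceed 8GB; at 4 bits they leave several gigabytes for everything else.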
Why Is It Faster?
Unsloth achieves its speedup through three mechanisms:
Hand-tuned GPU kernels: the attention forward and backward passes are rewritten in Triton rather than relying on PyTorch's generic implementations. This eliminates unnecessary memory allocations in the critical path.
Smarter LoRA math: the standard LoRA backward pass computes gradients in a way that scales poorly with sequence length. Unsloth restructures the computation so that intermediate results are shared between gradient terms.
Chunked cross-entropy: instead of materializing the full logit matrix before computing the loss (which is enormous for large vocabularies), Unsloth processes it in chunks. This alone often saves 1-2GB of VRAM.
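The chunked cross-entropy idea can be sketched in plain PyTorch. This is not Unsloth's kernel, only a minimal illustration of the technique: the function name `chunked_cross_entropy` and the default `chunk_size` are assumptions, and activation checkpointing is used here as a stand-in so that each chunk's logits are recomputed during the backward pass instead of being kept in memory.

```python
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint


def _chunk_loss(hidden_chunk, lm_head_weight, labels_chunk):
    # Only a (chunk, vocab_size) slice of the logits exists at any moment.
    logits = hidden_chunk @ lm_head_weight.T
    return F.cross_entropy(
        logits.float(), labels_chunk, ignore_index=-100, reduction="sum"
    )


def chunked_cross_entropy(hidden, lm_head_weight, labels, chunk_size=1024):
    """Mean cross-entropy without building the full logit matrix.

    hidden:         (num_tokens, hidden_dim) final hidden states
    lm_head_weight: (vocab_size, hidden_dim) output projection
    labels:         (num_tokens,) target ids, with -100 marking ignored tokens
    """
    total = hidden.new_zeros((), dtype=torch.float32)
    for start in range(0, hidden.shape[0], chunk_size):
        end = start + chunk_size
        # checkpoint() discards the chunk's logits after the forward pass
        # and recomputes them when gradients are needed.
        total = total + checkpoint(
            _chunk_loss,
            hidden[start:end],
            lm_head_weight,
            labels[start:end],
            use_reentrant=False,
        )
    num_valid = (labels != -100).sum().clamp(min=1)
    return total / num_valid


# Small usage example with made-up sizes.
torch.manual_seed(0)
hidden = torch.randn(10, 8, requires_grad=True)
weight = torch.randn(32, 8, requires_grad=True)
labels = torch.randint(0, 32, (10,))
labels[3] = -100  # e.g. a padding position

loss = chunked_cross_entropy(hidden, weight, labels, chunk_size=4)
reference = F.cross_entropy(hidden @ weight.T, labels, ignore_index=-100)
print(torch.allclose(loss, reference, atol=1e-5))
```

For scale: with Llama 3's 128,256-token vocabulary, a batch of 2 sequences of 2,048 tokens produces about 1.96 GiB of fp32 logits if materialized all at once, which is in line with the savings quoted above.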
Installation and Basic Training Loop
pip install unsloth
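from unsloth import FastLanguageModel  # import unsloth first so its patches apply
import torch
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

# Load model with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3-8B-Instruct-bnb-4bit",
    max_seq_length=2048,
    dtype=None,  # auto-detect
    load_in_4bit=True,
)

# Attach LoRA adapters to the attention and MLP projections
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
)

# Placeholder data: any dataset with a "text" column works.
# "train.jsonl" is a hypothetical path; substitute your own file.
dataset = load_dataset("json", data_files="train.jsonl", split="train")

# Older TRL releases accept these keyword arguments directly; newer ones
# expect dataset_text_field and max_seq_length on SFTConfig instead.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,
        learning_rate=2e-4,
        # Use bf16 where the GPU supports it, otherwise fall back to fp16
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        output_dir="outputs",
    ),
)
trainer.train()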
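The `r=16` and batch settings above translate into concrete numbers. The sketch below works out how many parameters the LoRA adapters add and how many sequences the run sees. The layer dimensions are the published Llama 3 8B sizes (hidden size 4096, MLP size 14336, 32 layers, 8 KV heads of dimension 128), stated here as assumptions; the helper `lora_param_count` is made up for this example.

```python
def lora_param_count(shapes, r):
    """Each adapted (d_in, d_out) matrix gains A (r x d_in) and B (d_out x r)."""
    return sum(r * (d_in + d_out) for d_in, d_out in shapes)


# Assumed Llama 3 8B dimensions
HIDDEN, KV, MLP, LAYERS = 4096, 8 * 128, 14336, 32

per_layer_shapes = [
    (HIDDEN, HIDDEN),  # q_proj
    (HIDDEN, KV),      # k_proj (grouped-query attention: fewer KV heads)
    (HIDDEN, KV),      # v_proj
    (HIDDEN, HIDDEN),  # o_proj
    (HIDDEN, MLP),     # gate_proj
    (HIDDEN, MLP),     # up_proj
    (MLP, HIDDEN),     # down_proj
]

trainable = LAYERS * lora_param_count(per_layer_shapes, r=16)
print(f"Trainable LoRA parameters: {trainable:,}")  # Trainable LoRA parameters: 41,943,040

# Batch settings from the TrainingArguments above
effective_batch = 2 * 4                 # per-device batch x accumulation steps
sequences_seen = effective_batch * 60   # x max_steps
print(effective_batch, sequences_seen)  # 8 480
```

About 42 million trainable parameters is roughly half a percent of the 8B base model, which is why the optimizer state stays small. A 60-step run over 480 sequences is a smoke test rather than a full fine-tune.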
Unsloth Studio
For teams that prefer a GUI, Unsloth Studio provides a web-based interface for dataset upload, training configuration, and run monitoring. It sits on top of the same optimized kernels, so you get the speed benefits without writing any training code.
Benchmarks
On an RTX 3090 (24GB VRAM), Unsloth achieves roughly 2.2x the tokens-per-second of standard TRL for Llama 3 8B with 4-bit QLoRA, while keeping VRAM usage under 12GB. The full benchmark table covering multiple GPU types and model sizes is maintained in the repo.
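To check a tokens-per-second figure on your own hardware, a simple timing harness is enough. The sketch below is generic and not taken from the Unsloth repo; `measure_tokens_per_second` and `step_fn` are hypothetical names, and `step_fn` stands in for whatever runs one training step.

```python
import time


def measure_tokens_per_second(step_fn, tokens_per_step, steps=10, warmup=2):
    """Time `steps` calls of step_fn and return the average tokens per second.

    On a GPU, step_fn should end with torch.cuda.synchronize() so that
    asynchronous kernels have finished before the clock is read.
    """
    for _ in range(warmup):  # exclude compilation and cache warm-up
        step_fn()
    start = time.perf_counter()
    for _ in range(steps):
        step_fn()
    elapsed = time.perf_counter() - start
    return steps * tokens_per_step / elapsed


def speedup(baseline_tps, candidate_tps):
    """Ratio of candidate throughput to baseline throughput."""
    return candidate_tps / baseline_tps


# Dummy step that just sleeps, standing in for a real training step.
tps = measure_tokens_per_second(
    lambda: time.sleep(0.001), tokens_per_step=2 * 2048, steps=5, warmup=1
)
print(f"{tps:,.0f} tokens/s")
```

Running the same harness once with a standard TRL setup and once with Unsloth, on identical batch and sequence settings, gives the two numbers to pass to `speedup`.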