Unsloth: Fine-Tune Llama 3 and Mistral 2x Faster With 70% Less Memory

Unsloth rewrites attention and LoRA kernels in hand-tuned Triton, with reported gains of 2-5x faster training and 70% less VRAM usage, and no measured accuracy loss.

Mahmudul Haque Qudrati

CEO & ML Engineer

March 3, 2026

8 min read

// tags

#unsloth #fine-tuning #lora #qlora #speed


What Is Unsloth?

Unsloth is an open-source library that accelerates LLM fine-tuning by replacing the default attention and LoRA code paths used in Hugging Face TRL training with hand-optimized Triton kernels. The core insight: PyTorch's generic attention and LoRA backward passes leave significant performance on the table, so Unsloth rewrites them from scratch.

The reported results: 2-5x faster training than vanilla TRL, 70% less VRAM usage (with 4-bit quantization, a 7B-parameter model fits on an 8GB GPU), and no measured accuracy degradation across benchmark evaluations. It works with Llama 3, Mistral, Phi-3, Gemma, and most other popular open-source architectures.
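A rough back-of-the-envelope check of the 8GB figure, assuming it refers to a model loaded in 4-bit precision (the training example in this article sets load_in_4bit=True). The helper name weight_memory_gb is illustrative, and the estimate covers the weights only: quantization metadata, LoRA adapters, optimizer state, and activations all come on top.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the model weights alone, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 7e9  # a "7B" model, taken as exactly 7 billion parameters

fp16_gb = weight_memory_gb(n_params, 16)  # half-precision weights
int4_gb = weight_memory_gb(n_params, 4)   # 4-bit quantized weights

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f"4-bit weights:  {int4_gb:.1f} GB")
```

At 16 bits the weights alone come to about 14 GB, more than an 8GB card holds before training even starts. At 4 bits they come to about 3.5 GB, which leaves headroom for the rest of the training state.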

Why Is It Faster?

Unsloth achieves its speedup through three mechanisms:

Hand-tuned Triton kernels: the attention forward and backward passes are rewritten in Triton rather than relying on PyTorch's generic implementations. This eliminates unnecessary memory allocations in the critical path.

Smarter LoRA math: the standard LoRA backward pass computes gradients in a way that scales poorly with sequence length. Unsloth restructures the computation graph to share intermediate results.

Chunked cross-entropy: instead of materializing the full logit matrix before computing loss (which is enormous for large vocabularies), Unsloth processes it in chunks. This alone often saves 1-2GB of VRAM.
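The chunked cross-entropy idea can be illustrated with a minimal, dependency-free sketch. This is not Unsloth's kernel; it is plain Python with toy sizes, and every function name here is made up for illustration. It shows that computing the loss chunk by chunk gives the same value as computing it over the full logit matrix, while only chunk_size rows of logits exist at any moment. The last helper estimates the size of the full logit matrix, assuming Llama 3's 128,256-token vocabulary, float32 logits, and the batch size (2) and sequence length (2048) used in this article's training example.

```python
import math
import random


def nll(logits, target):
    """Negative log-likelihood of `target` for one row of logits (stable log-sum-exp)."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return lse - logits[target]


def project(hidden_row, weight):
    """One row of logits: hidden_row (d_model) times weight (vocab x d_model)."""
    return [sum(h * w for h, w in zip(hidden_row, w_row)) for w_row in weight]


def full_cross_entropy(hidden, weight, targets):
    # Materializes all n_tokens x vocab logits before reducing to a scalar.
    logits = [project(h, weight) for h in hidden]
    return sum(nll(row, t) for row, t in zip(logits, targets)) / len(targets)


def chunked_cross_entropy(hidden, weight, targets, chunk_size):
    # Only chunk_size x vocab logits are alive at any one time.
    total = 0.0
    for start in range(0, len(hidden), chunk_size):
        stop = start + chunk_size
        chunk_logits = [project(h, weight) for h in hidden[start:stop]]
        total += sum(nll(row, t) for row, t in zip(chunk_logits, targets[start:stop]))
    return total / len(targets)


def logits_gib(n_tokens, vocab_size, bytes_per_value):
    """Size of a full n_tokens x vocab_size logit matrix in GiB."""
    return n_tokens * vocab_size * bytes_per_value / 2**30


# Toy problem with random values (illustrative sizes only).
rng = random.Random(0)
n_tokens, d_model, vocab = 12, 4, 10
hidden = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(n_tokens)]
weight = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(vocab)]
targets = [rng.randrange(vocab) for _ in range(n_tokens)]

full_loss = full_cross_entropy(hidden, weight, targets)
chunked_loss = chunked_cross_entropy(hidden, weight, targets, chunk_size=5)
print(abs(full_loss - chunked_loss) < 1e-9)

# Full logit matrix for batch 2 x sequence 2048, Llama 3 vocabulary, float32.
print(f"{logits_gib(2 * 2048, 128_256, 4):.2f} GiB")
```

Under those assumptions the full logit matrix is just under 2 GiB, which is consistent with the 1-2GB saving quoted above. The real saving depends on the logit dtype, the batch shape, and what the backward pass has to keep around.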

Installation and Basic Training Loop

pip install unsloth

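from unsloth import FastLanguageModel, is_bfloat16_supported
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

# Training data: a JSONL file with one {"text": "..."} record per line.
# "train.jsonl" is a placeholder path; point it at your own dataset.
dataset = load_dataset("json", data_files="train.jsonl", split="train")

# Load model with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3-8B-Instruct-bnb-4bit",
    max_seq_length=2048,
    dtype=None,          # auto-detect
    load_in_4bit=True,
)

# Attach LoRA adapters to every attention and MLP projection
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
)

# Note: these SFTTrainer argument names follow the TRL API this example was
# written against; newer TRL releases may expect some of them on SFTConfig.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",   # column that holds the training text
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,   # effective batch size of 8
        warmup_steps=5,
        max_steps=60,
        learning_rate=2e-4,
        # Use bf16 where the GPU supports it, otherwise fall back to fp16
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        output_dir="outputs",
    ),
)
trainer.train()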
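To make the numbers in this configuration concrete, here is a small sketch that counts the LoRA parameters it trains. The layer shapes are an assumption taken from the public Llama 3 8B configuration (hidden size 4096, MLP size 14336, 32 layers, 8 key/value heads of dimension 128), not from this article, and the helper name lora_params is illustrative. Each adapted linear layer gains two low-rank matrices, A (r x in_features) and B (out_features x r).

```python
# Assumed Llama 3 8B shapes (from the public model config, not from this article).
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 8 * 128   # 8 key/value heads x head dimension 128 (grouped-query attention)
N_LAYERS = 32

# Values from the training example above.
R = 16
LORA_ALPHA = 16
PER_DEVICE_BATCH = 2
GRAD_ACCUM_STEPS = 4
MAX_STEPS = 60

# (in_features, out_features) of each targeted projection in one decoder layer.
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Parameters added by one adapter: A is r x in, B is out x r."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in shapes.values())
trainable = per_layer * N_LAYERS
scaling = LORA_ALPHA / R                        # factor applied to the B @ A update
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS
samples_seen = effective_batch * MAX_STEPS      # on a single GPU

print(f"trainable LoRA parameters: {trainable:,}")
print(f"LoRA scaling (alpha / r):  {scaling}")
print(f"effective batch size:      {effective_batch}")
print(f"samples seen in {MAX_STEPS} steps:  {samples_seen}")
```

Under these assumptions the adapters add roughly 42 million trainable parameters, around half a percent of the 8B base model. The 60-step run sees 480 examples, so it is a smoke test rather than a full fine-tune.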

Unsloth Studio

For teams that prefer a GUI, Unsloth Studio provides a web-based interface for dataset upload, training configuration, and run monitoring. It sits on top of the same optimized kernels, so you get the speed benefits without writing any training code.

Benchmarks

On an RTX 3090 (24GB VRAM), Unsloth achieves roughly 2.2x the tokens-per-second of standard TRL for Llama 3 8B with 4-bit QLoRA, while keeping VRAM usage under 12GB. The full benchmark table covering multiple GPU types and model sizes is maintained in the repo.
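To check a throughput figure like this on your own hardware, you would time a fixed number of training steps for each setup and compare tokens per second. The sketch below is a generic timing harness, not the methodology behind the numbers above. The names tokens_per_second, speedup, and dummy_step are illustrative, and dummy_step is a CPU stand-in for a real training step.

```python
import time


def tokens_per_second(step_fn, tokens_per_step, n_steps, warmup=2):
    """Average throughput of `step_fn` over `n_steps`, after a few warmup calls."""
    for _ in range(warmup):
        step_fn()  # warmup: excludes one-off costs such as compilation or caching
    start = time.perf_counter()
    for _ in range(n_steps):
        step_fn()
    elapsed = time.perf_counter() - start
    return tokens_per_step * n_steps / elapsed


def speedup(candidate_tps, baseline_tps):
    """How many times faster the candidate is than the baseline."""
    return candidate_tps / baseline_tps


def dummy_step():
    # Stand-in workload. A real benchmark would run one optimizer step here and,
    # on a GPU, synchronize the device before the clock is read.
    sum(i * i for i in range(10_000))


# Tokens per step assumes batch size 2 and sequence length 2048, as in the
# training example; the measured rate only reflects the dummy workload.
rate = tokens_per_second(dummy_step, tokens_per_step=2 * 2048, n_steps=5)
print(rate > 0)
```

Running the same harness once with a standard TRL step and once with an Unsloth step, then passing both rates to speedup, gives a ratio comparable to the figure quoted above.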
