DeepSpeed: Train Models With Billions of Parameters on Limited GPUs

Microsoft's DeepSpeed makes it practical to train 100B+ parameter models across distributed GPU clusters through ZeRO optimization stages and CPU/NVMe offloading, and it ships an RLHF pipeline (DeepSpeed-Chat).

Mahmudul Haque Qudrati

CEO & ML Engineer

May 16, 2026

8 min read

// tags

#deepspeed #microsoft #distributed-training #zero #fsdp


HuggingFace Trainer Integration

Adding DeepSpeed to a HuggingFace Trainer run requires only one change: pass the path to a DeepSpeed JSON config through the deepspeed argument of TrainingArguments:

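from transformers import TrainingArguments, Trainer

# `model` and `dataset` are assumed to be defined earlier in train.py
# (a loaded pretrained model and a tokenized training dataset).

training_args = TrainingArguments(
    output_dir="./outputs",
    deepspeed="ds_config_zero3.json",    # path to the DeepSpeed config; just add this
    per_device_train_batch_size=1,       # micro-batch size on each GPU
    gradient_accumulation_steps=16,      # effective batch = 1 x 16 x number of GPUs
    num_train_epochs=3,
    bf16=True,                           # needs bf16-capable GPUs (e.g. A100)
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()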

Run with: deepspeed --num_gpus 8 train.py
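
The file ds_config_zero3.json is referenced above but not shown. As a minimal sketch, assuming the HuggingFace integration's "auto" placeholders (which let the Trainer fill values in from TrainingArguments), a Stage 3 config could be generated like this. The helper name build_zero3_config is hypothetical, and the chosen options are one reasonable starting point rather than the article's actual file:

```python
import json


def build_zero3_config():
    """Return a minimal ZeRO Stage 3 config dict (illustrative, not tuned)."""
    return {
        # "auto" values are resolved by the HuggingFace Trainer from TrainingArguments
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "gradient_clipping": "auto",
        "bf16": {"enabled": "auto"},
        "zero_optimization": {
            "stage": 3,  # shard optimizer states, gradients, and parameters
            "overlap_comm": True,
            "contiguous_gradients": True,
            # gather full 16-bit weights when saving so the checkpoint is loadable as usual
            "stage3_gather_16bit_weights_on_model_save": True,
        },
    }


if __name__ == "__main__":
    # Write the config next to train.py under the name used in TrainingArguments.
    with open("ds_config_zero3.json", "w") as f:
        json.dump(build_zero3_config(), f, indent=2)
```

Leaving the batch-size fields as "auto" avoids the common failure where the JSON and TrainingArguments disagree on batch size or accumulation steps.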

ZeRO-Offload

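ZeRO-Offload extends Stage 2 by moving optimizer states and gradients, along with the optimizer step itself, from GPU memory to CPU RAM. Offloading to NVMe storage is also possible, but only with Stage 3 (ZeRO-Infinity). This allows training larger models on fewer GPUs at the cost of slower training, because every step moves data over PCIe and that bandwidth becomes the bottleneck: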

{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu"
    }
  }
}

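How far this stretches is a matter of memory arithmetic. With mixed-precision Adam, optimizer states take roughly 12 bytes per parameter in CPU RAM, and Stage 2 still keeps the 2-byte weights on the GPU. The original ZeRO-Offload work reported training models of around 13B parameters on a single 32GB V100. A 65B parameter model needs about 130GB for 16-bit weights and about 780GB for optimizer states, so it does not fit on a single A100 (80GB) with 256GB of CPU RAM under Stage 2 offload; that scale calls for Stage 3 with parameter and NVMe offload, or more GPUs.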
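
A back-of-envelope estimate helps check whether a given model is a plausible fit for Stage 2 with CPU offload. The sketch below assumes mixed-precision Adam on a single GPU: 2 bytes per parameter for 16-bit weights kept on the GPU, 12 bytes per parameter for optimizer states (fp32 master weights, momentum, variance) on the CPU, and 2 bytes per parameter for offloaded 16-bit gradients. Activations, buffers, and fragmentation are ignored, GB means 10^9 bytes, and the function name is hypothetical:

```python
BYTES_WEIGHTS_16BIT = 2   # fp16/bf16 weights, stay on the GPU in Stage 2
BYTES_GRADS_16BIT = 2     # gradients, assumed offloaded to CPU in 16-bit
BYTES_ADAM_STATES = 12    # fp32 master weights + momentum + variance


def estimate_stage2_offload_gb(num_params):
    """Rough per-device memory needs (in GB) for Stage 2 with CPU offload.

    Excludes activations and temporary buffers, so real usage is higher.
    """
    gb = 1e9
    gpu_weights = num_params * BYTES_WEIGHTS_16BIT / gb
    cpu_optimizer = num_params * BYTES_ADAM_STATES / gb
    cpu_gradients = num_params * BYTES_GRADS_16BIT / gb
    return {
        "gpu_weights_gb": gpu_weights,
        "cpu_optimizer_gb": cpu_optimizer,
        "cpu_gradients_gb": cpu_gradients,
        "cpu_total_gb": cpu_optimizer + cpu_gradients,
    }


if __name__ == "__main__":
    for billions in (7, 13):
        est = estimate_stage2_offload_gb(billions * 1e9)
        print(f"{billions}B params: "
              f"GPU weights ~{est['gpu_weights_gb']:.0f} GB, "
              f"CPU RAM ~{est['cpu_total_gb']:.0f} GB")
```

Treat the result as a lower bound: if the GPU figure alone is close to the card's capacity, there is no room left for activations and the run will not fit.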

DeepSpeed-Chat for RLHF

DeepSpeed-Chat provides an end-to-end RLHF pipeline in three stages (supervised fine-tuning → reward model training → PPO reinforcement learning), optimized for training at scale. It serves as DeepSpeed's reference implementation for training ChatGPT-style models on your own data.

DeepSpeed vs FSDP

PyTorch's FSDP (Fully Sharded Data Parallel) is the built-in alternative. FSDP is simpler to configure for standard cases and integrates more natively with PyTorch. DeepSpeed offers more optimization options and broader offloading: FSDP can offload parameters to CPU, but it has no NVMe offload. DeepSpeed is also often the better performer for very large models (100B+). For training up to about 13B parameters, FSDP is often sufficient and easier. For larger models, or when aggressive CPU or NVMe offload is needed, DeepSpeed is the stronger choice.
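
To make the configuration difference concrete, here is a sketch of the two sets of TrainingArguments keyword arguments side by side. They are kept as plain dicts so the snippet runs without transformers installed. The fsdp value "full_shard auto_wrap" is an assumption about a typical setup, and depending on the transformers version the wrap policy may need an additional fsdp_config:

```python
# Settings shared by both backends (mirrors the Trainer example in this article).
common_kwargs = {
    "output_dir": "./outputs",
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 16,
    "num_train_epochs": 3,
    "bf16": True,
}

# DeepSpeed: sharding behavior lives in an external JSON config file.
deepspeed_kwargs = {**common_kwargs, "deepspeed": "ds_config_zero3.json"}

# FSDP: sharding behavior is selected inline with a string of options.
fsdp_kwargs = {**common_kwargs, "fsdp": "full_shard auto_wrap"}


def differing_keys(a, b):
    """Return the keys present in only one dict or mapped to different values."""
    return {k for k in a.keys() | b.keys() if a.get(k) != b.get(k)}


if __name__ == "__main__":
    # Only the backend-selection argument changes between the two setups.
    print(sorted(differing_keys(deepspeed_kwargs, fsdp_kwargs)))
    # Usage inside train.py would be: TrainingArguments(**fsdp_kwargs)
```

An FSDP run is typically launched with torchrun (for example, torchrun --nproc_per_node 8 train.py) rather than the deepspeed launcher.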
