What Is DeepSpeed?
DeepSpeed is Microsoft's open-source library for distributed deep learning training. Its core contribution is ZeRO (Zero Redundancy Optimizer), a set of techniques that eliminate the memory redundancies in standard distributed training and allow models much larger than any single GPU's memory to be trained efficiently.
DeepSpeed integrates with PyTorch and HuggingFace Trainer, so adding it to an existing training script is often a configuration change rather than a code rewrite.
The ZeRO Stages
Standard data parallelism replicates the full model state (parameters, gradients, and optimizer states) on every GPU. ZeRO removes this redundancy progressively. The reduction figures below refer to model-state memory for mixed-precision Adam training and are approached as the GPU count grows; activations and temporary buffers are not included:
ZeRO Stage 1: shards optimizer states across GPUs. Each GPU stores optimizer state only for the parameters it is responsible for. Memory reduction: up to ~4x.
ZeRO Stage 2: additionally shards gradients. Memory reduction: up to ~8x.
ZeRO Stage 3: additionally shards model parameters. Each GPU stores only a fraction of the model weights. Memory reduction scales linearly with the number of GPUs: with 8 GPUs, each stores 1/8 of the parameters.
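The reduction factors above can be reproduced with a little arithmetic. The sketch below assumes the accounting used in the ZeRO paper for mixed-precision Adam (2 bytes per parameter for 16-bit weights, 2 for 16-bit gradients, 12 for fp32 optimizer state) and ignores activations and buffers; the function name `zero_model_state_gb` is made up for illustration and is not part of DeepSpeed.

```python
def zero_model_state_gb(num_params, num_gpus, stage):
    """Approximate per-GPU model-state memory in GB (1 GB = 1e9 bytes).

    Assumes mixed-precision Adam; activations and buffers are not counted.
    """
    params = 2.0 * num_params   # 16-bit weights
    grads = 2.0 * num_params    # 16-bit gradients
    optim = 12.0 * num_params   # fp32 weights + Adam momentum + variance

    if stage >= 1:              # Stage 1 shards optimizer states
        optim /= num_gpus
    if stage >= 2:              # Stage 2 also shards gradients
        grads /= num_gpus
    if stage >= 3:              # Stage 3 also shards parameters
        params /= num_gpus
    return (params + grads + optim) / 1e9


if __name__ == "__main__":
    # Stage 0 means plain data parallelism (no sharding).
    for stage in range(4):
        gb = zero_model_state_gb(7.5e9, 64, stage)
        print(f"stage {stage}: {gb:.1f} GB per GPU")
```

For a 7.5B-parameter model on 64 GPUs this gives roughly 120 GB without ZeRO, 31 GB with Stage 1, 17 GB with Stage 2, and 2 GB with Stage 3, which is where the ~4x and ~8x figures come from.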
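A typical Stage 3 configuration with CPU offload of both optimizer states and parameters is shown below (saved as ds_config_zero3.json). The "auto" values are filled in by the Hugging Face Trainer integration; plain DeepSpeed requires concrete numbers in their place:
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9
  },
  "bf16": {
    "enabled": true
  },
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 16
}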
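The two batch settings at the end of the configuration determine the global batch size together with the number of GPUs. A minimal sketch of that relationship follows; the helper name `effective_batch_size` is hypothetical, and the 8-GPU count is an assumption for the example.

```python
def effective_batch_size(micro_batch_per_gpu, grad_accum_steps, num_gpus):
    """Global batch size seen by one optimizer step.

    Mirrors DeepSpeed's rule:
    train_batch_size = train_micro_batch_size_per_gpu
                       * gradient_accumulation_steps
                       * number of GPUs
    """
    return micro_batch_per_gpu * grad_accum_steps * num_gpus


if __name__ == "__main__":
    # Values taken from the configuration above, assuming 8 GPUs.
    print(effective_batch_size(micro_batch_per_gpu=1,
                               grad_accum_steps=16,
                               num_gpus=8))
```

With a micro-batch of 1 and 16 accumulation steps on 8 GPUs, each optimizer step sees 128 samples. Changing the GPU count changes the global batch size unless the accumulation steps are adjusted to compensate.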
HuggingFace Trainer Integration
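Adding DeepSpeed to a HuggingFace Trainer run only requires passing the config path through the deepspeed argument. The batch size, accumulation steps, and precision set in TrainingArguments must match the values in the config file (or the config must use "auto" for them):
from transformers import TrainingArguments, Trainer

# `model` and `dataset` are assumed to be created earlier in train.py.
training_args = TrainingArguments(
    output_dir="./outputs",
    deepspeed="ds_config_zero3.json",  # just add this
    per_device_train_batch_size=1,     # matches train_micro_batch_size_per_gpu
    gradient_accumulation_steps=16,    # matches gradient_accumulation_steps
    num_train_epochs=3,
    bf16=True,                         # matches bf16.enabled
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()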
Run with: deepspeed --num_gpus 8 train.py
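To avoid keeping the same values in two places, the Hugging Face integration accepts "auto" for settings it can derive from TrainingArguments. The sketch below builds such a config as a Python dict and optionally writes it to disk; the helper name `build_zero3_config` and the output path are made up for illustration, and the set of keys is a trimmed-down assumption rather than a recommended configuration.

```python
import json


def build_zero3_config(path=None):
    """Return a ZeRO Stage 3 config that defers shared values to the Trainer."""
    config = {
        "zero_optimization": {
            "stage": 3,
            "overlap_comm": True,
            "contiguous_gradients": True,
            # "auto" lets the Trainer pick sizes based on the model.
            "reduce_bucket_size": "auto",
            "stage3_prefetch_bucket_size": "auto",
            "stage3_param_persistence_threshold": "auto",
        },
        # These are filled in from TrainingArguments at start-up.
        "bf16": {"enabled": "auto"},
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
    }
    if path is not None:
        with open(path, "w") as f:
            json.dump(config, f, indent=2)
    return config


if __name__ == "__main__":
    cfg = build_zero3_config("ds_config_auto.json")  # hypothetical file name
    print(json.dumps(cfg, indent=2))
```

The resulting file can be passed as the deepspeed argument in the same way as a hand-written one. TrainingArguments also accepts the dict itself in place of a path.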
ZeRO-Offload
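ZeRO-Offload extends Stage 2 by moving optimizer states, the gradients they consume, and the optimizer step itself to CPU RAM. Offloading to NVMe storage is part of the later ZeRO-Infinity work and requires Stage 3. Offloading allows training larger models on fewer GPUs at the cost of slower training, because CPU-GPU transfers over PCIe become the bottleneck: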
{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu"
    }
  }
}
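As a reference point, the ZeRO-Offload paper reports training models with over 13B parameters on a single 32 GB V100. A 65B-parameter model does not fit on a single 80 GB A100 with Stage 2 offload: the 16-bit parameters alone take about 130 GB and must stay on the GPU, and the optimizer states would need far more than 256 GB of CPU RAM. Models of that size need Stage 3 with parameter offload, multiple GPUs, or both.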
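A rough way to sanity-check single-GPU offload sizing is to split the model state between GPU and CPU. The sketch below assumes the accounting from the ZeRO-Offload paper (16-bit parameters on the GPU; 16-bit gradients and fp32 Adam states on the CPU) and ignores activations, communication buffers, and pinned-memory overhead, so real requirements are higher. The function name `offload_split_gb` is hypothetical.

```python
def offload_split_gb(num_params):
    """Return (gpu_gb, cpu_gb) of model state for Stage 2 with CPU offload.

    Lower bound only: activations and buffers are not counted.
    """
    gpu_gb = 2 * num_params / 1e9          # 16-bit parameters stay on the GPU
    cpu_gb = (2 + 12) * num_params / 1e9   # 16-bit grads + fp32 Adam states
    return gpu_gb, cpu_gb


if __name__ == "__main__":
    for billions in (13, 65):
        gpu_gb, cpu_gb = offload_split_gb(billions * 1e9)
        print(f"{billions}B params: ~{gpu_gb:.0f} GB GPU, ~{cpu_gb:.0f} GB CPU")
```

Under these assumptions a 13B model needs about 26 GB of GPU memory and 182 GB of CPU RAM, while a 65B model needs about 130 GB and 910 GB respectively.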
DeepSpeed-Chat for RLHF
DeepSpeed-Chat provides an end-to-end RLHF pipeline (supervised fine-tuning → reward model training → PPO reinforcement learning) optimized for training at scale. It follows the three-step recipe popularized by InstructGPT and is a practical starting point for training ChatGPT-style models on your own data.
DeepSpeed vs FSDP
PyTorch's FSDP (Fully Sharded Data Parallel) is the built-in alternative, and its full-sharding mode is conceptually equivalent to ZeRO Stage 3. FSDP is simpler to configure for standard cases and integrates more natively with PyTorch. DeepSpeed offers more optimization options and broader offload support: FSDP can offload parameters to CPU, but it has no equivalent of DeepSpeed's NVMe offload. As a rule of thumb, FSDP is often sufficient and easier for models up to roughly 13B parameters. For much larger models (100B+) or when aggressive CPU or NVMe offload is needed, DeepSpeed is usually the stronger choice.