The MMDiT Architecture
Stable Diffusion 3 Medium abandons the U-Net backbone of SD 1.x and SDXL in favor of a Multimodal Diffusion Transformer (MMDiT). The key innovation: image patches and text tokens pass through the same transformer blocks, each modality with its own set of weights, and are joined into a single sequence for attention. This allows information to flow in both directions. Images condition on text, but text representations also adapt to image content during the diffusion process.
This architecture change, detailed in the Stability AI announcement, explains the most visible improvement in SD3 Medium: text rendering. Previous SD models struggled to reliably render words inside images, in part because the text conditioning was fixed: the U-Net read frozen text embeddings through cross-attention and never updated them. With MMDiT, text tokens are processed and updated throughout denoising.
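The joint attention idea can be sketched in a few lines of plain Python. This is a toy illustration, not the SD3 implementation: the function and variable names are hypothetical, the weights are random, and the real blocks also include per-modality MLPs, normalization, timestep modulation, and multiple heads, all omitted here. It shows only the core pattern: separate projections per modality, one shared attention over the concatenated sequence.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def joint_attention(text_tokens, image_tokens, text_w, image_w):
    """Single-head joint attention over text and image tokens.

    text_w and image_w each hold their own "q", "k", "v" matrices,
    so the two modalities never share projection weights.
    """
    # Project each modality with its own weights, then concatenate
    # along the sequence axis (list concatenation of rows).
    q = matmul(text_tokens, text_w["q"]) + matmul(image_tokens, image_w["q"])
    k = matmul(text_tokens, text_w["k"]) + matmul(image_tokens, image_w["k"])
    v = matmul(text_tokens, text_w["v"]) + matmul(image_tokens, image_w["v"])

    # Every token attends to every other token, regardless of modality.
    scale = math.sqrt(len(q[0]))
    scores = [[sum(a * b for a, b in zip(q_row, k_row)) / scale for k_row in k]
              for q_row in q]
    out = matmul([softmax(row) for row in scores], v)

    # Split the joint sequence back into its two streams.
    n_text = len(text_tokens)
    return out[:n_text], out[n_text:]

rng = random.Random(0)
dim = 4
text_w = {name: random_matrix(dim, dim, rng) for name in ("q", "k", "v")}
image_w = {name: random_matrix(dim, dim, rng) for name in ("q", "k", "v")}
text = random_matrix(3, dim, rng)    # 3 text tokens
image = random_matrix(5, dim, rng)   # 5 image patches
text_out, image_out = joint_attention(text, image, text_w, image_w)
print(len(text_out), len(image_out))
```

Because the attention runs over the joint sequence, changing the image tokens changes the text outputs as well, which is the bidirectional exchange described above.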
Three Text Encoders
SD3 Medium uses three text encoders simultaneously:
- CLIP-L (77 token limit) — captures broad semantic meaning
- CLIP-G (77 token limit) — higher-capacity CLIP variant for style/composition
- T5-XXL (512 token limit) — captures detailed, structured language understanding
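The outputs of all three encoders are combined into a single conditioning input and passed to the MMDiT blocks. In practice, the T5-XXL encoder can be dropped to save ~10GB VRAM with minimal loss in overall image quality, which is useful on consumer hardware. The cost tends to show up mostly in long, detailed prompts and in rendered text. The HuggingFace model page documents this tradeoff.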
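How the three encoder outputs fit together is easiest to see as shape arithmetic. The sketch below is an assumption-laden illustration rather than library code: the hidden sizes (768, 1280, 4096) and the default of 256 T5 tokens reflect my understanding of the Diffusers SD3 pipeline, and `conditioning_shapes` is a hypothetical helper name.

```python
CLIP_L_DIM = 768    # assumed hidden size of the CLIP-L text encoder
CLIP_G_DIM = 1280   # assumed hidden size of the CLIP-G text encoder
T5_DIM = 4096       # assumed hidden size of T5-XXL
CLIP_TOKENS = 77    # token limit shared by both CLIP encoders

def conditioning_shapes(t5_tokens=256):
    """Return (sequence_shape, pooled_shape) of the combined text conditioning."""
    # The two CLIP token embeddings are concatenated channel-wise.
    clip_width = CLIP_L_DIM + CLIP_G_DIM
    # They are then zero-padded so their width matches T5.
    padding = T5_DIM - clip_width
    # CLIP and T5 tokens are concatenated along the sequence axis.
    seq_len = CLIP_TOKENS + t5_tokens
    sequence_shape = (seq_len, clip_width + padding)
    # The pooled CLIP vectors are concatenated separately.
    pooled_shape = (clip_width,)
    return sequence_shape, pooled_shape

print(conditioning_shapes())       # ((333, 4096), (2048,))
print(conditioning_shapes(512))    # ((589, 4096), (2048,))
```

Under these assumptions, raising the T5 budget to 512 tokens lengthens the sequence the transformer attends over, while the pooled vector stays the same size.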
Inference With Diffusers
import torch
from diffusers import StableDiffusion3Pipeline

# Passing None for the third text encoder and its tokenizer skips loading
# T5-XXL entirely, so its weights never occupy memory.
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    text_encoder_3=None,
    tokenizer_3=None,
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# Quoted text in the prompt is what the model is asked to render.
image = pipe(
    "A coffee shop menu sign with 'Flat White $4.50' written on a chalkboard",
    negative_prompt="",
    num_inference_steps=28,
    guidance_scale=7.0,
).images[0]
image.save("sd3-output.png")
50-Step vs 28-Step Inference
SD3 Medium is not tied to a single step count. It produces good results at 28 steps, the Diffusers pipeline default, and going to 50 steps yields only modest further gains. One likely reason is the flow matching objective: it trains the model toward near-straight paths between noise and image, which a sampler can follow in fewer steps, whereas SDXL outputs tend to degrade visibly as the step count drops. For rapid iteration, 20 steps is workable; for final renders, use 50.
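A toy Euler sampler makes the link between straight paths and step count concrete. This is a simplified sketch, not the Diffusers scheduler: `sigma_schedule`, `euler_sample`, and `ideal_velocity` are hypothetical names, the shift value of 3.0 is an assumption about the SD3 setting, and the "model" is an idealized velocity field that points along a perfectly straight line from noise to a known target.

```python
def sigma_schedule(num_steps, shift=3.0):
    """Noise levels from 1.0 (pure noise) down to 0.0 (clean sample).

    The shift warps a uniform schedule so more steps are spent at high
    noise levels; 3.0 is assumed here, not taken from the source.
    """
    sigmas = []
    for i in range(num_steps + 1):
        s = 1.0 - i / num_steps
        sigmas.append(shift * s / (1.0 + (shift - 1.0) * s))
    return sigmas

def euler_sample(noise, velocity_fn, num_steps):
    """Integrate dx/dsigma = velocity from sigma=1 to sigma=0 with Euler steps."""
    x = list(noise)
    sigmas = sigma_schedule(num_steps)
    for sigma, sigma_next in zip(sigmas, sigmas[1:]):
        v = velocity_fn(x, sigma)
        x = [xi + (sigma_next - sigma) * vi for xi, vi in zip(x, v)]
    return x

# Idealized stand-in for the network: on a straight path
# x_sigma = (1 - sigma) * target + sigma * noise, the velocity is constant.
target = [0.5, -1.0, 2.0]
noise = [1.3, 0.2, -0.7]

def ideal_velocity(x, sigma):
    return [n - t for n, t in zip(noise, target)]

for steps in (20, 28, 50):
    result = euler_sample(noise, ideal_velocity, steps)
    print(steps, [round(r, 6) for r in result])
```

With a perfectly straight path, every step count lands on the target up to floating-point error. A real model only approximates this, so extra steps still help, but the closer the learned paths are to straight lines, the less each additional step buys.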
Memory Optimization With fp8
The Diffusers SD3 documentation covers memory-saving options, including fp8 quantization for the transformer. Approximate VRAM use by configuration, with the VAE kept in fp16:
- Full fp16 (all encoders): ~18GB VRAM
- fp8 transformer, no T5: ~8GB VRAM
- CPU offload + no T5: ~6GB peak VRAM, runs on 8GB cards
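Most of these savings come from simple weight arithmetic: parameters times bytes per parameter. The sketch below is a back-of-the-envelope estimate under stated assumptions. The 2B transformer size comes from the release; the T5-XXL encoder size of roughly 4.7B parameters is my approximation; and `weight_gib` is a hypothetical helper. It counts weights only, so activations, the CLIP encoders, and the VAE account for the rest of the figures listed above.

```python
GIB = 1024 ** 3

def weight_gib(num_params, bytes_per_param):
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / GIB

TRANSFORMER_PARAMS = 2.0e9   # "2B" per the open-weights release
T5_XXL_PARAMS = 4.7e9        # ASSUMPTION: approximate T5-XXL encoder size

estimates = {
    "transformer fp16": weight_gib(TRANSFORMER_PARAMS, 2),
    "transformer fp8": weight_gib(TRANSFORMER_PARAMS, 1),
    "T5-XXL fp16": weight_gib(T5_XXL_PARAMS, 2),
}
for name, gib in estimates.items():
    print(f"{name}: {gib:.1f} GiB")
```

Under these assumptions the T5-XXL weights alone are larger than the transformer in either precision, which is why dropping that encoder is the single biggest lever on consumer cards.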
ComfyUI Workflow
ComfyUI's stable-diffusion-3 nodes handle the three-encoder setup automatically. Download the model checkpoint (safetensors format) from HuggingFace, place it in ComfyUI/models/checkpoints/, and use the SD3 node group. For best results, set the sampler to dpmpp_2m or euler and the scheduler to sgm_uniform.
The 2B open-weights release makes SD3 Medium one of the most capable freely downloadable image generation models as of mid-2026, positioned between SDXL (older architecture) and FLUX.1-dev (larger, non-commercial license).