The MMDiT Architecture

Stable Diffusion 3 Medium abandons the U-Net backbone of SD 1.x and SDXL in favor of a Multimodal Diffusion Transformer (MMDiT). The key innovation: image patches and text tokens flow through the same transformer blocks, each modality with its own set of weights, and the two token streams are joined in a shared attention operation. This allows bidirectional information exchange. Images condition on text, but text representations also adapt to image content during the diffusion process.

This architecture change, detailed in the Stability AI announcement, explains the most visible improvement in SD3 Medium: text rendering. Previous SD models struggled to reliably render words inside images because text reached the U-Net only through one-way cross-attention over a fixed encoder output. With MMDiT, text tokens are updated alongside image tokens in every block throughout denoising.
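The bidirectional exchange between text and image tokens can be illustrated with a toy joint attention step. This is a minimal sketch, not SD3's implementation: the function names are hypothetical, the per-modality "weights" are single scalars standing in for learned linear layers, and queries, keys, and values all reuse the same projected tokens for brevity.

```python
import math

def project(tokens, scale):
    # Stand-in for a learned linear layer; each modality gets its own scale.
    return [[scale * x for x in tok] for tok in tokens]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def joint_attention(image_tokens, text_tokens, image_scale=1.0, text_scale=0.5):
    # Separate "weights" per modality (toy scalars, an assumption of this sketch).
    img = project(image_tokens, image_scale)
    txt = project(text_tokens, text_scale)
    # Both modalities are placed in one sequence for a single attention pass.
    seq = txt + img
    dim = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in seq]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, seq)) for d in range(dim)])
    n_txt = len(txt)
    # Split the joint output back into the two streams.
    return out[n_txt:], out[:n_txt]

image = [[1.0, 0.0], [0.0, 1.0]]
image_out, text_out = joint_attention(image, [[1.0, 1.0]])
print(len(image_out), len(text_out))
```

Because every token attends over the full joint sequence, changing the text changes the image outputs, and changing the image changes the text outputs.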

Three Text Encoders

SD3 Medium uses three text encoders simultaneously:

CLIP-L (77 token limit) - captures broad semantic meaning

CLIP-G (77 token limit) - higher-capacity CLIP variant for style/composition

T5-XXL (512 token limit; Diffusers defaults to 256 via max_sequence_length) - captures detailed, structured language understanding

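The embeddings from all three encoders are combined before they reach the MMDiT blocks: the two CLIP outputs are concatenated channel-wise and zero-padded to T5's width, then joined with the T5 output along the sequence axis. In practice, T5-XXL can be dropped to save ~10GB VRAM with minimal quality loss, which is useful on consumer hardware. The HuggingFace model page documents this tradeoff.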
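How the three encoder outputs combine is easier to follow as shape arithmetic. The sketch below assumes the commonly cited hidden sizes (768 for CLIP-L, 1280 for CLIP-G, 4096 for T5-XXL) and a T5 length of 256 tokens; the function name is hypothetical and no model is loaded.

```python
CLIP_TOKENS = 77    # token limit shared by both CLIP encoders
CLIP_L_DIM = 768    # assumed CLIP-L hidden size
CLIP_G_DIM = 1280   # assumed CLIP-G hidden size
T5_DIM = 4096       # assumed T5-XXL hidden size

def sd3_context_shape(t5_tokens=256):
    # Channel-wise concat of the two CLIP sequences: 768 + 1280 = 2048.
    clip_dim = CLIP_L_DIM + CLIP_G_DIM
    # Zero-pad the CLIP channels up to T5's width so the sequences can be joined.
    padded_clip_dim = clip_dim + (T5_DIM - clip_dim)
    # Sequence-wise concat of CLIP tokens and T5 tokens.
    sequence = (CLIP_TOKENS + t5_tokens, padded_clip_dim)
    # Pooled CLIP vectors are concatenated into one conditioning vector.
    pooled = (clip_dim,)
    return {"sequence": sequence, "pooled": pooled}

print(sd3_context_shape())
```

With these assumptions the text context has 333 tokens of width 4096, and the pooled vector has 2048 entries.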

Inference With Diffusers

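import torch
from diffusers import StableDiffusion3Pipeline

# Skip the T5-XXL encoder at load time so its weights are never loaded.
# Setting pipe.text_encoder_3 = None after pipe.to("cuda") would be too late:
# the encoder would already have been read from disk and moved to the GPU.
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    text_encoder_3=None,
    tokenizer_3=None,
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# Quoted text inside the prompt is what the model is asked to render.
image = pipe(
    "A coffee shop menu sign with 'Flat White $4.50' written on a chalkboard",
    negative_prompt="",
    num_inference_steps=28,
    guidance_scale=7.0,
).images[0]
image.save("sd3-output.png")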

50-Step vs 28-Step Inference

The step count is a sampler setting chosen at inference time, not something fixed by training. SD3 Medium is often run at 50 steps but produces good results at 28 with minimal quality loss. Unlike SDXL, where fewer steps visibly degrade outputs, SD3 is trained with a rectified flow objective, which pushes sampling trajectories toward straight lines that a solver can follow in fewer steps. For rapid iteration, 20 steps is workable; for final renders, use 50.
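Why straighter trajectories tolerate fewer steps can be seen in a toy Euler sampler. This is a sketch under strong assumptions, not SD3's scheduler: the data point, the noise sample, and the oracle_velocity function are all hypothetical, and the velocity is the exact one for a perfectly straight path rather than a learned network.

```python
def euler_sample(start, velocity_fn, num_steps):
    """Integrate dx/dt = v(x, t) from t=1 (noise) down to t=0 (data)."""
    x = list(start)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        v = velocity_fn(x, t)
        # One Euler step backwards in time.
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x

# Toy setup (hypothetical): one fixed data point and one fixed noise sample.
data = [0.25, -1.0, 2.0]
noise = [1.5, 0.5, -0.75]

def oracle_velocity(x, t):
    # On the straight path x_t = (1 - t) * data + t * noise,
    # the velocity dx/dt is the constant (noise - data).
    return [n - d for n, d in zip(noise, data)]

for steps in (1, 28, 50):
    result = euler_sample(noise, oracle_velocity, steps)
    error = max(abs(r - d) for r, d in zip(result, data))
    print(steps, error)
```

With a perfectly straight path the error stays at floating-point noise for any step count. A trained model's velocity field is only approximately straight, which is why going from 28 to 50 steps can still yield a small improvement.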

Memory Optimization With fp8

The Diffusers SD3 documentation covers fp8 quantization for the transformer. Approximate VRAM use for three configurations:

Full fp16 (all encoders): ~18GB VRAM
fp8 transformer, no T5: ~8GB VRAM
CPU offload + no T5: ~6GB peak VRAM, runs on 8GB cards
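A rough back-of-the-envelope check on these figures: weight memory is parameter count times bytes per parameter. The sketch below uses the 2B transformer size from the release and assumes roughly 4.7B parameters for the T5-XXL encoder; that count is an approximation, and the helper name is hypothetical.

```python
def weight_gb(num_params, bytes_per_param):
    # Memory for weights only, in decimal gigabytes.
    return num_params * bytes_per_param / 1e9

TRANSFORMER_PARAMS = 2.0e9      # the "2B" MMDiT transformer
T5_XXL_ENCODER_PARAMS = 4.7e9   # assumption: approximate encoder size

transformer_fp16 = weight_gb(TRANSFORMER_PARAMS, 2)  # 2 bytes per fp16 weight
transformer_fp8 = weight_gb(TRANSFORMER_PARAMS, 1)   # 1 byte per fp8 weight
t5_fp16 = weight_gb(T5_XXL_ENCODER_PARAMS, 2)

print(transformer_fp16, transformer_fp8, t5_fp16)
```

Under these assumptions the T5 encoder alone accounts for roughly 9 to 10GB in fp16, consistent with the savings quoted for dropping it. The totals in the list above are higher than the sum of these weights because they also include the CLIP encoders, the VAE, and activations.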

ComfyUI Workflow

ComfyUI's SD3 nodes handle the three-encoder setup automatically. Download the model checkpoint (safetensors format) from HuggingFace, place it in ComfyUI/models/checkpoints/, and use the SD3 node group. For best results, set the sampler to dpmpp_2m or euler and the scheduler to sgm_uniform.

The 2B open-weights release makes SD3 Medium one of the more capable freely downloadable image generation models as of mid-2026, positioned between SDXL (older U-Net architecture) and FLUX.1-dev (larger, non-commercial license).

Stable Diffusion 3 Medium: Stability AI's Multimodal Diffusion Transformer

Mahmudul Haque Qudrati
