From 50 Steps to 1
Standard diffusion models denoise an image over 20–50 steps. Each step requires a full forward pass through the UNet (roughly 2.6 billion parameters in SDXL's case), which makes real-time generation impractical. SDXL-Turbo collapses this to 1–4 steps using Adversarial Diffusion Distillation (ADD), without the blurry output that plagued earlier distillation attempts.
How Adversarial Diffusion Distillation Works
ADD combines two loss signals:
- Score distillation loss: the student (SDXL-Turbo) is trained to match the denoised predictions of a frozen, multi-step SDXL teacher
- Adversarial loss: a discriminator trained to tell real images from student samples pushes the student to produce sharp, photorealistic outputs even in one step
The adversarial component is the key innovation. Score distillation alone tends to produce over-smoothed images; the discriminator restores high-frequency detail.
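The way the two signals combine can be sketched in a few lines. This is a toy illustration on plain lists of numbers, not the training code: the hinge form of the adversarial terms, the squared-error distance, the function names, and the `distill_weight` default are all assumptions made for the example.

```python
from statistics import fmean


def generator_adversarial_loss(fake_scores):
    """Student-side adversarial term: reward high discriminator scores on student samples."""
    return -fmean(fake_scores)


def discriminator_hinge_loss(real_scores, fake_scores):
    """Discriminator objective (assumed hinge form): score real images above +1, fakes below -1."""
    real_term = fmean([max(0.0, 1.0 - s) for s in real_scores])
    fake_term = fmean([max(0.0, 1.0 + s) for s in fake_scores])
    return real_term + fake_term


def distillation_loss(student_pred, teacher_pred):
    """Mean squared distance between the student output and the frozen teacher's prediction."""
    return fmean([(s - t) ** 2 for s, t in zip(student_pred, teacher_pred)])


def add_student_loss(fake_scores, student_pred, teacher_pred, distill_weight=2.5):
    """Total student loss: adversarial term plus weighted distillation term.

    distill_weight is an illustrative placeholder, not a tuned value.
    """
    adversarial = generator_adversarial_loss(fake_scores)
    distill = distillation_loss(student_pred, teacher_pred)
    return adversarial + distill_weight * distill
```

With only the distillation term, the student would be rewarded for averaging toward the teacher's prediction, which is where the over-smoothing comes from; the adversarial term adds a penalty whenever the discriminator can still tell the sample is generated.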
import torch
from diffusers import AutoPipelineForText2Image

# Load the distilled SDXL-Turbo weights in half precision
pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.to("cuda")

prompt = "a golden retriever playing in autumn leaves, cinematic lighting"

# 1-step generation
image = pipe(
    prompt=prompt,
    num_inference_steps=1,
    guidance_scale=0.0,  # SDXL-Turbo was trained without CFG, so guidance stays off
).images[0]
image.save("output_1step.png")

# 4-step for higher quality
image_4step = pipe(
    prompt=prompt,
    num_inference_steps=4,
    guidance_scale=0.0,  # still off at 4 steps
).images[0]
image_4step.save("output_4step.png")
Note that classifier-free guidance (CFG) is disabled with guidance_scale=0.0 at every step count, not only at 1 step. SDXL-Turbo was trained without CFG, so it makes no use of guidance_scale or negative_prompt.
SDXL-Lightning: ByteDance's Alternative
ByteDance released SDXL-Lightning using a different distillation approach (progressive adversarial distillation). At 4 steps, SDXL-Lightning tends to produce slightly sharper details than SDXL-Turbo, though SDXL-Turbo has better semantic coherence at 1 step. For production use, 4-step SDXL-Lightning is usually the better trade-off.
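import torch
from diffusers import StableDiffusionXLPipeline, EulerDiscreteScheduler
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

base = "stabilityai/stable-diffusion-xl-base-1.0"
repo = "ByteDance/SDXL-Lightning"
ckpt = "sdxl_lightning_4step_unet.safetensors"  # checkpoint distilled for 4 steps

pipe = StableDiffusionXLPipeline.from_pretrained(base, torch_dtype=torch.float16).to("cuda")
# The checkpoint is a safetensors file, so it must be read with load_file, not torch.load
pipe.unet.load_state_dict(load_file(hf_hub_download(repo, ckpt), device="cuda"))
# "trailing" spacing makes sampling start from the final (noisiest) timestep
pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")

# Match num_inference_steps to the checkpoint and keep CFG off
image = pipe("a golden retriever playing in autumn leaves", num_inference_steps=4, guidance_scale=0.0).images[0]
image.save("lightning_4step.png")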
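Why the "trailing" setting matters is easier to see with the timesteps written out. The sketch below re-derives the two spacing rules in plain Python as I understand them from the diffusers schedulers; the function names are hypothetical and the arithmetic should be read as an illustration, not as the library's code.

```python
def trailing_timesteps(num_inference_steps, num_train_timesteps=1000):
    """Spread steps backwards from the last training timestep (assumed 'trailing' rule)."""
    step = num_train_timesteps / num_inference_steps
    return [round(num_train_timesteps - i * step) - 1 for i in range(num_inference_steps)]


def leading_timesteps(num_inference_steps, num_train_timesteps=1000, steps_offset=0):
    """Spread steps upwards from timestep 0 (assumed 'leading' rule)."""
    step = num_train_timesteps // num_inference_steps
    return [i * step + steps_offset for i in reversed(range(num_inference_steps))]


print(trailing_timesteps(4))  # [999, 749, 499, 249]
print(leading_timesteps(4))   # [750, 500, 250, 0]
print(trailing_timesteps(1))  # [999]
print(leading_timesteps(1))   # [0]
```

With only a handful of steps, the leading rule never visits the noisiest timestep, and in the one-step case it would start at timestep 0, where almost no noise is expected. A few-step distilled model is trained to start from pure noise, which is why the scheduler is reconfigured.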
Real-Time Interactive Generation
With 1-step generation running in under 200 ms on an A10G GPU, it becomes practical to regenerate the image on every user input event. Pairing SDXL-Turbo with a live-updating Gradio interface gives a real-time experience in which the image refreshes as you type the prompt or move a slider.
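A minimal sketch of such a live demo follows. It assumes Gradio's `Interface` with `live=True`, which re-runs the function whenever the input changes; the helper names `make_generate_fn`, `build_demo`, and `main` are hypothetical, and the heavy imports are deferred so the wrapper can be exercised without a GPU.

```python
def make_generate_fn(pipe):
    """Wrap a diffusers-style pipeline in a one-step, guidance-free callable."""

    def generate(prompt):
        if not prompt or not prompt.strip():
            return None  # nothing typed yet, so leave the output empty
        result = pipe(prompt=prompt, num_inference_steps=1, guidance_scale=0.0)
        return result.images[0]

    return generate


def build_demo(generate):
    """Build a live-updating UI; gradio is imported lazily to keep the wrapper dependency-free."""
    import gradio as gr

    return gr.Interface(
        fn=generate,
        inputs=gr.Textbox(label="Prompt"),
        outputs=gr.Image(label="Result"),
        live=True,  # re-run generate on every change instead of waiting for Submit
    )


def main():
    """Load SDXL-Turbo and launch the demo (requires a CUDA GPU)."""
    import torch
    from diffusers import AutoPipelineForText2Image

    pipe = AutoPipelineForText2Image.from_pretrained(
        "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16"
    ).to("cuda")
    build_demo(make_generate_fn(pipe)).launch()
```

Calling `main()` starts the demo. Because each keystroke triggers a full generation, the per-image latency directly bounds how responsive the interface feels.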
WebGPU acceleration via the Transformers.js port brings similar (though slower) capability to the browser without a server.
Comparison: SD 1.5 vs SDXL-Turbo
Standard SD 1.5 at 20 steps takes roughly 1.2 seconds on an A10G. SDXL-Turbo at 1 step takes about 180 ms on the same hardware, a roughly 6.7x speedup (1.2 s / 0.18 s). Both models generate 512×512 images by default, so the gain is in speed and image quality rather than in resolution. The trade-off is the non-commercial license on SDXL-Turbo.
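Numbers like these are easy to re-measure on your own hardware. The helper below is a rough sketch using only the standard library; `median_latency` and `speedup` are hypothetical names, and the optional `sync` argument is an assumption for GPU work, where something like `torch.cuda.synchronize` should be passed so that asynchronous kernels are included in the timing.

```python
import statistics
import time


def median_latency(fn, warmup=2, runs=5, sync=None):
    """Return the median wall-clock time of fn() in seconds after a few warm-up calls."""
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for queued GPU work before stopping the clock
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# The figures quoted above: 1.2 s for SD 1.5 at 20 steps, 0.18 s for SDXL-Turbo at 1 step
print(f"{speedup(1.2, 0.18):.1f}x")  # 6.7x
```

The warm-up calls matter in practice because the first pipeline invocation typically includes one-off costs that would otherwise inflate the measurement.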