What are Diffusion Models?
Diffusion models have dominated generative image AI since 2022 because they produce higher-quality and more diverse outputs than the GANs (Generative Adversarial Networks) that preceded them. The key insight: instead of teaching one network to map noise directly to an image in a single shot (as GANs do), train a network to remove a small amount of noise, a much easier task, then iterate 20–50 times to turn pure noise into a finished image.
Variants include DDPM (the original formulation), DDIM (faster deterministic sampling), and rectified flow matching (used in Stable Diffusion 3 and FLUX for further quality gains with fewer steps).
How it works
Forward process (training)
During training, Gaussian noise is added to each image according to a schedule of (typically) 1000 steps, until the image is indistinguishable from pure noise. The network is trained to predict the noise that was added at a randomly sampled step, so that it can later be subtracted.
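The forward process has a convenient closed form: the noisy image at step t can be computed directly from the clean image, without iterating through all the earlier steps. A minimal sketch in numpy (the linear beta schedule matches the original DDPM paper; the 8x8 array is a stand-in for a real image):

```python
import numpy as np

def forward_diffusion(x0, t, alpha_bars, rng):
    """Noise a clean image x0 to timestep t in one shot:
    x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps
    """
    eps = rng.standard_normal(x0.shape)
    abar = alpha_bars[t]
    xt = np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * eps
    return xt, eps  # eps is the target the network learns to predict

# Linear beta schedule over T = 1000 steps (DDPM's original choice).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))  # stand-in for a tiny "image"
xt, eps = forward_diffusion(x0, T - 1, alpha_bars, rng)
# At t = T-1, alpha_bar is near zero, so x_t is almost pure Gaussian noise.
```

This closed form is what makes training efficient: each batch samples a random t per image rather than simulating the whole 1000-step chain.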
Reverse process (generation)
At inference, start from pure Gaussian noise. Run the network 20–50 times (samplers such as DDIM let inference skip most of the 1000 training steps), each time subtracting a portion of the predicted noise. After enough iterations, a coherent image emerges.
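The sampling loop above can be sketched as a plain DDPM reverse process. This is a toy sketch: the trained U-Net is replaced by a hypothetical stand-in model, and only 50 steps are used.

```python
import numpy as np

def ddpm_sample(predict_noise, shape, betas, rng):
    """DDPM reverse process: start from noise, repeatedly subtract predicted noise."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)  # start from pure Gaussian noise
    for t in reversed(range(len(betas))):
        eps_hat = predict_noise(x, t)  # network's estimate of the noise in x
        # Remove the predicted noise, rescaled for this step (DDPM posterior mean).
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:  # no fresh noise is added at the final step
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

# Hypothetical stand-in for a trained network: always predicts zero noise.
dummy_model = lambda x, t: np.zeros_like(x)

rng = np.random.default_rng(0)
img = ddpm_sample(dummy_model, (8, 8), np.linspace(1e-4, 0.02, 50), rng)
```

With a real trained model in place of `dummy_model`, the same loop turns noise into an image; faster samplers mainly change how the per-step update is computed.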
Text conditioning
Text prompts are converted to embeddings (via CLIP or a similar text encoder) and injected into each denoising step, typically through cross-attention layers, steering the output toward an image that matches the prompt's semantics.
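In practice the steering is usually strengthened with classifier-free guidance: the network predicts noise twice per step, once with the text embedding and once with an empty-prompt embedding, and the two predictions are blended. A minimal sketch, with a hypothetical toy predictor standing in for the real network:

```python
import numpy as np

def guided_noise(predict_noise, x, t, text_emb, null_emb, scale=7.5):
    """Classifier-free guidance: extrapolate from the unconditional prediction
    toward the text-conditioned one. scale=7.5 is a common default."""
    eps_cond = predict_noise(x, t, text_emb)    # conditioned on the prompt
    eps_uncond = predict_noise(x, t, null_emb)  # conditioned on an empty prompt
    return eps_uncond + scale * (eps_cond - eps_uncond)

# Hypothetical toy predictor: noise estimate biased by the embedding's mean.
toy_model = lambda x, t, emb: x * 0.1 + emb.mean()

x = np.ones((4, 4))
text_emb = np.full(8, 0.5)  # stand-in for a CLIP text embedding
null_emb = np.zeros(8)      # stand-in for the empty-prompt embedding
eps = guided_noise(toy_model, x, 0, text_emb, null_emb)
```

Higher guidance scales follow the prompt more literally at some cost in diversity, which is why it is usually exposed as a user-facing knob.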
Common use cases
- Text-to-image generation (Stable Diffusion, DALL-E, Imagen, FLUX)
- Text-to-video generation (Sora, Veo, Kling — all diffusion transformers)
- Image editing via latent-space manipulation (inpainting, outpainting)
- Super-resolution and upscaling
- Text-to-audio and music generation (Stable Audio)