What is Stable Diffusion?
Stable Diffusion pioneered accessible AI image generation. Unlike earlier models (DALL-E, Imagen) that ran only on cloud GPUs, Stable Diffusion's latent-space design lets it run on consumer hardware — a single NVIDIA RTX 3060 can generate images in seconds.
The openly licensed weights enabled an entire ecosystem: custom checkpoints, LoRA fine-tunes, ControlNet for pose conditioning, and thousands of community tools. This made Stable Diffusion the backbone of interfaces like Automatic1111, ComfyUI, and Invoke AI.
How it works
Latent-space diffusion
Stable Diffusion runs diffusion in a compressed latent space rather than in pixel space: a variational autoencoder (VAE) encodes a 512×512 image into a 64×64 latent (8× smaller per spatial dimension), and a denoising network is trained there to reverse a gradual noising process. At inference, generation starts from pure noise and progressively removes it, guided at every step by your text prompt's embedding.
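The reverse process can be sketched as a toy loop. This is a minimal illustration, not the real model: `toy_denoiser` stands in for the trained U-Net, and a constant stands in for the "clean" latent it would predict. The shapes mirror SD's actual latent layout (a 64×64×4 tensor for a 512×512 image).

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_denoiser(z, t):
    """Stand-in for the trained denoising U-Net: predicts the noise in z
    at step t. Here it simply points away from a fixed 'clean' target."""
    target = np.full_like(z, 0.5)   # pretend 'clean' latent
    return z - target               # fake noise prediction

# Start from pure Gaussian noise in the compressed latent space:
# a 512x512 image corresponds to a 64x64x4 latent.
z = rng.standard_normal((64, 64, 4))

steps = 50
for t in range(steps, 0, -1):
    eps = toy_denoiser(z, t)        # predict the noise at this step
    z = z - eps / steps             # remove a fraction of it

# The latent has now drifted close to the 'clean' target; the real
# pipeline would decode z back to pixels with the VAE at this point.
```

The key point the sketch preserves: generation is iterative noise removal in a small latent tensor, and only the final latent is decoded to pixels.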
Text conditioning via CLIP
Text prompts are converted to embeddings using OpenAI's CLIP text encoder. Those embeddings condition each denoising step, steering the output toward an image that matches the prompt's semantics.
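A toy sketch of that conditioning path, with loud caveats: `toy_text_encoder` is a deterministic stand-in for the real CLIP transformer (it just seeds a random projection from the prompt's characters), and the "denoiser" uses the embedding only to pick a target value. The structure it illustrates is real: the text embedding is computed once, then fed into every denoising step.

```python
import numpy as np

EMB_DIM = 8  # real CLIP ViT-L/14 text embeddings are 768-dimensional

def toy_text_encoder(prompt):
    """Stand-in for the CLIP text encoder: maps a prompt to a fixed-size
    embedding vector. Real SD runs a transformer over the token sequence."""
    seed = sum(ord(c) for c in prompt) % (2**32)
    return np.random.default_rng(seed).standard_normal(EMB_DIM)

def toy_denoiser(z, t, cond):
    """Stand-in for the U-Net: the conditioning vector determines which
    'clean' latent the noise prediction points away from."""
    target = float(np.tanh(cond).mean())  # prompt-dependent target
    return z - target

rng = np.random.default_rng(1)
z = rng.standard_normal((64, 64, 4))
cond = toy_text_encoder("a photo of an astronaut riding a horse")

steps = 50
for t in range(steps, 0, -1):
    eps = toy_denoiser(z, t, cond)  # the embedding conditions every step
    z = z - eps / steps
```

Different prompts yield different embeddings, so the same noise converges toward different outputs, which is the whole mechanism of text steering.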
CFG scale controls prompt adherence
The classifier-free guidance (CFG) scale parameter controls how strictly the model follows your prompt. Higher values produce more literal images but can over-saturate; lower values allow the model more creative latitude.
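Concretely, at each step the denoiser is run twice, once with an empty prompt and once with yours, and the two noise predictions are combined by extrapolation. The combination rule below is the standard classifier-free guidance formula; the tiny vectors are made-up numbers for illustration.

```python
import numpy as np

def apply_cfg(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional noise
    prediction toward the text-conditioned one.
    scale = 1.0 reproduces the conditional prediction exactly;
    scale > 1.0 exaggerates the prompt's influence."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Two noise predictions for the same latent at the same step:
eps_uncond = np.array([0.2, 0.4])   # denoiser run with an empty prompt
eps_cond   = np.array([0.6, 0.0])   # denoiser run with your prompt

mild   = apply_cfg(eps_uncond, eps_cond, 1.0)   # -> [0.6, 0.0]
strong = apply_cfg(eps_uncond, eps_cond, 7.5)   # 7.5 is a common SD default
```

At high scales the extrapolated prediction moves far outside the range of either input, which is why very large CFG values tend to produce over-saturated, artifact-prone images.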
Types & variants
- SD 1.5: The original 2022 model; small, fast, and the base most community LoRAs target.
- SDXL: Higher-resolution successor (1024×1024 native) with better composition and text rendering.
- SD 3 / 3.5: Current generation with improved typography, diverse subjects, and multi-subject prompts.
- SD Turbo: Distilled variant that generates usable images in 1–4 steps for real-time apps.
Common use cases
- Text-to-image generation for illustrations, concept art, and marketing creative
- Fine-tuning on custom styles or brands via LoRAs
- Image-to-image workflows (sketches → finished art, photos → illustrations)
- Inpainting and outpainting for photo editing
- ControlNet-guided generation with pose, depth, or edge conditioning