What is Latent Space?
Raw images are huge (512×512×3 = 786,432 numbers for a single small image). Running diffusion directly on pixels demands far more memory and compute than most hardware can supply. Latent space solves this: a Variational Autoencoder (VAE) compresses each image roughly 48× before the diffusion network ever sees it.
Critically, latent space is smooth and meaningful. Points close to each other in latent space decode to visually similar images. This lets you interpolate between images, edit them mathematically (add a smile by moving in a specific direction), and run generation efficiently on consumer hardware.
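The interpolation claim can be sketched concretely. Because diffusion latents are roughly Gaussian, spherical interpolation (slerp) is commonly preferred over a straight line, which shrinks the latent norm toward the mean. This is a minimal sketch with random stand-in latents; in practice `z0` and `z1` would come from a VAE encoder or a noise sampler.

```python
import numpy as np

def slerp(z0, z1, t):
    """Spherical interpolation between two latent tensors.

    Follows the arc of the hypersphere between z0 and z1 instead of
    the straight chord, which tends to decode to cleaner in-betweens.
    """
    a, b = z0.ravel(), z1.ravel()
    cos_omega = np.clip(
        np.dot(a / np.linalg.norm(a), b / np.linalg.norm(b)), -1.0, 1.0)
    omega = np.arccos(cos_omega)
    so = np.sin(omega)
    if so < 1e-8:  # nearly parallel: fall back to plain lerp
        return (1.0 - t) * z0 + t * z1
    return (np.sin((1.0 - t) * omega) / so) * z0 + (np.sin(t * omega) / so) * z1

# Two random "latents" shaped like a Stable Diffusion 512x512 encoding.
rng = np.random.default_rng(0)
a_lat = rng.standard_normal((4, 64, 64))
b_lat = rng.standard_normal((4, 64, 64))
midpoint = slerp(a_lat, b_lat, 0.5)  # decode this to see an in-between image
```

Sweeping `t` from 0 to 1 and decoding each result yields the smooth image-to-image animations described above.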
How it works
VAE encoding
A Variational Autoencoder compresses each 512×512×3 image into a 64×64×4 latent tensor — 48× smaller. The diffusion process runs on this compact representation, then the VAE decoder expands back to pixels at the end.
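The 48× figure follows directly from the shapes in the paragraph above:

```python
# Shapes from the text: a 512x512 RGB image vs. its 64x64x4 latent.
pixel_values = 512 * 512 * 3    # 786,432 numbers per image
latent_values = 64 * 64 * 4     # 16,384 numbers per latent
ratio = pixel_values // latent_values
print(pixel_values, latent_values, ratio)  # 786432 16384 48
```

Every diffusion step therefore touches 48× fewer values, which is what makes 512×512 generation feasible on a single consumer GPU.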
Semantic structure
Latent space is not just compressed pixels — it's semantically structured. Similar concepts cluster together, so mathematical operations (interpolation, vector arithmetic) produce meaningful image changes.
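One common way to exploit this structure is attribute vector arithmetic: average the latents of images with an attribute, subtract the average of latents without it, and push a new latent along that direction. The sketch below uses random arrays as stand-ins for encoded images, and the "smile" attribute is a hypothetical label for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
shape = (4, 64, 64)  # one Stable Diffusion-style latent

# Stand-ins for VAE-encoded images; real code would encode two labeled sets.
smiling = rng.standard_normal((10, *shape)) + 0.5  # pretend "smiling" cluster
neutral = rng.standard_normal((10, *shape)) - 0.5  # pretend "neutral" cluster

# Semantic direction: difference of the cluster means.
smile_direction = smiling.mean(axis=0) - neutral.mean(axis=0)

# Editing: move a neutral latent along the direction, then decode it.
edited = neutral[0] + 1.5 * smile_direction
```

The scale factor (1.5 here) controls edit strength; too large a step leaves the region the decoder was trained on and produces artifacts.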
Common use cases
- Running diffusion models on consumer GPUs (all Stable Diffusion variants)
- Smooth interpolation between images for animation
- Semantic editing (add/remove attributes via latent-space directions)
- Inpainting — re-generating specific regions while preserving context
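The inpainting case from the list can be sketched as a masked blend: at each denoising step, keep the model's prediction inside the edit region and the (re-noised) original latents outside it, so context is preserved exactly. This is a toy numpy version of that blending step, with random arrays standing in for real latents.

```python
import numpy as np

def masked_blend(denoised, original_noised, mask):
    """One inpainting step in latent space: model output inside the
    mask, re-noised original latents everywhere else."""
    return mask * denoised + (1.0 - mask) * original_noised

rng = np.random.default_rng(2)
original = rng.standard_normal((4, 64, 64))   # latents of the source image
denoised = rng.standard_normal((4, 64, 64))   # model's prediction this step

mask = np.zeros((1, 64, 64))                  # 1 = region to re-generate
mask[:, 16:48, 16:48] = 1.0
blended = masked_blend(denoised, original, mask)
```

Repeating this blend at every sampling step is the core of latent-space inpainting: only the masked region evolves, while the surrounding latents stay pinned to the source image.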