What is ControlNet?
Pure text-to-image models struggle with precise spatial control. If you prompt 'a person dancing' you get a random pose. ControlNet solves this by adding a second input stream — an image-derived condition like a stick-figure pose skeleton — that guides the generation's spatial structure while the text prompt controls style and content.
Introduced in early 2023, ControlNet made Stable Diffusion practical for production use cases that need reproducible layouts: product photography, architectural visualization, character sheets, and pose-matched series of images.
How it works
Control image extraction
A pre-processor converts an input (usually a reference photo) into a structural representation — a pose skeleton from OpenPose, depth map from MiDaS, or Canny edges from OpenCV.
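As a concrete sketch of this extraction step, the snippet below turns a reference image into an edge-based control map. Real pipelines use OpenCV's `cv2.Canny` or a depth/pose model; here a simple gradient-magnitude threshold stands in for the edge detector so the example stays dependency-free, and the threshold value is an arbitrary assumption.

```python
import numpy as np

# Simplified stand-in for a Canny-style edge pre-processor.
def edge_control_map(gray: np.ndarray, thresh: float = 50.0) -> np.ndarray:
    """Convert a grayscale reference photo into an edge control image."""
    gy, gx = np.gradient(gray.astype(float))   # per-pixel intensity gradients
    mag = np.hypot(gx, gy)                     # gradient magnitude
    edges = (mag > thresh).astype(np.uint8) * 255
    # ControlNet conditioning images are typically 3-channel, so stack the map.
    return np.stack([edges] * 3, axis=-1)

# Synthetic reference: a bright square on a dark background.
img = np.zeros((64, 64), dtype=np.uint8)
img[16:48, 16:48] = 255
control = edge_control_map(img)
print(control.shape)  # (64, 64, 3)
```

The output keeps only the square's silhouette: edge pixels are 255, uniform regions are 0, which is exactly the structural skeleton the generation step will follow.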
Trainable branch
ControlNet clones the Stable Diffusion encoder into a trainable branch while the original weights stay frozen. The control image flows through this branch, and its outputs are added into the frozen model's denoising network through zero-initialized convolutions, so at the start of training the injection is zero and the base model's behavior is unchanged.
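The key property of this design can be shown in a few lines of numpy. This is a conceptual sketch only: dense matrices stand in for the U-Net encoder blocks, and the names below are assumptions, not the library's API. The point is that the zero-initialized projection makes the controlled model identical to the base model at initialization.

```python
import numpy as np

rng = np.random.default_rng(0)

W_frozen = rng.normal(size=(8, 8))   # stands in for the frozen encoder
W_branch = W_frozen.copy()           # trainable clone of the encoder
W_zero = np.zeros((8, 8))            # zero-initialized "zero convolution"

def base_features(x):
    return W_frozen @ x

def controlled_features(x, control):
    # The control signal flows through the cloned branch, then through
    # the zero conv, and is added to the frozen model's features.
    residual = W_zero @ (W_branch @ (x + control))
    return base_features(x) + residual

x = rng.normal(size=8)
control = rng.normal(size=8)
# At initialization the zero conv cancels the branch entirely:
print(np.allclose(controlled_features(x, control), base_features(x)))  # True
```

As training updates `W_zero` away from zero, the control signal gradually gains influence, which is why ControlNet fine-tuning is stable even on small datasets.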
Multi-ControlNet composition
Multiple ControlNets can be applied simultaneously — e.g., combining a depth map (for scene geometry) with a pose skeleton (for the subject), giving you control over both.
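Composition amounts to summing each branch's residual into the shared features, with a per-branch weight. The sketch below is an assumed simplification (linear maps stand in for full control branches); the per-branch scales mirror the idea behind diffusers' `controlnet_conditioning_scale` parameter.

```python
import numpy as np

rng = np.random.default_rng(1)
W_depth = rng.normal(size=(8, 8))  # stands in for a depth ControlNet branch
W_pose = rng.normal(size=(8, 8))   # stands in for a pose ControlNet branch

def compose(x, depth_map, pose_map, scales=(1.0, 1.0)):
    """Add weighted residuals from each control branch to the features."""
    r_depth = W_depth @ depth_map
    r_pose = W_pose @ pose_map
    return x + scales[0] * r_depth + scales[1] * r_pose

x = rng.normal(size=8)
depth_map = rng.normal(size=8)
pose_map = rng.normal(size=8)

full = compose(x, depth_map, pose_map)
# Zeroing one scale removes that branch's influence entirely:
depth_only = compose(x, depth_map, pose_map, scales=(1.0, 0.0))
print(np.allclose(depth_only, x + W_depth @ depth_map))  # True
```

In practice the scales let you trade off how strictly each condition is enforced, e.g. strong depth guidance with only loose pose guidance.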
Types & variants
- OpenPose: matches human body poses; ideal for character consistency and action shots.
- Canny: matches edges; best for preserving product or building silhouettes.
- Depth: matches scene depth; preserves 3D layout while changing style or subject.
- Lineart: matches clean line drawings; converts sketches to finished renders.
- Scribble: matches rough scribbles; loosest control, most creative latitude.
Common use cases
- Generating product photography with a consistent pose across multiple AI images
- Converting rough sketches into polished illustrations without losing composition
- Character sheets with the same pose from different angles
- Architectural visualization preserving building proportions while changing materials