What is ControlNet?
Text prompts are approximate. "A person walking through a forest" does not tell a video generation model where the person is in the frame, how they move, or what their pose looks like at each keyframe. ControlNet provides the structural precision that prompts alone cannot.
Definition
ControlNet is a neural network architecture that adds conditional structural control to diffusion models. It accepts spatial guidance signals — depth maps, edge maps, pose skeletons, segmentation masks, optical flow — and uses them to constrain the structure of generated outputs beyond what text prompting achieves.
The original paper by Zhang et al. (2023) demonstrated that a trainable copy of a diffusion model's encoder, conditioned on a structural input, can be attached without degrading the base model's generation quality.
Why ControlNet exists
Text-to-video models are powerful at generating plausible content, but they have limited geometric precision. A prompt like "close-up of hands playing piano" yields a believable result, but the hand positions, the camera angle, and the spatial relationship between subject and background are largely determined by the training data distribution.
For production use cases, this is not sufficient. A pre-visualization shot needs the camera at a specific position. A product ad needs the object in a specific part of the frame. Character animation needs consistent pose at each keyframe. ControlNet provides that precision by conditioning generation on explicit spatial structure.
How ControlNet works
ControlNet duplicates the encoder portion of the base diffusion model. This copy is trained on pairs of (conditioning signal, output), where the conditioning signal is the structural guide and the output is the generated content that matches it.
The architectural innovation is "zero convolutions": the connections between the ControlNet encoder and the base model start at zero weight. At the beginning of training, ControlNet has no effect. As training progresses, it learns to inject structural information gradually, without disrupting the pretrained model's capabilities.
At inference, you provide both the base prompt and the structural conditioning signal. The base model generates content; the ControlNet encoder modifies the generation at each step to respect the structural constraint.
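The zero-convolution mechanism described above can be sketched numerically. This is a toy illustration in NumPy, not the actual ControlNet implementation: `conv1x1`, the feature shapes, and the weight values are all made up for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, w):
    """Apply a 1x1 convolution (pure channel mixing) to features of shape (C, H, W)."""
    return np.einsum("oc,chw->ohw", w, x)

# Toy feature maps: 4 channels on an 8x8 spatial grid.
base_features = rng.standard_normal((4, 8, 8))      # base model's encoder features
control_features = rng.standard_normal((4, 8, 8))   # encoded conditioning signal

# Zero convolution: the connecting weights start at exactly zero.
zero_conv_w = np.zeros((4, 4))

# At the start of training, the injected residual is zero, so the
# base model's features pass through completely unchanged.
injected = base_features + conv1x1(control_features, zero_conv_w)
assert np.allclose(injected, base_features)

# As training updates the weights away from zero, structural
# information from the control branch flows in gradually.
trained_w = 0.1 * rng.standard_normal((4, 4))
injected = base_features + conv1x1(control_features, trained_w)
assert not np.allclose(injected, base_features)
```

This is why the base model is undisturbed at the start of training: the residual connection contributes exactly zero until the zero-convolution weights are learned.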
Types of control signals
Depth maps encode the relative distance of objects from the camera. Used to control foreground-background separation and 3D layout.
Edge maps (Canny, HED) encode object boundaries and contours. Used for precise shape control.
Pose skeletons (OpenPose) encode body keypoints. Used for character pose control in animation and video.
Segmentation masks encode semantic regions. Used to specify which areas of the frame contain which content types.
Optical flow encodes motion between frames. Used to control motion direction and speed in video generation.
Normal maps encode surface orientation. Used for lighting and material consistency.
Multiple signals can be combined at inference, with separate weights controlling how much each guide influences the output.
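One common way to combine signals is a weighted sum of the per-signal residuals before they are injected into the base model. The sketch below assumes that framing; the function name and dict layout are illustrative, not any particular library's API.

```python
import numpy as np

rng = np.random.default_rng(1)

def combine_control_residuals(residuals, weights):
    """Weighted sum of per-signal feature residuals.

    residuals: dict mapping signal name -> residual array of shape (C, H, W)
    weights:   dict mapping signal name -> guidance weight (default 1.0)
    """
    combined = np.zeros_like(next(iter(residuals.values())))
    for name, res in residuals.items():
        combined += weights.get(name, 1.0) * res
    return combined

# Residuals produced by three hypothetical ControlNet branches.
residuals = {
    "depth": rng.standard_normal((4, 8, 8)),
    "canny": rng.standard_normal((4, 8, 8)),
    "pose":  rng.standard_normal((4, 8, 8)),
}

# Emphasize depth layout, keep edges moderate, de-emphasize pose.
weights = {"depth": 1.0, "canny": 0.6, "pose": 0.3}
combined = combine_control_residuals(residuals, weights)
```

Raising one signal's weight makes the output follow that guide more strictly at the cost of the others, which is the trade-off the per-signal weights expose.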
A brief history
"Adding Conditional Control to Text-to-Image Diffusion Models" by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala (Stanford University) appeared in February 2023. It became one of the most rapidly adopted techniques in the generative image community, with community-trained ControlNet models for dozens of control signal types appearing within weeks of publication. Extension to video followed, with temporal variants maintaining consistency across frames.
ControlNet for video generation
Video ControlNet extends the image-level approach to handle temporal control. The conditioning signals now operate across frames: a sequence of depth maps or pose skeletons specifying how structure should evolve over time.
This enables precise motion, camera behavior, and subject positioning that is not achievable through text alone. For pre-visualization workflows, it provides storyboard-level control over generated shots. For character animation, it enables motion-capture-driven generation.
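For keyframe-driven workflows like the ones above, sparse pose keyframes must be expanded into a dense per-frame conditioning sequence. A minimal sketch, assuming simple linear interpolation between keyframes (the function and data layout are illustrative, not a real pipeline's API):

```python
import numpy as np

def interpolate_pose_track(keyframes, num_frames):
    """Linearly interpolate sparse pose keyframes into a dense per-frame track.

    keyframes: dict mapping frame index -> (K, 2) array of 2D keypoints
    Returns an array of shape (num_frames, K, 2): one pose skeleton per frame.
    """
    idxs = sorted(keyframes)
    stacked = np.stack([keyframes[i] for i in idxs])   # (n_keyframes, K, 2)
    track = np.empty((num_frames, *stacked.shape[1:]))
    for t in range(num_frames):
        for k in range(stacked.shape[1]):
            for d in range(2):
                # Interpolate each coordinate over the keyframe indices.
                track[t, k, d] = np.interp(t, idxs, stacked[:, k, d])
    return track

# Two keyframes for a 2-keypoint skeleton, expanded to 5 frames.
keyframes = {
    0: np.array([[0.0, 0.0], [1.0, 1.0]]),
    4: np.array([[4.0, 0.0], [5.0, 1.0]]),
}
track = interpolate_pose_track(keyframes, num_frames=5)

# The middle frame sits halfway between the two keyframes.
assert np.allclose(track[2], [[2.0, 0.0], [3.0, 1.0]])
```

The resulting per-frame skeleton sequence is what a temporal ControlNet variant would consume, one conditioning frame per generated frame.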
How LTX-2 uses ControlNet
LTX-2 supports structural conditioning through its conditioning signal architecture, which accepts spatial guidance inputs alongside text and image prompts. Spatiotemporal attention allows conditioning signals to influence per-frame structure and temporal motion coherently.
For developers integrating via the LTX-2 API, structural conditioning is exposed at the generation endpoint. Depth maps, edge maps, and pose guides can be passed as conditioning inputs alongside the primary prompt, with separate guidance scale controls for each signal.
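A request to such an endpoint might be assembled as shown below. Every field name here (`prompt`, `conditioning`, `type`, `data`, `guidance_scale`) is an assumption for illustration, not the documented LTX-2 API schema; consult the actual API reference for the real field names.

```python
import json

def build_generation_request(prompt, conditioning):
    """Assemble a generation payload with per-signal guidance scales.

    conditioning: list of dicts like
        {"type": "depth", "data": "<base64 frames>", "guidance_scale": 0.8}
    NOTE: this schema is hypothetical, for illustration only.
    """
    return {"prompt": prompt, "conditioning": conditioning}

payload = build_generation_request(
    prompt="close-up of hands playing piano",
    conditioning=[
        {"type": "depth", "data": "<base64-encoded depth maps>", "guidance_scale": 0.9},
        {"type": "pose", "data": "<base64-encoded pose skeletons>", "guidance_scale": 0.6},
    ],
)
body = json.dumps(payload)  # serialized request body
```

The key design point the source describes survives any schema difference: each conditioning input carries its own guidance scale, so depth, edge, and pose guides can be weighted independently per request.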