What is classifier-free guidance?
Nearly every text-to-video model exposes a CFG scale parameter. Most users learn quickly that turning it up increases prompt adherence and turning it down makes outputs more varied. What that parameter is actually doing is less widely understood.
Definition
Classifier-free guidance (CFG) is a technique that amplifies the influence of a conditioning signal (typically a text prompt) on a generative model's output. It was introduced by Ho and Salimans in 2022 as a simplified alternative to classifier guidance that does not require a separate classifier model.
A CFG scale of 1.0 returns the conditioned output unchanged, while a scale of 0 ignores the conditioning entirely. Values above 1 increasingly push the generation toward the conditioning signal. The typical range in practice is 3–15, with the optimal value depending on the model and the task.
The problem CFG solves
Diffusion models learn a probability distribution over data (video, images). At inference, sampling from this distribution produces diverse, plausible outputs. But diverse is not always what you want. If you write a specific prompt, you want the output to match it.
A naive way to increase prompt adherence is to weight the conditioning signal more heavily during training. But this reduces diversity and tends to produce saturated, over-sharpened outputs that feel artificial.
CFG provides a better tradeoff: it lets you control prompt adherence at inference time, without changing the model, and without the quality degradation of aggressive conditioning during training.
How CFG works
During training, CFG randomly drops the conditioning signal for a fraction of examples (typically around 10%), training the model to generate both conditioned and unconditioned outputs from the same set of weights.
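A minimal sketch of this conditioning dropout. The names (`NULL_EMBEDDING`, `maybe_drop_conditioning`, `drop_prob`) are illustrative, not from any particular training codebase:

```python
import random

# Stand-in for the learned "no conditioning" (null) embedding.
NULL_EMBEDDING = None

def maybe_drop_conditioning(text_embedding, drop_prob=0.1, rng=random):
    """With probability drop_prob, replace the prompt embedding with the
    null embedding, so the same weights learn unconditional generation."""
    if rng.random() < drop_prob:
        return NULL_EMBEDDING
    return text_embedding
```

At inference, the same null embedding is what the unconditioned forward pass receives.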
At inference, two forward passes are run: one with the prompt (conditioned generation) and one without (unconditioned generation). The final output is extrapolated in the direction from unconditioned toward conditioned, by the CFG scale factor:
output = unconditioned + scale × (conditioned - unconditioned)
At scale = 1, this returns the conditioned output directly. At scale > 1, the output is pushed further in the direction of the conditioning signal than the model would naturally go. This "overshoot" produces outputs that match the prompt more closely, at the cost of some diversity and, at very high scales, visual artifacts.
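The extrapolation formula above, written as a function (scalar values here for clarity; in a real sampler the same arithmetic runs elementwise over the model's predicted tensors at each denoising step):

```python
def cfg_combine(unconditioned, conditioned, scale):
    """Classic CFG update: start from the unconditioned prediction and
    extrapolate toward the conditioned one by the guidance scale."""
    return unconditioned + scale * (conditioned - unconditioned)

# At scale 1 the conditioned prediction comes back unchanged;
# at scale > 1 the output overshoots past it.
cfg_combine(0.0, 1.0, 1.0)  # → 1.0
cfg_combine(0.0, 1.0, 7.5)  # → 7.5
```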
Choosing a CFG scale
Lower values (3–5) produce more varied outputs that loosely follow the prompt. Good for creative exploration where you want the model to interpret the prompt broadly.
Mid values (7–10) produce reliable prompt adherence with good output quality. The standard range for most generation tasks.
Higher values (12+) produce tight prompt adherence but can cause oversaturation, color banding, or unnatural sharpness. Useful when exact prompt matching matters more than visual naturalness.
CFG variants
Negative prompting extends CFG by replacing the unconditional baseline with a negative prompt: instead of "no conditioning," the model is pushed away from a specific description. This allows more precise control over what the output avoids.
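In the formula, this amounts to swapping the unconditioned baseline for the negative-prompt prediction. A sketch, reusing the same extrapolation as plain CFG:

```python
def negative_prompt_combine(negative, conditioned, scale):
    """Negative prompting: identical extrapolation to CFG, but the baseline
    is the prediction for the negative prompt rather than 'no conditioning',
    so the output is pushed away from the negative description."""
    return negative + scale * (conditioned - negative)
```

With an empty negative prompt, this reduces to ordinary CFG.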
Perturbed Attention Guidance (PAG) and other CFG alternatives aim to provide similar guidance effects with better quality at high scale values.
LTX-2's Multimodal Guider separates CFG into two independent parameters: text guidance strength and cross-modal alignment strength. These can be controlled independently, allowing you to increase prompt adherence without affecting audio-video synchronization.
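LTX-2's internal formulation is not spelled out here, but a common way to combine two independent guidance signals is to add two separate extrapolation terms to one unconditioned baseline. This is an illustrative sketch of that general pattern, not LTX-2's actual implementation:

```python
def dual_guidance(uncond, text_cond, av_cond, text_scale, av_scale):
    """Two independent guidance terms over one unconditioned baseline.
    text_scale and av_scale stand in for separate text-guidance and
    cross-modal-alignment strengths (illustrative names, not the real API)."""
    return (uncond
            + text_scale * (text_cond - uncond)
            + av_scale * (av_cond - uncond))
```

Because each term has its own scale, raising `text_scale` sharpens prompt adherence without touching the cross-modal term.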
How to use CFG with LTX-2
The CFG scale is a standard parameter in the LTX-2 API. For most generation tasks, starting at 7–9 produces reliable results. If outputs are over-saturated or show artifacts, lower the scale. If outputs diverge too much from the prompt, raise it.
The Multimodal Guider parameters, exposed for audio-conditioned generation, follow the same principle: tune text guidance and cross-modal strength independently until outputs match both prompt and audio as intended. Both parameters are also accessible for local generation in LTX Desktop.