
How To Improve LTX-2.3 Prompt Adherence: Tips for More Accurate AI Video

Get better results from LTX-2.3 with prompt techniques that improve adherence, motion accuracy, and visual fidelity in AI video generation.

LTX Team
Key Takeaways:
  • LTX-2.3 processes prompts through Gemma 3 across both video and audio generation simultaneously, meaning specific, structured descriptions get prioritized while vague language gets diluted — write like a cinematographer, not a poet.
  • The most effective prompt structure is: main action first, precise motion details second, character and environment third, explicit camera and lighting last — kept under 200 words with one main action per 2–3 seconds of video.
  • Prompt text and guidance parameters work together: cfg_scale controls how strongly the model follows your description, stg_scale controls temporal coherence, and image conditioning or camera LoRAs provide structural anchoring when prompts alone aren't enough.

You write a detailed prompt, hit generate, and the output ignores half of what you asked for. The camera moves when you wanted it static. The character changes appearance mid-clip. The lighting shifts from warm to cold for no reason.

Prompt adherence is the gap between what you describe and what the model actually produces. In text-to-video prompts, that gap is wider than in image generation because the model has to maintain consistency across dozens of frames while interpreting motion, timing, and spatial relationships from text alone.

LTX-2.3 processes prompts through Gemma 3, a multilingual text encoder with multi-layer feature aggregation and learnable registers (thinking tokens) that produce separate embeddings for video and audio conditioning. Understanding how this pipeline works, and writing prompts that align with it, is the difference between results that land and results that drift.

How LTX-2.3 Reads Your Prompt

LTX-2.3 is a DiT-based audio-video foundation model with 14 billion parameters for video and 5 billion for audio, processing 48 shared transformer blocks with cross-modal attention. The text encoder converts your prompt into embeddings that guide both modalities simultaneously.

This means every word in your prompt competes for attention across both the visual and audio generation paths. Vague descriptions get diluted. Specific, structured descriptions get prioritized because the model can map them cleanly to its internal representation.

The practical takeaway: write like a cinematographer describing a shot list, not like a poet describing a feeling.

The Prompt Structure That Works

The official LTX-2.3 Prompt Guide recommends building prompts around a specific structure. Here is what the documentation says works best:

Start With the Main Action

Open with a single sentence describing the primary action. This anchors the model’s generation around your core intent before anything else competes for attention.

Weak: “A beautiful scene with a person walking through a forest with birds and sunlight.”

Strong: “A woman in a red jacket walks steadily along a narrow dirt trail through dense pine trees.”

The strong version gives the model one clear subject, one clear action, and a specific environment. The weak version asks the model to juggle multiple elements without hierarchy.

Add Specific Movement and Gesture Details

After establishing the main action, layer in precise motion descriptions. AI video generation prompts work best when movements are literal and chronological rather than abstract.

Instead of “she moves gracefully,” write “she shifts her weight to her left foot and turns her head slowly toward the camera.” The model can parse physical mechanics. It struggles with subjective qualities.

Describe Appearances and Environment

Character descriptions should be concrete: hair color, clothing, posture, distinguishing features. Background details should place the subject in a specific environment with enough detail to anchor the scene but not so much that the model splits focus.

“A man in his 30s with short dark hair, wearing a gray crew-neck sweater, sits at a wooden desk in a dimly lit home office with bookshelves behind him.”

Specify Camera and Lighting

Camera direction is one of the most impactful elements for prompt adherence. LTX-2.3 supports camera motion LoRAs for movements like dolly in, dolly out, dolly left, dolly right, jib up, jib down, and static shots. When using camera LoRAs, explicitly describe the intended movement in your prompt.

For lighting, be specific: “warm golden-hour sunlight from the left” is more useful than “nice lighting.”

Keep It Under 200 Words

The documentation recommends keeping prompts within 200 words. Longer prompts tend to dilute the model’s attention, causing it to lose track of earlier instructions by the time it processes later ones. This is especially true for motion-heavy scenes where temporal consistency already taxes the generation pipeline.
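
Putting the full structure together, a complete prompt built from the examples above might read:

"A woman in a red jacket walks steadily along a narrow dirt trail through dense pine trees. She shifts her weight with each step, ducking under a low branch, then turns her head slowly toward the camera. Her dark hair is tied back, and her boots kick up dust on the dry trail. Static camera at a medium distance. Warm golden-hour sunlight falls from the left, filtering through the pines."

At roughly 70 words, this covers action, motion, appearance, environment, camera, and lighting without crowding the model's attention.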

Techniques That Improve Prompt Accuracy

Write Chronologically

Describe actions in the order they should happen. The model processes your prompt linearly and maps it to the temporal dimension of the video. If you describe the ending before the beginning, the output may scramble the sequence.

“The camera holds on a close-up of a coffee cup on a wooden table. Steam rises slowly. A hand enters from the right and wraps around the cup. The hand lifts it out of frame.”

Each sentence maps to a temporal beat. The model can follow this because the structure mirrors the generation flow.

Be Literal, Not Metaphorical

Text-to-video prompts should describe what the camera would physically see, not what the scene means thematically. “A tense confrontation” gives the model nothing to work with. “Two men stand three feet apart, both leaning forward with clenched fists at their sides” gives it specific spatial relationships and body language.

Use the Automatic Prompt Enhancement Feature

LTX-2.3 pipelines support automatic prompt enhancement via an enhance_prompt parameter. This passes your prompt through an enhancement step that refines the input for better model adherence. It is useful when you want to explore variations quickly, as the enhancement step may produce different refinements depending on the generation seed.

For maximum creative control, bypass the enhancer and write raw prompts directly using the structure above.
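
In code, the toggle might look like the sketch below. This is illustrative, not the exact ltx-pipelines API: the import path, loader, and call signature are assumptions; only the enhance_prompt parameter itself comes from the documentation.

```python
# Illustrative sketch only: pipeline class usage and call signature are
# assumptions; the enhance_prompt parameter is the documented feature.
from ltx_pipelines import TI2VidTwoStagesPipeline  # assumed import path

pipeline = TI2VidTwoStagesPipeline.from_pretrained(  # assumed loader
    "ltx-2.3-22b-dev.safetensors"
)

# Fast exploration: the enhancer may refine the prompt differently per seed.
video = pipeline(
    prompt="A woman in a red jacket walks along a narrow dirt trail.",
    enhance_prompt=True,
    seed=42,
)

# Maximum control: bypass the enhancer and supply the raw structured prompt.
video = pipeline(
    prompt=(
        "A woman in a red jacket walks steadily along a narrow dirt trail "
        "through dense pine trees. Static camera, warm golden-hour light."
    ),
    enhance_prompt=False,
    seed=42,
)
```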

Match Prompt Complexity to Video Length

Shorter clips need simpler prompts. A 2-second clip (roughly 49 frames at 25fps, satisfying the frame count constraint where (F-1) % 8 == 0) should describe one action with one camera angle. Longer clips can handle more scene elements, but each new element you add dilutes attention slightly.

A good rule: one main action per 2-3 seconds of video. If your prompt describes five distinct actions, the output will likely compress or skip some of them.
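
The frame-count constraint is easy to satisfy programmatically. Here is a small standalone Python helper (not part of any LTX API) that rounds a target duration to the nearest valid frame count:

```python
def valid_frame_count(seconds: float, fps: int = 25) -> int:
    """Round a target duration to the nearest frame count F that
    satisfies the LTX constraint (F - 1) % 8 == 0."""
    target = seconds * fps
    # Valid counts form the sequence 1, 9, 17, 25, ...
    k = round((target - 1) / 8)
    return max(1, 8 * k + 1)

print(valid_frame_count(2.0))  # 49 -> roughly 2 seconds at 25fps
print(valid_frame_count(4.0))  # 97
```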

Common Prompt Mistakes and How to Fix Them

Overloading With Too Many Actions

“A man walks into a room, sits down, opens a laptop, starts typing, then looks up and smiles at someone off-camera” packs five distinct actions into one prompt. The model will likely compress the middle actions or skip them entirely.

Fix: Focus on one or two key actions per generation. Use the retake pipeline to regenerate specific segments if you need a longer, multi-action sequence.

Vague or Abstract Descriptions

“An emotional scene in a beautiful setting” gives the model almost no actionable information. Every word in your prompt should map to something the model can render: a color, a position, a movement, a material, a light source.

Ignoring Camera Direction

If you do not specify camera behavior, the model improvises its own. This often means unnecessary camera drift or random angle changes that break the coherence of your scene. Always specify: static, slow pan, tracking shot, close-up, wide shot.

Conflicting Instructions

“A person running quickly through a field in slow motion” sends contradictory signals. The model has to choose between “running quickly” (fast motion) and “slow motion” (temporal effect). If you want slow-motion capture of fast action, describe the visual result: “A person mid-stride, suspended in slow motion, each step drawn out across seconds.”

Guidance Parameters: The Technical Controls for Prompt Adherence

Prompt adherence in LTX-2.3 is controlled by both the prompt text and the guidance parameters in MultiModalGuiderParams. A well-written prompt with suboptimal guidance parameters will still produce poor adherence. Understanding these controls is essential for getting the most out of your prompts.

cfg_scale (Prompt Adherence Strength)

cfg_scale is the direct numerical control for how strongly the model follows your prompt. Higher values produce output that adheres more closely to your description; lower values give the model more creative latitude. The typical range is 2.0–5.0 for video, with audio set higher: the documented example configuration uses cfg_scale=3.0 for video and cfg_scale=7.0 for audio, since the audio channel needs to follow timing and content cues more precisely.

If your output feels like it is ignoring parts of your prompt, increasing cfg_scale is often the first thing to try. If the output looks over-constrained or unnatural, reducing it gives the model more flexibility.

stg_scale (Temporal Coherence)

stg_scale controls Spatio-Temporal Guidance, which influences temporal coherence across frames. Higher values improve frame-to-frame consistency; lower values allow more frame-level variation. Typical range: 0.5–1.5. This parameter is directly relevant to the camera drift and character consistency issues described in prompting guides: if a character’s appearance drifts between frames despite a detailed prompt, stg_scale is the lever to adjust.

rescale_scale (Saturation Control)

rescale_scale prevents over-saturation at high CFG values. When you push cfg_scale high for stronger prompt adherence, colors and contrast can become overly intense. rescale_scale (typical value: 0.7) counteracts this, maintaining natural-looking output even at high guidance strengths.
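
Putting the three controls together, a configuration using the documented example values might look like the sketch below. MultiModalGuiderParams, cfg_scale, stg_scale, and rescale_scale come from the documentation; the import path and constructor usage are assumptions, and stg_scale=1.0 is simply the midpoint of the typical range.

```python
# Documented values: cfg_scale=3.0 (video), cfg_scale=7.0 (audio),
# rescale_scale=0.7. Import path and constructor usage are assumptions.
from ltx_pipelines import MultiModalGuiderParams  # assumed import path

video_guidance = MultiModalGuiderParams(
    cfg_scale=3.0,      # prompt adherence strength for video
    stg_scale=1.0,      # spatio-temporal coherence (typical range 0.5-1.5)
    rescale_scale=0.7,  # keeps colors natural at high CFG
)

audio_guidance = MultiModalGuiderParams(
    cfg_scale=7.0,      # audio follows timing and content cues more strictly
    stg_scale=1.0,
    rescale_scale=0.7,
)
```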

For full parameter reference and advanced guidance configurations, see the ltx-pipelines documentation.

Advanced Tips for Power Users

Using Image Conditioning for Better Adherence

When prompt-only generation does not achieve the visual consistency you need, use the image-to-video pipeline. Providing a reference image as the first frame anchors the model’s generation to a specific visual starting point, dramatically improving consistency for character appearance, environment, and lighting.

The TI2VidTwoStagesPipeline supports image conditioning: the encoded image replaces the latent at a specific frame, giving you strong control over exactly what that frame looks like.
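
A minimal sketch of what first-frame conditioning might look like in practice. The pipeline class comes from the documentation, but the argument names (image, conditioning_frame) and loading details are assumptions; consult the ltx-pipelines docs for exact signatures.

```python
# Sketch of first-frame conditioning; argument names are assumed.
from PIL import Image
from ltx_pipelines import TI2VidTwoStagesPipeline  # assumed import path

pipeline = TI2VidTwoStagesPipeline.from_pretrained(
    "ltx-2.3-22b-dev.safetensors"
)

reference = Image.open("first_frame.png")  # anchors character and lighting

video = pipeline(
    prompt="The woman in the red jacket turns her head toward the camera.",
    image=reference,       # encoded and swapped in as the conditioned latent
    conditioning_frame=0,  # which frame the image latent replaces (assumed)
    num_frames=49,
)
```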

Combining LoRAs With Prompts

LTX-2.3 supports several pre-trained LoRA adapters that complement your prompts. Camera control LoRAs (dolly in, dolly out, dolly left, dolly right, jib up, jib down, static) give you precise motion control that prompts alone cannot reliably achieve. IC-LoRA adapters like Union Control, Motion Track Control, Pose Control, and Detailer enable video-to-video transformations with reference-driven generation.

When using camera LoRAs, always describe the camera movement in your prompt text as well. The LoRA handles the mechanical execution; the prompt provides the model with semantic context about what the movement should look like.
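
As a sketch, pairing a camera LoRA with matching prompt language might look like this; the load_lora method and adapter identifier are assumptions for illustration.

```python
# Sketch: the LoRA executes the motion; the prompt repeats it semantically.
# The load_lora method and adapter name are illustrative assumptions.
from ltx_pipelines import TI2VidTwoStagesPipeline  # assumed import path

pipeline = TI2VidTwoStagesPipeline.from_pretrained(
    "ltx-2.3-22b-dev.safetensors"
)
pipeline.load_lora("ltx-2.3-camera-dolly-in")  # assumed adapter identifier

video = pipeline(
    prompt=(
        "Slow dolly in toward a man in his 30s with short dark hair, "
        "wearing a gray crew-neck sweater, sitting at a wooden desk in a "
        "dimly lit home office with bookshelves behind him."
    ),
    num_frames=97,  # satisfies (F - 1) % 8 == 0
)
```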

Using Dev vs. Distilled Models for Different Prompt Styles

LTX-2.3 offers two complementary model paths. The Dev model (ltx-2.3-22b-dev.safetensors) is the full-fidelity pipeline supporting multi-stage sampling, multimodal guidance (CFG, STG, modality CFG), and IC-LoRA workflows. It prioritizes control and stability.

The Distilled model (ltx-2.3-22b-distilled.safetensors) uses 8 predefined sigmas for the fastest possible inference. It is ideal for rapid prompt iteration when you are still experimenting with wording and structure.

A practical workflow: iterate on prompt wording with Distilled (fast feedback loops), then switch to Dev for the final render once your prompt is dialed in. Keep the random seed fixed to isolate prompt changes from generation variance.
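
A hedged sketch of that workflow, assuming the same illustrative pipeline interface as above (only the checkpoint filenames come from the documentation):

```python
# Fixed seed isolates prompt changes from generation variance.
from ltx_pipelines import TI2VidTwoStagesPipeline  # assumed import path

SEED = 1234
prompt = (
    "A woman in a red jacket walks steadily along a narrow dirt trail "
    "through dense pine trees. Static camera, warm golden-hour light."
)

# Iterate on wording with the distilled checkpoint (8 sigmas, fast).
draft_pipe = TI2VidTwoStagesPipeline.from_pretrained(
    "ltx-2.3-22b-distilled.safetensors"
)
preview = draft_pipe(prompt=prompt, seed=SEED)

# Final render with the dev checkpoint (full multi-stage sampling).
final_pipe = TI2VidTwoStagesPipeline.from_pretrained(
    "ltx-2.3-22b-dev.safetensors"
)
final = final_pipe(prompt=prompt, seed=SEED)
```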

Putting It All Together

Better prompt adherence is not about writing longer prompts. It is about writing prompts that align with how LTX-2.3 processes information: one clear action, described chronologically, with specific visual details, explicit camera direction, and no conflicting instructions. And once your prompt is well-crafted, use the guidance parameters to tune how strongly the model responds to it.

Start with the prompting guide for foundational techniques. Experiment with the Distilled model for fast iteration. Use image conditioning when visual consistency matters. Layer in LoRAs for precise camera and motion control.

The gap between what you describe and what you get shrinks every time you make your prompts more specific. Try it in the LTX-2.3 playground.
