Conditional Generative Model For Video

Generate video from text, images, and audio. LTX translates structured inputs into coherent, production-grade visual output.

Try LTX-2.3 Now

Key Capabilities

Multi-signal conditioning
Text prompts, reference images, audio inputs, and keyframes. Each signal shapes a different dimension of the output. Language drives narrative, images anchor composition, audio defines rhyth
Precise controllability
Camera behavior, motion dynamics, visual style, and scene composition all respond directly to structured input. Conditioning inputs don't just influence the output, they control it.
Stable, repeatable results
Same inputs, predictable outputs across every run. LTX stays reliable for enterprise pipelines, iterative workflows, and scalable deployment.

Directed creative production

Generate video with precise creative control by conditioning on scripts, mood boards, and reference frames. Text drives the narrative, image conditioning anchors the visual identity, and structured inputs replace manual keyframing.

Try LTX Models

Brand-consistent marketing content

Condition video generation on brand assets, style references, and product imagery to produce content that stays visually consistent. Keep image conditioning fixed and adjust text prompts to iterate across variations fast.

Try LTX Models

Audio-driven visual storytelling

Condition video on voice-over, music, or sound design to sync visual motion with audio structure. Built for music videos, podcast visualizations, and narrative content where timing follows the track.

Try LTX Models

Research & model evaluation

Use LTX to study prompt adherence, cross-modal behavior, temporal consistency, and conditioning strength on a production-grade open-source foundation model.

Try LTX Models

How it works

Try LTX-2 Now

Input:

Text prompt (required) — Natural-language description of actions, scenes, camera behavior, and visual style. The primary signal that drives narrative structure and visual flow.
Reference images (optional) — One or more images to anchor visual composition, style, color palette, or subject identity. Use to maintain consistency across generations or match a specific creative direction.
Audio input (optional) — Voice, music, or sound effects that condition motion timing, pacing, and scene transitions. Synchronizes the visual output to rhythm and spoken narrative.
Keyframes (optional) — Specific frames that define start states, end states, or intermediate composition targets. Frame-level control without manual animation.‍
Generation parameters (optional) — Resolution up to 4K, frame rate up to 50 FPS, duration up to ~20 seconds, seed for reproducibility, and inference steps for quality control.

Output:

Format — MP4, ready for playback, editing, or pipeline integration.
Resolution — Up to 3840x2160 native 4K. Generated at target resolution, not upscaled.
Frame rate — Up to 50 FPS for smooth, cinematic motion.
Duration — Up to ~20 seconds per generation. Longer sequences through chained conditioned generations.‍
Quality — Cinematic-grade output with strong temporal consistency, style coherence, and high fidelity to all conditioning inputs.

Built For

Production-ready video generation built for real-world deployment.

Builders

Product teams, AI startups, and developers building AI-powered video features. Add production-grade video generation as a product capability, not a research project. One API, production-ready results, and no custom orchestration.

Producers at scale

Brands, agencies, and creative teams producing high volumes of content. Turn existing assets into video at scale. Faster iteration, lower production cost, and more output from what you already have.

On-prem operators

Teams that require full control over deployment and data. Run video generation in your own environment. On-premises, no cloud dependency, and full infrastructure ownership.

Platform teams

Platforms powering creative tools with multiple AI models. Upgrade your video output with a best-in-class engine. Improve generation quality, retain users, and differentiate with a model built for production, not prototypes.

How it works

Input

LTX accepts multiple conditioning signals at once. Text is the primary control layer. Images, audio, and keyframes act as additional dimensions that refine and constrain the output.

Technical characteristics:

Text prompt (required) — Natural-language description of actions, scenes, camera behavior, and visual style. The primary signal that drives narrative structure and visual flow.
Reference images (optional) — One or more images to anchor visual composition, style, color palette, or subject identity. Use to maintain consistency across generations or match a specific creative direction.
Audio input (optional) — Voice, music, or sound effects that condition motion timing, pacing, and scene transitions. Synchronizes the visual output to rhythm and spoken narrative.
Keyframes (optional) — Specific frames that define start states, end states, or intermediate composition targets. Frame-level control without manual animation.‍
Generation parameters (optional) — Resolution up to 4K, frame rate up to 50 FPS, duration up to ~20 seconds, seed for reproducibility, and inference steps for quality control.

Try LTX Models

Output

A single coherent video that reflects all conditioning inputs. Text drives scene structure, image conditioning maintains visual identity, audio aligns motion to sound. All signals work together.

Technical characteristics:

Format — MP4, ready for playback, editing, or pipeline integration.
Resolution — Up to 3840x2160 native 4K. Generated at target resolution, not upscaled.
Frame rate — Up to 50 FPS for smooth, cinematic motion.
Duration — Up to ~20 seconds per generation. Longer sequences through chained conditioned generations.‍
Quality — Cinematic-grade output with strong temporal consistency, style coherence, and high fidelity to all conditioning inputs.

Try LTX Models

Conditional Generative Model Pricing

See All Plans

Image-to-Video

LTX-2.3

Pro

For detailed, stable motion derived from a still image. Best for high-quality sequences, storytelling, and production use.

URL path:

/v1/image-to-video

Pricing:

1920×1080 — $0.08/sec
2560×1440 — $0.16/sec
3840×2160 — $0.32/sec

Notes:

Uses the Pro rendering path for maximum fidelity.
Ideal when visual consistency is critical.

Get Started

Text-to-Video

LTX-2

Pro

Optimized for higher fidelity and increased temporal stability. Best for production-ready output and final renders.

URL path:

/v1/text-to-video

Pricing:

1920×1080 — $0.06/sec
2560×1440 — $0.12/sec
3840×2160 — $0.24/sec

Notes:

Deal for client-facing content or polished deliverables.
Higher compute level → higher visual quality.

Get Started

FAQs

What is a conditional generative model?

A conditional generative model produces output based on structured inputs that guide the generation process. Unlike unconditional models that generate from random noise alone, conditional models accept signals like text, images, and audio to control what gets created. LTX uses all three to produce coherent video.

How does LTX use conditional generation?

LTX interprets a text prompt as the primary conditioning signal, translating descriptions of actions, scenes, camera movement, and visual style into video. Reference images, keyframes, and audio refine and constrain the output further. All signals work together to give you precise control over the result.

What conditioning inputs does LTX accept?

LTX accepts text prompts (required), reference images, audio inputs, and keyframes. These can be used individually or in combination. Text controls narrative and motion, images anchor style and composition, audio drives timing and pacing.

What is the difference between conditional and unconditional generation?

Unconditional generation produces output from random noise with no user control. Conditional generation uses structured inputs to steer the model toward a specific target. In video, that is the difference between a random clip and a scene that matches your creative brief.

What video quality can I expect?

LTX generates video at up to native 4K resolution (3840x2160) and 50 FPS, with cinematic-grade motion, strong temporal consistency, and high fidelity to all conditioning inputs.

Can I condition video generation on audio?

Yes. LTX supports audio-conditioned generation where voice, music, or sound effects influence motion timing, pacing, and scene transitions. Combine with text and image inputs for synchronized, multimodal generation.

How does LTX compare to other conditional generative models?

LTX is built for production deployment, not research demonstration. It prioritizes native 4K output, strong prompt adherence, scalable inference, and predictable conditioning behavior. Open weights allow direct inspection, fine-tuning, and custom pipeline development, unlike closed-source alternatives.

Conditional Generative Model For Video

Key Capabilities

Multi-signal conditioning

Precise controllability

Stable, repeatable results

Directed creative production

Brand-consistent marketing content

Audio-driven visual storytelling

Research & model evaluation

How it works

Input:

Output:

Built For

How it works

Input

Output

Conditional Generative Model Pricing

Image-to-Video

Text-to-Video

About LTX Models

FAQs

What is a conditional generative model?

How does LTX use conditional generation?

What conditioning inputs does LTX accept?

What is the difference between conditional and unconditional generation?

What video quality can I expect?

Can I condition video generation on audio?

How does LTX compare to other conditional generative models?

Products

Company

Resources

Social

Legal

Legal