Conditional Generative Model For Video

Generate video from text, images, and audio. LTX translates structured inputs into coherent, production-grade visual output.

Key Capabilities

  • Multi-signal conditioning

    Text prompts, reference images, audio inputs, and keyframes. Each signal shapes a different dimension of the output. Language drives narrative, images anchor composition, audio defines rhythm.
  • Precise controllability

    Camera behavior, motion dynamics, visual style, and scene composition all respond directly to structured input. Conditioning inputs don't just influence the output; they control it.
  • Stable, repeatable results

    Same inputs, predictable outputs across every run. LTX stays reliable for enterprise pipelines, iterative workflows, and scalable deployment.

Directed creative production

Generate video with precise creative control by conditioning on scripts, mood boards, and reference frames. Text drives the narrative, image conditioning anchors the visual identity, and structured inputs replace manual keyframing.

Brand-consistent marketing content

Condition video generation on brand assets, style references, and product imagery to produce content that stays visually consistent. Keep image conditioning fixed and adjust text prompts to iterate across variations fast.
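
As a rough illustration of that workflow, the sketch below holds one reference image fixed and varies only the text prompt. The field names (`prompt`, `reference_images`, `seed`) and the file path are hypothetical and chosen for readability; they are not taken from the LTX API. The code only assembles request descriptions and does not call any service.

```python
from typing import Dict, List, Optional


def build_variations(
    reference_image: str,
    prompts: List[str],
    seed: Optional[int] = None,
) -> List[Dict[str, object]]:
    """Return one request description per prompt, all sharing the same image.

    The dictionary keys are illustrative placeholders, not documented fields.
    """
    variations: List[Dict[str, object]] = []
    for prompt in prompts:
        request: Dict[str, object] = {
            "prompt": prompt,
            # The image conditioning stays identical across every variation.
            "reference_images": [reference_image],
        }
        if seed is not None:
            # Reusing one seed keeps the prompt as the only thing that changes.
            request["seed"] = seed
        variations.append(request)
    return variations


batch = build_variations(
    "brand/hero_product.png",  # hypothetical asset path
    [
        "Slow orbit around the product on a marble counter, soft morning light",
        "Fast push-in on the product against a neon city backdrop at night",
    ],
    seed=42,
)
for item in batch:
    print(item["prompt"])
```

Fixing the seed as well as the image is a design choice here: it keeps the text prompt as the single variable between runs.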

Audio-driven visual storytelling

Condition video on voice-over, music, or sound design to sync visual motion with audio structure. Built for music videos, podcast visualizations, and narrative content where timing follows the track.

Research & model evaluation

Use LTX to study prompt adherence, cross-modal behavior, temporal consistency, and conditioning strength on a production-grade open-source foundation model.

Built For

Production-ready video generation built for real-world deployment.

Builders

Product teams, AI startups, and developers building AI-powered video features. Add production-grade video generation as a product capability, not a research project. One API, production-ready results, and no custom orchestration.

Producers at scale

Brands, agencies, and creative teams producing high volumes of content. Turn existing assets into video at scale. Faster iteration, lower production cost, and more output from what you already have.

On-prem operators

Teams that require full control over deployment and data. Run video generation in your own environment. On-premises, no cloud dependency, and full infrastructure ownership.

Platform teams

Platforms powering creative tools with multiple AI models. Upgrade your video output with a best-in-class engine. Improve generation quality, retain users, and differentiate with a model built for production, not prototypes.

How it works

Input

LTX accepts multiple conditioning signals at once. Text is the primary control layer. Images, audio, and keyframes act as additional dimensions that refine and constrain the output.

Technical characteristics:

  • Text prompt (required): Natural-language description of actions, scenes, camera behavior, and visual style. The primary signal that drives narrative structure and visual flow.
  • Reference images (optional): One or more images to anchor visual composition, style, color palette, or subject identity. Use to maintain consistency across generations or match a specific creative direction.
  • Audio input (optional): Voice, music, or sound effects that condition motion timing, pacing, and scene transitions. Synchronizes the visual output to rhythm and spoken narrative.
  • Keyframes (optional): Specific frames that define start states, end states, or intermediate composition targets. Frame-level control without manual animation.
  • Generation parameters (optional): Resolution up to 4K, frame rate up to 50 FPS, duration up to ~20 seconds, seed for reproducibility, and inference steps for quality control.
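
The limits listed above can be checked before a job is submitted. The following is a minimal sketch under stated assumptions: the parameter names (`width`, `height`, `fps`, `duration_seconds`, `seed`, `inference_steps`) are hypothetical, the "~20 seconds" figure is treated as a hard cap of 20, and the 4K limit is assumed to mean 3840×2160 in landscape orientation.

```python
from dataclasses import dataclass
from typing import List, Optional

# Limits taken from the list above; the exact duration cap is an assumption,
# since the page only says "~20 seconds".
MAX_WIDTH, MAX_HEIGHT = 3840, 2160
MAX_FPS = 50
MAX_DURATION_SECONDS = 20.0


@dataclass
class GenerationParams:
    """Hypothetical container for the optional generation parameters."""

    width: int
    height: int
    fps: int
    duration_seconds: float
    seed: Optional[int] = None
    inference_steps: Optional[int] = None

    def problems(self) -> List[str]:
        """Return human-readable limit violations (empty list if none)."""
        found: List[str] = []
        if self.width > MAX_WIDTH or self.height > MAX_HEIGHT:
            found.append(
                f"resolution {self.width}x{self.height} exceeds "
                f"{MAX_WIDTH}x{MAX_HEIGHT}"
            )
        if self.fps > MAX_FPS:
            found.append(f"frame rate {self.fps} exceeds {MAX_FPS} FPS")
        if self.duration_seconds > MAX_DURATION_SECONDS:
            found.append(
                f"duration {self.duration_seconds}s exceeds "
                f"{MAX_DURATION_SECONDS}s"
            )
        return found


ok = GenerationParams(width=3840, height=2160, fps=50, duration_seconds=20.0, seed=7)
too_big = GenerationParams(width=4096, height=2160, fps=60, duration_seconds=30.0)
print(ok.problems())
print(too_big.problems())
```

Returning a list of problems rather than raising on the first one lets a caller report every violation at once.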

Output

A single coherent video that reflects all conditioning inputs. Text drives scene structure, image conditioning maintains visual identity, and audio aligns motion to sound. All signals work together.

Technical characteristics:

  • Format: MP4, ready for playback, editing, or pipeline integration.
  • Resolution: Up to 3840×2160 native 4K. Generated at target resolution, not upscaled.
  • Frame rate: Up to 50 FPS for smooth, cinematic motion.
  • Duration: Up to ~20 seconds per generation. Longer sequences are possible by chaining conditioned generations.
  • Quality: Cinematic-grade output with strong temporal consistency, style coherence, and high fidelity to all conditioning inputs.
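
The chaining mentioned under Duration can be planned ahead of time. This sketch splits a target length into segments of at most 20 seconds and records which earlier segment each one would take its starting keyframe from. Using the previous segment's last frame as the next start keyframe is an assumption about how chaining might be done, and the dictionary keys are hypothetical.

```python
from typing import Dict, List

# "~20 seconds per generation" is treated here as a hard cap (assumption).
MAX_SEGMENT_SECONDS = 20.0


def plan_segments(
    total_seconds: float,
    max_segment: float = MAX_SEGMENT_SECONDS,
) -> List[Dict[str, object]]:
    """Split a target duration into consecutive segments no longer than max_segment."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    segments: List[Dict[str, object]] = []
    remaining = total_seconds
    while remaining > 1e-9:  # small tolerance against float residue
        length = min(max_segment, remaining)
        index = len(segments)
        segments.append(
            {
                "index": index,
                "duration_seconds": length,
                # The first segment has no predecessor; later ones would be
                # conditioned on the final frame of the segment before them.
                "start_keyframe_from": None if index == 0 else index - 1,
            }
        )
        remaining -= length
    return segments


plan = plan_segments(45.0)
for segment in plan:
    print(segment)
```

For a 45-second target this yields three segments of 20, 20, and 5 seconds, with the second and third each pointing back at the segment before them.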

Image-to-Video

LTX-2.3
Pro

For detailed, stable motion derived from a still image. Best for high-quality sequences, storytelling, and production use.

URL path:
/v1/image-to-video
Pricing:
  • 1920×1080: $0.08/sec
  • 2560×1440: $0.16/sec
  • 3840×2160: $0.32/sec
Notes:
  • Uses the Pro rendering path for maximum fidelity.
  • Ideal when visual consistency is critical.
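
To make the per-second rates concrete, here is a small cost estimate for the image-to-video Pro prices listed above. It assumes billing is strictly linear in output seconds, with no minimum charge or rounding, which this page does not state. The function and variable names are illustrative.

```python
from decimal import Decimal

# Per-second rates in USD for /v1/image-to-video (Pro), as listed above.
IMAGE_TO_VIDEO_PRO_RATES = {
    (1920, 1080): Decimal("0.08"),
    (2560, 1440): Decimal("0.16"),
    (3840, 2160): Decimal("0.32"),
}


def estimate_cost(width: int, height: int, seconds: float) -> Decimal:
    """Estimate the cost of one generation, assuming linear per-second billing."""
    rate = IMAGE_TO_VIDEO_PRO_RATES.get((width, height))
    if rate is None:
        raise ValueError(f"no listed rate for {width}x{height}")
    # Decimal(str(...)) avoids binary floating-point noise in money math.
    return rate * Decimal(str(seconds))


print(estimate_cost(1920, 1080, 10))
print(estimate_cost(3840, 2160, 10))
```

Under that assumption, a 10-second clip costs $0.80 at 1920×1080 and $3.20 at 3840×2160; each step up in resolution doubles the rate.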

Text-to-Video

LTX-2
Pro

Optimized for higher fidelity and increased temporal stability. Best for production-ready output and final renders.

URL path:
/v1/text-to-video
Pricing:
  • 1920×1080: $0.06/sec
  • 2560×1440: $0.12/sec
  • 3840×2160: $0.24/sec
Notes:
  • Ideal for client-facing content or polished deliverables.
  • Higher compute level → higher visual quality.
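
Read the other way around, the text-to-video rates above tell you how much footage a fixed budget covers. The sketch below assumes linear per-second billing with no minimums, which this page does not confirm, and rounds down to a tenth of a second. The names are illustrative.

```python
from decimal import Decimal, ROUND_DOWN

# Per-second rates in USD for /v1/text-to-video (Pro), as listed above.
TEXT_TO_VIDEO_PRO_RATES = {
    (1920, 1080): Decimal("0.06"),
    (2560, 1440): Decimal("0.12"),
    (3840, 2160): Decimal("0.24"),
}


def seconds_within_budget(budget_usd: str, width: int, height: int) -> Decimal:
    """Return how many seconds of output a budget covers, rounded down to 0.1 s."""
    rate = TEXT_TO_VIDEO_PRO_RATES.get((width, height))
    if rate is None:
        raise ValueError(f"no listed rate for {width}x{height}")
    seconds = Decimal(budget_usd) / rate
    return seconds.quantize(Decimal("0.1"), rounding=ROUND_DOWN)


for resolution in TEXT_TO_VIDEO_PRO_RATES:
    print(resolution, seconds_within_budget("3.00", *resolution))
```

With a $3.00 budget this gives 50.0 seconds at 1920×1080, 25.0 at 2560×1440, and 12.5 at 3840×2160. Since a single generation is limited to roughly 20 seconds, the longer totals would be spread across several generations.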