- LTX-2 introduces native audio-to-video generation via a dual-stream diffusion transformer architecture that synchronizes video to audio during generation rather than aligning them in post-production.
- The A2VidPipelineTwoStage accepts an audio file plus a text prompt, with synchronization quality controlled primarily by the modality_scale parameter; higher values produce tighter audio-visual sync.
- Available via CLI and Python API, the pipeline supports WAV, MP3, M4A, and OGG inputs and can be combined with image conditioning to anchor visual identity while audio drives motion and timing.
Creating video that matches audio has traditionally required frame-by-frame manual alignment. Whether you are producing a music visualization, generating visuals for a podcast, or building audio-reactive content for social media, the process of synchronizing video to an audio track demands significant manual effort. The audio exists, the vision exists, but bridging the two means either tedious keyframing or accepting that the video and audio will feel disconnected.
LTX-2 changes this with native audio-to-video generation. As the first DiT-based audio-video foundation model, LTX-2 processes audio and video through a unified architecture rather than treating them as separate tracks to be stitched together after the fact. The A2VidPipelineTwoStage accepts an audio file as input and generates video that is temporally synchronized to that audio from the start.
This tutorial covers how the audio-to-video pipeline works, how to run it from the CLI and Python API, how to tune the guidance parameters that control audio-visual synchronization, and practical strategies for getting the best results across different use cases.
How Audio-to-Video Generation Works in LTX-2
The A2VidPipelineTwoStage is a two-stage generation pipeline that takes an audio file as its primary conditioning signal. In Stage 1, the pipeline generates video at half the target resolution with audio conditioning. During this stage, the model performs video-only denoising while the audio latents remain frozen, ensuring that the visual generation is driven by the audio signal rather than developing independently. Stage 2 then upsamples the result by 2x and refines the video while keeping the audio fixed, using a distilled LoRA for the refinement pass.
The audio input follows a specific encoding path. The input audio is converted to mel-spectrograms and encoded via the audio VAE into latent space, which serves as the initial audio latent for the diffusion process. The original audio waveform is passed through and returned in the output to preserve fidelity. This means the audio you hear in the final output is your original audio, not a re-synthesis.
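To make the front half of this path concrete, the sketch below loads an audio file and computes a mel-spectrogram with torchaudio. It is purely illustrative: the sample rate, number of mel bins, and hop length used by LTX-2's audio VAE are internal to the model, and the values here are assumptions.

import torchaudio
import torchaudio.transforms as T

# Load the input audio; waveform has shape [channels, samples].
waveform, sample_rate = torchaudio.load("input_audio.wav")

# Convert the waveform to a mel-spectrogram. The n_fft / hop_length / n_mels values
# below are illustrative assumptions, not the settings of LTX-2's audio VAE.
mel_transform = T.MelSpectrogram(
    sample_rate=sample_rate,
    n_fft=1024,
    hop_length=256,
    n_mels=128,
)
mel = mel_transform(waveform)  # shape: [channels, n_mels, time_frames]

# In the pipeline, a representation like this is encoded by the audio VAE into the
# initial audio latent, while the original waveform is carried through unchanged
# and attached to the final video.
print(mel.shape)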
The Dual-Stream Architecture
LTX-2 is built as an asymmetric dual-stream diffusion transformer with 14 billion parameters for video and 5 billion for audio. The two streams interact through 48 shared transformer blocks with cross-modal attention that keeps them synchronized. The audio stream uses 1D temporal RoPE positional encoding while the video stream uses 3D RoPE, and both are decoded through their respective VAEs. The audio VAE produces mel-spectrograms that are decoded through a HiFi-GAN vocoder to 24 kHz stereo output.
This shared-block architecture is what enables true audio-to-video generation rather than separate generation followed by alignment. The cross-modal attention ensures that the video frames are conditioned on the audio content at every step of the diffusion process.
Running Audio-to-Video from the CLI
The A2VidPipelineTwoStage is available as a command-line module, making it straightforward to integrate into scripted workflows. Here is the basic command structure:
python -m ltx_pipelines.a2vid_two_stage \
--checkpoint-path path/to/ltx-2.3-22b-dev.safetensors \
--distilled-lora path/to/distilled_lora.safetensors 0.8 \
--spatial-upsampler-path path/to/upsampler.safetensors \
--gemma-root path/to/gemma \
--audio-path path/to/input_audio.wav \
--audio-start-time 0.0 \
--audio-max-duration 5.0 \
--prompt "A drummer performing on stage under blue spotlights, energetic crowd in the background" \
--output-path output_a2v.mp4
The three arguments specific to the audio-to-video pipeline are --audio-path (the input audio file), --audio-start-time (where to begin reading the audio, in seconds), and --audio-max-duration (the maximum duration of audio to use). The pipeline also accepts a text prompt that describes the visual content you want generated alongside the audio.
The pipeline accepts WAV, MP3, M4A, and OGG audio files. Note that PCM-encoded WAV files are not supported; use a compressed codec such as AAC-LC or FLAC.
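If your source audio is a PCM WAV file, one straightforward fix is to re-encode it before handing it to the pipeline. The snippet below shells out to ffmpeg to produce AAC-LC audio in an M4A container; it assumes ffmpeg is installed and is not part of the LTX-2 tooling itself.

import subprocess

# Re-encode a PCM WAV file as AAC-LC (ffmpeg's default AAC profile) in an M4A container,
# one of the formats the pipeline accepts.
subprocess.run(
    ["ffmpeg", "-i", "input_pcm.wav", "-c:a", "aac", "input_audio.m4a"],
    check=True,
)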
Prerequisites
Before running the pipeline, make sure your environment meets the requirements below and download the required model components from the LTX-2 open-source repository:
• CUDA 13+ is required to run LTX-2
• LTX-2.3 model checkpoint (either ltx-2.3-22b-dev.safetensors for maximum quality or ltx-2.3-22b-distilled.safetensors for faster inference)
• Spatial upscaler (ltx-2.3-spatial-upscaler-x2-1.0.safetensors) for the Stage 2 upsampling pass
• Distilled LoRA (ltx-2.3-22b-distilled-lora-384.safetensors) for the refinement stage
• Gemma 3 text encoder for processing text prompts
The Video VAE requires frame counts that satisfy (F-1) % 8 == 0 (valid counts: 9, 17, 25, 33, 41, 49, 57, 65, 73, 81, 89, 97). When using --audio-max-duration, the pipeline selects a valid frame count automatically.
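If you want to reason about clip length yourself, for example when choosing --audio-max-duration, a small helper like the one below snaps a target duration to a valid frame count. The 25 fps frame rate is an assumption for illustration only; as noted above, the pipeline performs this selection automatically.

def valid_frame_count(duration_s: float, fps: float = 25.0) -> int:
    """Largest frame count <= duration_s * fps that satisfies (F - 1) % 8 == 0, clamped to 9..97."""
    target = int(duration_s * fps)
    # Snap down to the nearest count of the form 8k + 1.
    frames = ((target - 1) // 8) * 8 + 1
    return max(9, min(frames, 97))

# Example: 5 seconds at the assumed 25 fps gives 125 target frames,
# which snaps to 121 and then clamps to the maximum supported count of 97.
print(valid_frame_count(5.0))  # -> 97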
You can view all available options by running:
python -m ltx_pipelines.a2vid_two_stage --help
Using the Pipeline Programmatically
For developers building custom audio-to-video workflows or batch processing systems, the A2VidPipelineTwoStage is accessible through the Python API. This gives you full control over guidance parameters, conditioning inputs, and generation settings.
The pipeline accepts the same classifier-free guidance parameters as other LTX-2 pipelines through MultiModalGuiderParams. The key difference for audio-to-video is that you configure guidance independently for both the video and audio modalities:
from ltx_core.components.guiders import MultiModalGuiderParams
# Video guider: moderate CFG, STG enabled, modality isolation
video_guider_params = MultiModalGuiderParams(
cfg_scale=3.0,
stg_scale=1.0,
rescale_scale=0.7,
modality_scale=3.0,
stg_blocks=[29],
)
# Audio guider: higher CFG for stronger prompt adherence
audio_guider_params = MultiModalGuiderParams(
cfg_scale=7.0,
stg_scale=1.0,
rescale_scale=0.7,
modality_scale=3.0,
stg_blocks=[29],
)
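These guider configurations are then passed to the pipeline along with the audio file and prompt. The construction and call below are a rough sketch inferred from the CLI arguments shown earlier, not a verified API: the keyword argument names (checkpoint_path, audio_path, audio_start_time, audio_max_duration, and so on) are assumptions and should be checked against the repository before use.

from ltx_pipelines.a2vid_two_stage import A2VidPipelineTwoStage  # import path assumed from the CLI module name

# Hypothetical construction and invocation; argument names are assumptions.
pipeline = A2VidPipelineTwoStage(
    checkpoint_path="path/to/ltx-2.3-22b-dev.safetensors",
    distilled_lora="path/to/distilled_lora.safetensors",
    spatial_upsampler_path="path/to/upsampler.safetensors",
    gemma_root="path/to/gemma",
)

result = pipeline(
    prompt="A drummer performing on stage under blue spotlights",
    audio_path="path/to/input_audio.wav",
    audio_start_time=0.0,
    audio_max_duration=5.0,
    video_guider_params=video_guider_params,
    audio_guider_params=audio_guider_params,
)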
Combining Audio with Image Conditioning
The A2VidPipelineTwoStage supports image conditioning alongside audio input. You can provide an image at a specific frame to anchor the visual generation while the audio drives the temporal dynamics. This is particularly useful when you have a reference frame (such as a character portrait or scene-establishing shot): the image anchors the visual identity while the audio drives the motion and timing.
Tuning Audio-Visual Synchronization
The quality of audio-visual sync in the generated video is primarily controlled by three guidance parameters. Understanding how each one works gives you precise control over the relationship between the audio input and the visual output.
The modality_scale Parameter
This is the most important parameter for audio-to-video generation. modality_scale steers the model away from unsynced video and audio results, improving audio-visual coherence. When generating video with audio, set modality_scale greater than 1.0 (a value of 3.0 is a good starting point) to improve audio-visual sync. Setting it to 1.0 disables modality guidance entirely, which means the model will not actively enforce synchronization between the two streams.
Higher values produce tighter sync but may reduce visual diversity. Lower values give the model more creative freedom in the visual output but risk the video drifting from the audio's rhythm and dynamics.
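Because modality_scale trades sync tightness against visual variety, it is worth sweeping a few values on the same audio segment and seed when dialing in a new use case. The sketch below only varies the video guider configuration; the remaining fields mirror the example shown earlier.

from ltx_core.components.guiders import MultiModalGuiderParams

# 1.0 disables modality guidance, 3.0 is the suggested starting point,
# and higher values enforce tighter sync at the cost of visual diversity.
for modality_scale in (1.0, 3.0, 5.0):
    video_guider_params = MultiModalGuiderParams(
        cfg_scale=3.0,
        stg_scale=1.0,
        rescale_scale=0.7,
        modality_scale=modality_scale,
        stg_blocks=[29],
    )
    # ...run the pipeline with these video guider params and compare the results.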
Balancing cfg_scale for Audio vs Video
cfg_scale controls classifier-free guidance, which determines how strongly the output adheres to the text prompt. Typical values range from 2.0 to 5.0. For audio-to-video, you set this independently for the video and audio guiders. A higher cfg_scale on the audio guider (for example, 7.0) ensures the audio generation adheres closely to the prompt, while a moderate value on the video guider (for example, 3.0) provides a balance between prompt adherence and natural motion.
Using stg_scale for Temporal Coherence
stg_scale controls spatio-temporal guidance, which improves temporal coherence by perturbing specific transformer blocks (set via stg_blocks, typically [29] for the last block) and steering generation away from the perturbed prediction. Typical values range from 0.5 to 1.5. For audio-driven content where visual consistency between frames matters, keeping stg_scale at 1.0 or above helps prevent visual drift across the generated sequence.
The rescale_scale parameter (typical values: 0.5 to 0.7) complements these by rescaling the guided prediction to match the variance of the conditional prediction, which helps prevent over-saturation in the generated frames.
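The rescaling operation itself is conceptually simple. The snippet below shows the common formulation of guidance rescaling: match the guided prediction's standard deviation to that of the conditional prediction, then blend the two by rescale_scale. It is a conceptual illustration, not LTX-2's exact implementation.

import torch

def rescale_guidance(pred_cond: torch.Tensor, pred_guided: torch.Tensor, rescale_scale: float) -> torch.Tensor:
    # Rescale the guided prediction so its standard deviation matches the conditional one.
    rescaled = pred_guided * (pred_cond.std() / pred_guided.std())
    # Blend between the rescaled and the raw guided prediction.
    return rescale_scale * rescaled + (1.0 - rescale_scale) * pred_guided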
Practical Use Cases
Audio-to-video generation fits naturally into several production workflows. For music visualization, feed the pipeline an audio clip and describe the visual aesthetic you want. The model generates video where motion and lighting respond to the audio content. Write prompts that describe mood and visual style rather than literal interpretations of the music.
For social media content, use the --audio-max-duration parameter to select a 5 to 10 second segment of your audio and generate pre-synchronized video clips. This eliminates the post-production step of manually aligning video cuts to audio beats. For spoken-word content like podcasts, combine audio conditioning with image conditioning to maintain a consistent visual identity while the motion and scene details are driven by the audio dynamics.
Tips for Best Results
Write text prompts that complement the audio. A mismatch between prompt tone and audio character will produce results where the model struggles to reconcile the two conditioning signals. Use --audio-start-time and --audio-max-duration to select audio segments with clear rhythmic structure or dynamic variation.
When experimenting, use the distilled model for faster feedback. The distilled pipeline runs with 8 predefined sigmas and provides quick results. Switch to the dev model for the final render. To reduce GPU memory usage, pass the --quantization fp8-cast flag and set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True in your environment.
Getting Started
The A2VidPipelineTwoStage is available in the LTX-2 open-source repository alongside all other pipelines. If you have already set up LTX-2 for text-to-video or image-to-video generation, you have the same model components needed for audio-to-video. Point the pipeline at your audio file, write a prompt that describes the visual content you want, configure the modality guidance for tight audio-visual sync, and start generating.
For developers looking to improve output quality, the artifact reduction guide covers complementary techniques for reducing visual artifacts that apply to audio-to-video generation as well.
