
A Guide To LTX-2.3 Audio: How Synchronized Audio-Video Generation Works

Learn how LTX-2.3 generates synchronized audio and video in a single model, with audio-to-video workflows, pipeline setup, and quality tips.

LTX Team
Key Takeaways:
  • LTX-2.3 is the first DiT-based model to generate synchronized audio and video in a single pass, using a dual-stream architecture where both modalities share 48 transformer blocks with bidirectional cross-modal attention — eliminating the drift of sequential bolt-on audio pipelines.
  • The A2VidPipelineTwoStage accepts an audio file as conditioning input and generates matching video, returning your original audio waveform unmodified alongside the generated visuals — the model drives visual motion from the audio signal, not the reverse.
  • Set modality_scale above 1.0 (start at 3.0) for tighter audio-visual sync, include explicit sound cues in text prompts to direct audio generation, and ensure input audio is clean since distortion propagates through the mel-spectrogram encoding into the output.

Most AI video generators treat audio as an afterthought. Generate the video first, then bolt on audio with a separate model, a separate pipeline, and a separate set of problems. The sync is approximate. The sonic character rarely matches the visual tone. You end up spending as much time fixing audio drift as you did generating the original video.

LTX-2.3 takes a different approach. Audio and video are generated simultaneously in a single diffusion pass. The model doesn't generate audio to match video or video to match audio — it generates both together, from the same shared representation. This guide explains how that architecture works, what it means in practice, and how to get the best results from synchronized audio-video generation.

Architecture Overview: How LTX-2.3 Generates Audio and Video Together

The Shared Transformer Architecture

LTX-2.3 is a diffusion transformer (DiT) model with two streams that share 48 transformer blocks. The video stream has 14 billion parameters. The audio stream has 5 billion. Both streams process their respective modalities in parallel, and the cross-stream attention within those shared blocks allows audio and video to condition each other during generation.

What this means in practice: the model doesn't decide what the audio should sound like after it knows what the video looks like. Both are denoised together from the same noise state, with each stream influencing the other through every denoising step.
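
To make the idea concrete, here is a minimal PyTorch sketch of what one such shared block could look like: each stream self-attends, then queries the other stream, so the two modalities condition each other at every denoising step. The layer sizes, norm placement, and overall structure are illustrative assumptions, not the actual LTX-2.3 implementation.

```python
import torch
import torch.nn as nn

class DualStreamBlock(nn.Module):
    """Illustrative sketch of one shared audio-video transformer block.

    Each stream self-attends, then attends to the other stream, so video and
    audio condition each other at every denoising step. Layer sizes, norms,
    and structure are placeholder assumptions, not LTX-2.3 internals.
    """

    def __init__(self, dim: int = 1024, heads: int = 16):
        super().__init__()
        self.video_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.video_from_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_from_video = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.video_mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.audio_mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, video_tokens: torch.Tensor, audio_tokens: torch.Tensor):
        # Self-attention within each modality.
        v, _ = self.video_self(video_tokens, video_tokens, video_tokens)
        a, _ = self.audio_self(audio_tokens, audio_tokens, audio_tokens)
        video_tokens = video_tokens + v
        audio_tokens = audio_tokens + a
        # Bidirectional cross-modal attention: video queries audio, audio queries video.
        v, _ = self.video_from_audio(video_tokens, audio_tokens, audio_tokens)
        a, _ = self.audio_from_video(audio_tokens, video_tokens, video_tokens)
        video_tokens = video_tokens + v
        audio_tokens = audio_tokens + a
        # Per-stream feed-forward, with residual connections throughout.
        return video_tokens + self.video_mlp(video_tokens), audio_tokens + self.audio_mlp(audio_tokens)
```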

Input Processing

The video input goes through a Video VAE that applies spatial compression (8x8) and temporal compression (8 frames per latent) before it enters the transformer. The audio input goes through an Audio VAE. The Audio VAE encodes audio as mel spectrograms, compresses them, and passes the compressed representation to the audio stream. A HiFi-GAN vocoder handles the final conversion from audio latents back to waveform.

Both VAEs produce latent representations that flow through the same shared transformer blocks. The positional encoding is 3D for the video stream (spatial x, spatial y, temporal) and 2D for the audio stream (frequency and temporal). The shared temporal dimension is what allows the model to learn tight synchronization between visual and acoustic events.
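
As a rough illustration of what those compression factors mean for latent sizes, the arithmetic below assumes a 512x768, 97-frame clip; the resolution and patching details are assumptions for illustration, not published LTX-2.3 internals.

```python
# Back-of-the-envelope latent sizes under the compression factors described above.
height, width, frames = 512, 768, 97   # pixel-space video, 97 frames at 25 fps

latent_h = height // 8                 # 8x spatial compression   -> 64
latent_w = width // 8                  # 8x spatial compression   -> 96
latent_t = (frames - 1) // 8 + 1       # 8 frames per latent step -> 13

print(f"video latent grid (t, y, x): {latent_t} x {latent_h} x {latent_w}")
# Video tokens get 3D positions (t, y, x); audio latents get 2D positions
# (frequency, t). The shared t axis is what the model aligns across modalities.
```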

Audio VAE and HiFi-GAN Details

The Audio VAE processes audio at 16 kHz input, compressed through a multi-scale encoder. The encoder produces a compact representation of the audio's temporal and spectral structure. During generation, the decoder reconstructs the mel spectrogram from the audio latents, and HiFi-GAN converts the spectrogram back to a full-bandwidth waveform at 24 kHz. This is a standard vocoder architecture applied to the specific demands of synchronized video generation.
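
The sketch below traces that audio path at a high level with torchaudio: resample to 16 kHz, compute a mel spectrogram, then hand off to the VAE and vocoder stages (described in comments only). The mel parameters are common defaults chosen for illustration, not the values LTX-2.3's Audio VAE actually uses.

```python
import torchaudio

# High-level trace of the audio path. Mel-spectrogram parameters are illustrative.
waveform, sr = torchaudio.load("reference.wav")
waveform = torchaudio.functional.resample(waveform, sr, 16_000)  # model ingests 16 kHz audio

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16_000, n_fft=1024, hop_length=256, n_mels=80
)(waveform)  # shape: (channels, n_mels, time)

# From here, the Audio VAE encoder would compress the spectrogram into audio
# latents, the shared transformer would denoise them alongside the video latents,
# the decoder would reconstruct a mel spectrogram, and HiFi-GAN would render the
# final 24 kHz waveform.
```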

What "Synchronized" Actually Means

Synchronization in audio-video generation is not just about timing. A video of footsteps that generates footstep sounds, timed correctly, is synchronized at the event level. But real-world events have an acoustic character that depends on material properties, spatial context, and the dynamics of the motion producing the sound. A soft footstep on carpet sounds different from a sharp one on tile. A door closing in a small room sounds different from the same door in a large space.

Because LTX-2.3's audio and video streams share transformer blocks with bidirectional cross-stream attention, the audio generation has access to the visual context during generation — not just the event timing. The model can learn associations between visual properties and acoustic properties during training, and apply them at generation time.

Whether the model captures these finer acoustic-visual relationships depends on training data and model scale. The architecture makes it possible. The prompt makes it controllable.

Audio-to-Video vs Text-to-Audio-Video: The Two Generation Modes

Text-to-Audio-Video (TI2VidTwoStagesPipeline)

The primary generation mode is text-to-audio-video: you provide a text prompt describing the scene, and the model generates both the video and the synchronized audio from scratch. This mode uses the full TI2VidTwoStagesPipeline with audio generation enabled; audio-driven generation uses the separate A2VidPipelineTwoStage, covered in the next section.

Prompts for audio-video generation should describe both visual and sonic content explicitly:

• Visual content: scene, subject, motion, lighting, camera behavior

• Audio content: sound source, acoustic character, spatial quality, relative mix

Examples of audio-specific prompt language that improves audio quality:

• "echoing footsteps on stone" (material + acoustic environment)

• "crackling fire with distant wind" (primary sound + ambient layer)

• "dry mechanical click of a camera shutter" (timbre + material quality)

Audio-to-Video (A2VidPipelineTwoStage)

The second mode is audio-to-video generation: you provide a reference audio clip, and the model generates video synchronized to it. This is useful for music visualization, generating video to accompany a voiceover, or creating visual content that matches a specific sound sequence.

The reference audio is encoded by the Audio VAE and used as conditioning for both the video and audio streams. The generated video reflects the temporal structure of the input audio, though the exact correspondence depends on audio content and prompt guidance.
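
As a sketch, an audio-to-video call could look like the following; again, the import path, loading call, and argument names are illustrative assumptions rather than the documented API.

```python
# Hypothetical audio-to-video call; every argument name below is an assumption.
from ltx_video.pipelines import A2VidPipelineTwoStage  # import path is an assumption

pipeline = A2VidPipelineTwoStage.from_pretrained("path/to/ltx-2.3-checkpoint")

result = pipeline(
    audio="voiceover_trimmed.wav",  # reference clip, trimmed to the target duration
    prompt="Close-up of a podcast host speaking in a small, acoustically dry studio",
    num_frames=97,
    frame_rate=25,
)
# The returned audio is your original waveform; only the video is generated to match it.
```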

Practical Settings for Audio-Video Generation

Frame Rate and Audio Duration Alignment

LTX-2.3 video frame counts must satisfy (F-1) % 8 == 0. Valid frame counts: 9, 17, 25, 33, 41, 49, 57, 65, 73, 81, 89, 97. At 25 fps, 97 frames equals approximately 3.88 seconds of video.

Audio duration should match video duration. If you are doing audio-to-video generation with a reference audio clip, trim the audio to match your target frame count before running the pipeline. Mismatched durations cause the model to either truncate audio or generate video beyond the audio's scope, both of which degrade synchronization quality.
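
A small helper like the one below (a sketch, assuming torchaudio for file I/O) can pick the nearest valid frame count and trim the reference clip to match before you run the pipeline.

```python
import torchaudio

def nearest_valid_frame_count(target_seconds: float, fps: int = 25, max_frames: int = 97) -> int:
    """Largest frame count F <= max_frames with (F - 1) % 8 == 0 that fits in target_seconds."""
    frames = min(int(target_seconds * fps), max_frames)
    return ((frames - 1) // 8) * 8 + 1

def trim_audio_to_frames(path: str, out_path: str, num_frames: int, fps: int = 25) -> None:
    """Trim a reference clip so its duration matches the target video length."""
    waveform, sr = torchaudio.load(path)
    num_samples = int(round(num_frames / fps * sr))
    torchaudio.save(out_path, waveform[:, :num_samples], sr)

frames = nearest_valid_frame_count(4.0)  # -> 97, about 3.88 s at 25 fps
trim_audio_to_frames("reference.wav", "reference_trimmed.wav", frames)
```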

Audio Prompt Engineering

The model uses the text prompt to condition both streams. For better audio quality, treat audio description as a first-class element of the prompt, not an afterthought appended at the end:

• Put audio description in the same sentence as the visual event it accompanies

• Describe sound source and acoustic context together: "waves crashing on a rocky shore with hollow resonance"

• For ambient audio, specify both the primary sound and the spatial character: "forest at night with distant crickets and a slight reverb from the tree cover"

Using the Distilled Model for Audio Generation

The distilled model checkpoint (ltx-2.3-22b-distilled.safetensors) works for audio-video generation through the A2VidPipelineTwoStage. The distilled model's two-stage generation (8 predefined sigmas in stage 1, 4 in stage 2) applies to the combined audio-video generation. Audio quality from the distilled model is slightly lower than from the dev checkpoint, but the speed advantage (inference in 12 steps vs 40+) makes it appropriate for rapid iteration.
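
A rapid-iteration loop with the distilled checkpoint might look like this sketch. The loading call and argument names are assumptions, and the two-stage sigma schedules (8 sigmas, then 4) are predefined by the distilled model rather than chosen by the caller.

```python
# Rapid-iteration sketch with the distilled checkpoint; API details are assumptions.
from ltx_video.pipelines import A2VidPipelineTwoStage  # import path is an assumption

pipeline = A2VidPipelineTwoStage.from_pretrained("path/to/ltx-2.3-22b-distilled.safetensors")

result = pipeline(
    audio="reference_trimmed.wav",
    prompt=(
        "Rain on a tin roof, slow push-in on a lit window at dusk; "
        "soft metallic patter with a low rumble of distant thunder"
    ),
    num_frames=97,
    frame_rate=25,
)
```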

Common Issues and How to Address Them

Audio-Video Drift

If generated audio appears temporally offset from corresponding visual events, verify that frame count and audio duration are aligned before generation. Drift most commonly occurs when the audio reference clip used for conditioning has a different duration than the target video length. Use valid LTX-2.3 frame counts and match your audio clip to those durations.

Low Audio Quality

Indistinct or low-quality audio generation usually comes from underspecified prompts. If the text prompt doesn't describe the acoustic environment, the model generates generic ambient audio that may not match the visual context well. Add material-specific and spatial descriptors to the audio portion of your prompt.

Synchronized but Semantically Mismatched Audio

If the audio is synchronized at the event level but sounds wrong for the visual content (e.g., correct timing but wrong acoustic character), the prompt may not be specific enough about material properties or spatial context. Try adding environment descriptors: "in a large hall" vs "in a small room", "on dry pavement" vs "on wet gravel". These guide both the visual and acoustic generation toward consistent acoustic character.

Conclusion

Synchronized audio-video generation works because LTX-2.3's shared transformer architecture denoises both modalities simultaneously, allowing each to condition the other at every generation step. The result is audio that is acoustically coherent with the visual content rather than bolted on after the fact. Getting the best results requires treating audio description as a first-class element of your prompts, matching frame counts to valid LTX-2.3 values, and aligning audio and video durations before running audio-to-video generation.

The LTX-2 repository includes full pipeline documentation for both A2VidPipelineTwoStage and TI2VidTwoStagesPipeline with audio enabled.
