- LTX-2.3's A2VidPipelineTwoStage takes an audio file and optional reference image as inputs, generating a talking avatar where lip movements and facial motion are driven by the audio signal through cross-modal attention across 48 transformer blocks.
- The key parameter for lip sync quality is modality_scale: set it above 1.0 (start at 3.0) to enforce audio-visual coherence, with separate cfg_scale values for the audio guider (7.0) and video guider (3.0) to balance prompt adherence against natural motion.
- The open-source pipeline runs on 32GB GPUs with FP8 quantization, giving full control over avatar appearance, generation parameters, and output at scale, without the fixed avatars and per-video pricing of other hosted platforms.
Turning a voice recording into a video of a talking person used to require motion capture rigs, manual lip sync animation, or a monthly subscription to a hosted platform. Audio to video AI has changed the equation: give a model an audio file and a reference image, and it generates a video where the avatar speaks in sync with the audio.
LTX-2.3 includes the A2VidPipelineTwoStage, a dedicated pipeline that takes an audio file as its primary conditioning input and generates video where motion, lip movements, and visual rhythm follow the audio signal. This guide walks through building a talking avatar from audio, from preparing your inputs to tuning lip sync quality.
What Are AI Talking Avatars?
An AI talking avatar is a video of a person (or character) speaking, generated entirely by a model rather than captured by a camera. The avatar's lip movements, facial expressions, and head motion are driven by an input audio file. Unlike text-to-speech avatar tools that generate both the voice and the video from text, audio to video AI takes an existing audio recording and creates the visual component to match it.
How Audio-to-Video AI Works
The process starts with converting the audio signal to a mel-spectrogram and encoding that spectral representation into a latent the diffusion model can use as conditioning input. In LTX-2.3, the Audio VAE processes a mel-spectrogram of the input audio and encodes it into latent space, and this latent is used alongside a text prompt and optional image conditioning to guide the video generation process. The model learns the relationship between audio features (phonemes, rhythm, energy) and visual motion patterns (mouth shapes, head tilt, expressions), producing frames where the avatar appears to speak the audio naturally.
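To make that first step concrete, here is a minimal sketch of computing a log-mel spectrogram with librosa. The sample rate, mel band count, and hop length below are illustrative assumptions, not the Audio VAE's actual preprocessing parameters.

```python
# Illustrative only: compute a log-mel spectrogram of a voice recording.
# The parameters (16 kHz, 128 mel bands, hop of 256) are assumptions for
# demonstration, not LTX-2.3's actual Audio VAE preprocessing.
import librosa
import numpy as np

audio, sr = librosa.load("speech.wav", sr=16000, mono=True)
mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=128, hop_length=256)
log_mel = librosa.power_to_db(mel, ref=np.max)

# Shape is (mel_bands, time_frames); a representation like this is what the
# Audio VAE encodes into the latent that conditions video generation.
print(log_mel.shape)
```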
SaaS Avatars vs Open-Source Pipelines
Hosted platforms like Synthesia and HeyGen offer talking avatar creation through a web interface, but come with limitations: fixed avatar appearances, per-video pricing, no access to the underlying model, and content policies that restrict certain use cases. Building your own pipeline with an open-source model gives you full control over the avatar's appearance, generation parameters, and output quality. Your avatar designs and generated output stay on your hardware, you can fine-tune the model and integrate it into production workflows, and generating at scale costs only GPU time.
How LTX-2.3 Handles Audio-to-Video Generation
LTX-2.3's audio-to-video capability is built into the A2VidPipelineTwoStage, a two-stage pipeline specifically designed for audio-conditioned video generation.
The A2VidPipelineTwoStage Architecture
The pipeline operates in two stages. Stage 1 generates video at half resolution with audio conditioning. During this stage, the model performs video-only denoising with the audio latent frozen, meaning the audio representation stays fixed while the video is synthesized to match it. Stage 2 upsamples the result to 2x resolution and refines the video while keeping the audio fixed, using a distilled LoRA for faster processing. The original audio waveform is passed through and returned in the output to preserve fidelity, so the final video retains the exact audio you provided.
Audio VAE and Audio Conditioning
The Audio VAE first converts your input audio to a mel-spectrogram, then encodes that spectral representation into a latent that serves as the initial audio latent for generation. This encoding captures the temporal structure of the audio: speech patterns, pauses, emphasis, and rhythm. The model's 48 transformer blocks process both video and audio tokens through bidirectional cross-modal attention, which means the video generation is continuously informed by the audio signal at every step of the denoising process. The result is temporal consistency between what the viewer sees and what they hear.
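As a schematic of the idea only (not LTX-2.3's actual block implementation), joint attention over a concatenated sequence of video and audio tokens lets every video token attend to the audio and vice versa; the dimensions below are placeholders chosen for illustration.

```python
# Schematic illustration of bidirectional cross-modal attention: video and audio
# tokens are concatenated into one sequence and attended jointly. This is not
# LTX-2.3's implementation; shapes and sizes are made up.
import torch
import torch.nn as nn

d_model, n_video, n_audio = 64, 256, 64
video_tokens = torch.randn(1, n_video, d_model)   # (batch, tokens, channels)
audio_tokens = torch.randn(1, n_audio, d_model)

attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)
tokens = torch.cat([video_tokens, audio_tokens], dim=1)   # one joint sequence
out, _ = attn(tokens, tokens, tokens)                     # video attends to audio and vice versa

video_out = out[:, :n_video]   # updated video tokens, now informed by the audio signal
print(video_out.shape)         # torch.Size([1, 256, 64])
```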
Image Conditioning for Avatar Appearance
Adding a reference image gives you precise control over how the avatar looks. The image is encoded and used to condition the generation at specific frames, anchoring the avatar's appearance to your reference. This is how you create consistent, recognizable talking avatars rather than random faces that vary between generations.
Step-by-Step: Build a Talking Avatar from Audio
Prerequisites
Before running the pipeline, you need four things: the LTX-2.3 model checkpoint (ltx-2.3-22b-dev.safetensors), the spatial upsampler (upsampler.safetensors), the Gemma text encoder, and a distilled LoRA. Clone the repository and install dependencies:
git clone https://github.com/Lightricks/LTX-2.git && cd LTX-2 && uv sync --frozen
System and GPU requirements: CUDA 13+ is required. The full model targets GPUs with 80GB+ VRAM. With FP8 quantization (--quantization fp8-cast), the distilled variant runs on 32 GB GPUs.
If you don't have access to a high-VRAM GPU, the LTX API provides a managed audio-to-video endpoint at docs.ltx.video that handles generation server-side without any local hardware setup.
Step 1: Prepare Your Audio Input
The A2VidPipelineTwoStage accepts audio files through the --audio-path argument. Two additional parameters control which portion of the audio to use: --audio-start-time sets the offset into the file (in seconds) and --audio-max-duration limits the duration. For a talking avatar, a clean voice recording without background music or noise produces the best results. The model extracts features from the audio to drive lip movements, so clarity in the voice signal directly affects sync quality.
Accepted audio formats include WAV (with AAC-LC, MP3, Vorbis, or FLAC codecs), MP3, M4A (AAC-LC), and OGG (Opus or Vorbis). PCM-encoded WAV files are not supported; use a compressed codec like AAC-LC or FLAC instead.
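If you are unsure what you have, a quick sketch like the one below reports the clip's duration (useful for setting --audio-start-time and --audio-max-duration) and re-encodes a PCM WAV recording to AAC-LC in an M4A container. It assumes the soundfile package and an ffmpeg binary are available; file names are placeholders.

```python
# Sketch: inspect a recording and re-encode PCM WAV to AAC-LC (M4A), one of the
# accepted formats. Assumes soundfile and ffmpeg are installed.
import soundfile as sf
import subprocess

info = sf.info("speech.wav")
print(f"duration: {info.duration:.2f}s, sample rate: {info.samplerate} Hz, channels: {info.channels}")

# Re-encode to AAC-LC in an M4A container (ffmpeg's native AAC encoder produces AAC-LC).
subprocess.run(
    ["ffmpeg", "-y", "-i", "speech.wav", "-c:a", "aac", "-b:a", "192k", "speech.m4a"],
    check=True,
)
```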
Step 2: Create or Select a Reference Image
Your reference image defines the avatar's appearance. Use a clear, well-lit photograph or illustration of the face you want the avatar to use. The image conditioning system encodes this image and uses it to anchor the generated video's visual identity. For consistency across multiple generations, use the same reference image and keep the random seed constant.
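A small preparation step also helps keep the reference consistent between runs. The sketch below center-crops and resizes with Pillow; the 1280x720 target is an assumption chosen to match a typical output aspect ratio, not a requirement documented by the pipeline, and the file names are placeholders.

```python
# Sketch: center-crop and resize a reference image with Pillow.
# The 1280x720 target is an assumption; match it to your intended output resolution.
from PIL import Image

img = Image.open("avatar_reference.png").convert("RGB")
target_w, target_h = 1280, 720

# Scale so the image covers the target, then center-crop to the exact size.
scale = max(target_w / img.width, target_h / img.height)
resized = img.resize((round(img.width * scale), round(img.height * scale)), Image.LANCZOS)
left = (resized.width - target_w) // 2
top = (resized.height - target_h) // 2
resized.crop((left, top, left + target_w, top + target_h)).save("avatar_reference_prepared.png")
```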
Step 3: Configure and Run the A2Vid Pipeline
Run the pipeline from the command line with your audio file, reference image, and a text prompt that describes the desired scene:
python -m ltx_pipelines.a2vid_two_stage \
  --checkpoint-path /path/to/ltx-2.3-22b-dev.safetensors \
  --distilled-lora /path/to/distilled_lora.safetensors 0.8 \
  --spatial-upsampler-path /path/to/upsampler.safetensors \
  --gemma-root /path/to/gemma \
  --audio-path /path/to/speech.wav \
  --prompt "A person speaking directly to the camera in a well-lit studio" \
  --output-path talking_avatar.mp4 \
  --num-frames 97 \
  --frame-rate 25
The frame count must satisfy (F-1) % 8 == 0 (valid counts: 9, 17, 25, 33, 41, 49, 57, 65, 73, 81, 89, 97). For a 97-frame generation at 25fps, the output is approximately 3.9 seconds. Match the audio duration to the frame count to avoid trailing silence or clipped speech.
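As a sketch of that bookkeeping, the helper below picks the largest valid frame count that fits a given audio duration. The (F-1) % 8 == 0 constraint comes from the pipeline; the clamping logic and default limits are illustrative assumptions.

```python
# Sketch: pick a valid frame count ((F - 1) % 8 == 0) that fits the audio clip,
# so the generated video neither clips the speech nor trails off into silence.
def frame_count_for(audio_duration_s: float, frame_rate: int = 25, max_frames: int = 97) -> int:
    frames = min(int(audio_duration_s * frame_rate), max_frames)
    frames = ((frames - 1) // 8) * 8 + 1   # round down to the nearest valid count
    return max(frames, 9)                  # 9 is the smallest valid count

print(frame_count_for(3.9))   # 97 -> about 3.9 s at 25 fps
print(frame_count_for(2.0))   # 49 -> about 1.96 s at 25 fps
```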
Step 4: Evaluate and Iterate
Assess lip sync accuracy (mouth movement matching audio phonemes), temporal smoothness (no jumps or flickers between frames), and visual identity consistency throughout the clip. If any of these need improvement, adjust the guidance parameters described in the next section.
Optimizing Lip Sync Quality
Multimodal Guidance Parameters for Audio-Video Sync
The multimodal guidance system gives you independent control over how the model balances text, audio, and video coherence. The key parameter for lip sync is modality_scale, which steers the model away from results where the video drifts out of sync with the audio. Setting it above 1.0 (start with 3.0) improves audio-visual coherence. When generating audio-driven video, this parameter matters more than cfg_scale.
Audio Guider vs Video Guider Settings
LTX-2.3 uses separate guider configurations for the video and audio modalities. For talking avatar generation, consider these starting values:
- Audio guider: cfg_scale 7.0
- Video guider: cfg_scale 3.0
- modality_scale: 3.0
The higher cfg_scale on the audio guider (7.0 vs 3.0) strengthens prompt adherence on the audio side while keeping video motion natural. Increase modality_scale if you see the avatar moving but not in sync with the speech.
Common Lip Sync Issues and Fixes
If the avatar's mouth moves but does not match the words, the modality_scale is too low. Increase it in increments of 1.0. If the video looks static or the avatar barely moves, cfg_scale on the video guider is too high, over-constraining the text conditioning at the expense of audio-driven motion. Lower it to 2.0-2.5. If you notice temporal flickering, increase stg_scale to 1.5 to strengthen spatio-temporal coherence, or add block 28 to stg_blocks for broader perturbation coverage.
Advanced Techniques
Prompt Enhancement for Better Avatar Motion
The text prompt complements the audio conditioning. Instead of describing what the avatar says, describe how they look and move: "A person speaking enthusiastically to the camera, subtle hand gestures, warm studio lighting." The audio drives the speech; the prompt drives the scene, lighting, and non-speech motion.
Combining Audio-to-Video with IC-LoRA
After generating your talking avatar, you can use the ICLoraPipeline for video-to-video style transfer. This lets you create talking avatars in illustration styles, anime aesthetics, or brand-specific visual treatments while keeping the lip sync from the original generation.
Using FP8 Quantization for Local Inference
Running the full 22B-parameter model requires significant VRAM. FP8 quantization reduces the memory footprint by storing transformer weights in 8-bit format. Add --quantization fp8-cast to your CLI command and set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to enable it.
Use Cases for AI Talking Avatars
Education and training benefit from consistent presenters. Generate a talking avatar once, then produce new lessons by swapping the audio file. Multilingual delivery becomes straightforward: record the same script in different languages and generate a matching avatar for each. Marketing teams can produce personalized product demos at scale, where 100 videos cost the same per-unit compute as one. Accessibility applications include generating avatar companions for audio-only content and creating dubbed versions where lip movements match the dubbed language.
Building a talking avatar pipeline with LTX-2.3 gives you audio to video AI generation without platform dependencies. The A2VidPipelineTwoStage handles core generation, multimodal guidance parameters let you tune lip sync quality, and the open-source architecture means every part of the pipeline is customizable.
