- LTX-2.3 uses temporal masking via the LTXV Set Audio Video Mask By Time node to enforce lip sync — locking the audio latent forces generated video frames to physically align with the audio, driving mouth movement to match phonemes.
- Frame count must be calculated precisely before tuning any other parameter: use audio_duration_seconds × 25, then round to the nearest valid 8n+1 value. An incorrect frame count is the most common cause of sync drift.
- Keep text prompts focused on visual appearance, lighting, and character description — never describe motion or speech pacing, as these conflict with audio conditioning and degrade sync quality.
Understanding Lip Sync in AI Video Generation
Lip synchronization isn't just cosmetic. Dialogue-heavy content—voiceovers, interviews, character dialogue in narrative films—requires mouth movements that align with audio timing to within a frame or two. When audio and video drift, viewers immediately perceive the mismatch. The content feels artificial, and trust erodes.
Why is this hard? Video generation models traditionally operated in pixel space or visual latent space. Audio lived in a completely different domain—mel-spectrograms, waveforms, or audio embeddings. Conditioning one modality on the other required architectural bridges that earlier models either lacked or implemented weakly.
LTX-2.3 changed this. The model was trained on large-scale audio-video pairs with explicit conditioning on audio structure. When you feed audio into the generation pipeline, the model doesn't guess how mouths should move—it directly samples video that respects the acoustic timing of the input audio.
LTX-2.3's Lip Sync Mechanism: Temporal Masking
The core lip sync mechanism in LTX-2.3 is temporal masking via the LTXV Set Audio Video Mask By Time node. Understanding how this works is essential to achieving consistent, high-quality sync.
How Temporal Masking Works
The workflow encodes the reference image and audio into latent space separately — the image via LTXVImgToVideoInplace (or VAE Encode for image-to-video), and the audio via LTXV Audio VAE Encode. These latents are then merged using LTXVConcatAVLatent, and the combined latents are passed into the mask node with mask_audio=False and mask_video=True.
Setting mask_audio=False is the critical step: it freezes the audio latent, turning it into a direct physical constraint on the diffusion process. Because LTX-2 is a joint audio-video model, locking the audio forces the generated video frames to align with the audio — driving mouth movement to match phonemes frame by frame. This is the core mechanic that makes lip sync work.
The first sampling pass uses LTXVNormalizingSampler. For the final export step, decode the video latents and connect the original clean audio file directly to VHS Video Combine — bypassing the AI-processed audio preserves quality in the final MP4.
The MultimodalGuider's Role
The MultimodalGuider is present in the workflow and exposes per-modality control via cfg, stg, and modality_scale parameters. It is not, however, the primary lip sync control — that is temporal masking. Think of the MultimodalGuider as controlling the overall guidance balance, while LTXV Set Audio Video Mask By Time is what actually enforces audio-video alignment at the frame level.
Step-by-Step: Setting Up Audio-to-Video for Lip Sync in ComfyUI
LTX-2.3's audio-to-video workflow is built for ComfyUI. This section walks you through each step with specific node names and configurations.
Preparing Your Audio Input
Audio quality directly impacts sync quality. Before feeding audio into the model, prepare it correctly:
- Format: WAV or FLAC, 16-bit or 24-bit, mono channel. The model expects single-channel audio — stereo tracks should be mixed down to mono before input.
- Sample rate: 24 kHz is the native conditioning rate. Audio at other sample rates (44.1 kHz, 48 kHz) will be resampled automatically, but starting at 24 kHz avoids interpolation artifacts.
- Duration: Match your target video duration exactly. Mismatched duration forces the pipeline to stretch or truncate, degrading sync.
- Content: Clean dialogue or voiceover with minimal background noise. Remove room tone, breath artifacts, and long silences (>2 seconds) between phrases.
- Levels: Normalize to -3 dB to -6 dB peak. Avoid clipping (above 0 dB) or excessively quiet passages (below -30 dB).
For music or ambient audio, less pre-processing is needed. The model is more tolerant of variation in non-speech content.
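A quick way to verify this checklist before loading audio into ComfyUI is a small stdlib-only script. This is an illustrative sketch, not part of any LTX tooling — the function names and the exact level thresholds are assumptions drawn from the checklist above, and the peak check is only implemented for 16-bit PCM.

```python
import math
import struct
import wave


def peak_dbfs(samples, sample_width_bytes=2):
    """Peak level in dBFS for a sequence of signed integer PCM samples."""
    full_scale = float(2 ** (8 * sample_width_bytes - 1))
    peak = max((abs(s) for s in samples), default=0)
    return 20.0 * math.log10(peak / full_scale) if peak else float("-inf")


def check_wav(wav_file):
    """Return a list of problems; an empty list means the file matches the checklist."""
    problems = []
    with wave.open(wav_file, "rb") as w:
        channels, rate, width = w.getnchannels(), w.getframerate(), w.getsampwidth()
        raw = w.readframes(w.getnframes())
    if channels != 1:
        problems.append("not mono: mix stereo down to a single channel")
    if rate != 24000:
        problems.append(f"sample rate is {rate} Hz; 24 kHz avoids resampling artifacts")
    if width not in (2, 3):
        problems.append("use 16-bit or 24-bit PCM")
    if width == 2 and channels == 1:
        # Unpack little-endian signed 16-bit samples and check peak level.
        samples = struct.unpack(f"<{len(raw) // 2}h", raw)
        peak = peak_dbfs(samples, 2)
        if peak > -3.0:
            problems.append(f"peak {peak:.1f} dBFS is hot; normalize to -3 to -6 dB")
        elif peak < -30.0:
            problems.append(f"peak {peak:.1f} dBFS is too quiet; raise the level")
    return problems
```

Running `check_wav` on your prepared file before every render is cheap insurance against silent resampling or level issues downstream.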
Calculating Frame Count (Do This Before Anything Else)
This is the most important step in the entire workflow. An incorrect frame count is the most common cause of sync drift — more common than any parameter misconfiguration. Do not skip this.
LTX-2.3 processes audio at exactly 25 latents per second. Your frame count must be calculated to match your audio duration precisely, and it must satisfy the 8n+1 rule: valid frame counts are 1, 9, 17, 25, 33... 97, 121, 161, 257, etc. (values of the form 8n+1).
Formula: total_frames = audio_duration_seconds × 25, then round to the nearest valid 8n+1 value.
Example: 5 seconds of audio → 125 frames → round to 121 (the nearest valid 8n+1 value below 125).
Why this matters: If your frame count doesn't match the audio length, sync will drift progressively across the video — it starts aligned and slowly falls out of time. If your frame count is not a valid 8n+1 value, you'll encounter pipeline errors or unexpected output. Set this correctly before touching any other parameter.
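The rounding step is easy to get wrong by hand. A minimal helper (hypothetical, not part of the workflow files) that applies the formula and snaps to the nearest 8n+1 value, rounding down on ties to match the worked example (125 → 121):

```python
def frame_count(audio_seconds: float, latents_per_second: int = 25) -> int:
    """Nearest valid 8n+1 frame count for a given audio duration.

    LTX-2.3 processes audio at 25 latents per second, and frame counts
    must satisfy the 8n+1 rule (1, 9, 17, 25, ...). Ties round down,
    matching the worked example in the text (5 s -> 125 -> 121).
    """
    raw = audio_seconds * latents_per_second
    lower = max(int((raw - 1) // 8) * 8 + 1, 1)  # largest 8n+1 value <= raw
    upper = lower + 8                            # smallest 8n+1 value > raw
    # Pick whichever is closer; on a tie, prefer the lower value.
    return lower if (raw - lower) <= (upper - raw) else upper
```

For example, `frame_count(5.0)` gives 121, matching the 5-second example above, and `frame_count(4.0)` gives 97.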
Loading the Workflow in ComfyUI
There is no official LTX audio-to-video workflow JSON in the LTX-2 repository or the ComfyUI-LTXVideo repository. The community has filled this gap — the widely recommended starting points are Purzbeats' Audio-to-Video Extension workflow (ltx2-audio_to_video_extension_5x.json) and Benji's AI Playground tutorial, both of which are built around the correct temporal masking mechanism.
Download one of these community workflows and open it in ComfyUI.
The workflow includes pre-configured nodes for:
- Audio loading and preprocessing (resampling, mono conversion)
- Image and audio encoding into separate latent spaces
- Latent concatenation and temporal masking
- Normalized sampling for the first pass
- VAE decoding for final video output
- Clean audio connection to video combine for export
Critical nodes to identify:
- LTXVCheckpointLoader — Loads the LTX-2.3 checkpoint (Distilled or Full)
- LTXV Audio VAE Encode — Encodes raw audio into the latent space
- LTXVConcatAVLatent — Merges image and audio latents
- LTXV Set Audio Video Mask By Time — The core lip sync control (set mask_audio=False, mask_video=True)
- LTXVNormalizingSampler — The sampling node for the first pass
- VHS Video Combine — Final export node; connect your original clean audio here directly
Configuring the Audio Conditioning Pipeline
Once the workflow is open, configure these settings:
Audio and Image Encoding:
- Load your prepared WAV/FLAC file using the LoadAudio node
- Connect it to the LTXV Audio VAE Encode node
- Load your reference image and connect it to LTXVImgToVideoInplace (or VAE Encode for I2V)
- Merge both encoded latents using LTXVConcatAVLatent
Temporal Masking Configuration:
- Connect the combined latents to LTXV Set Audio Video Mask By Time
- Set mask_audio=False — this freezes the audio latent as a physical constraint
- Set mask_video=True
- This configuration is what drives lip sync
Frame Count:
- Calculate your frame count using the formula above before proceeding
- Set the frame count in the sampler node to your calculated 8n+1 value
- Verify this is correct before rendering
Frame Rate Configuration:
- Default frame rate is 24 fps, which matches the model's training data distribution
- 24 fps provides the best lip sync accuracy
- Set frame rate in the sampler or video output node — ensure it matches the rate you used to calculate total frames
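Before rendering, it is worth a quick sanity check that your frame count, export frame rate, and audio duration agree. The sketch below is illustrative — the function name and the two-frame tolerance are assumptions, not values from the workflow:

```python
def check_timing(total_frames: int, fps: float, audio_seconds: float,
                 tolerance_frames: float = 2.0) -> bool:
    """True if the video duration (frames / fps) stays within
    `tolerance_frames` of the audio track's length."""
    drift_seconds = total_frames / fps - audio_seconds
    return abs(drift_seconds) * fps <= tolerance_frames
```

For instance, 121 frames exported at 24 fps against 5 seconds of audio passes (about one frame of slack), while an oversized count like 161 frames fails immediately — catching the mistake before a long render.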
Final Export:
- In the VHS Video Combine node, connect your original clean audio file directly — do not use the AI-processed audio output. This preserves audio quality in the final MP4.
Tuning Parameters for Optimal Mouth Synchronization
Once your workflow is correctly configured with temporal masking and the right frame count, parameter tuning refines output quality.
Model and Sampling Configuration
For audio-to-video lip sync specifically, community testing has validated the following:
- Use the distilled model — not the full model. For this workflow, the distilled model at higher CFG and more steps outperforms what the general guidance would suggest.
- CFG: 4 — this is the recommended value for audio-to-video. It yields the best prompt adherence and tightest lip sync.
- Steps: 40 — the validated step count for audio-to-video with the distilled model.
Note: these are workflow-specific findings. The general guidance for LTX-2.3 (full model, lower CFG) does not apply here — the audio-to-video pipeline behaves differently.
Classifier-Free Guidance (CFG)
Recommended for lip sync content: 4
CFG affects how the model balances prompt instructions with audio conditioning. For audio-to-video workflows, CFG 4 with the distilled model produces the best results. Stay within the documented range: the ceiling for the full model is 5.0; anything above that is outside the documented operating range and should be avoided.
Eta (Denoising Strength): Refinement Control
Eta controls how much refinement happens during the sampling process.
Tested range for audio-video sync: 0.1–0.3. Community testing suggests 0.2 is a reliable starting point.
MultimodalGuider Parameters
The MultimodalGuider exposes cfg, stg, and modality_scale per modality. These are not lip sync controls — lip sync is controlled by the temporal masking step. Adjust these only if you have a specific reason to tune the guidance balance; leave them at defaults if sync quality is your primary concern.
Norming Threshold: Audio Influence Calibration
Norming threshold calibrates how strongly audio conditioning influences the generation process.
Typical range: 0.1–0.5
If you notice sudden jumps in mouth position or temporal artifacts, adjusting norming threshold can smooth them out. Start with default (typically 0.2–0.3) and only adjust if you observe issues.
Parameter Tuning Workflow
Start with these defaults (distilled model):
- CFG: 4
- Steps: 40
- Eta: 0.2
Generate a preview: Render at low resolution (480p) with a short clip duration and watch the output carefully.
If lip sync is tight and natural: You're done. Scale to final resolution.
If audio lags behind video (mouth movements come early): Verify your frame count first — this is the most likely cause. Recalculate and re-render before adjusting any other parameter.
If video lags behind audio (mouth movements come late): Again, check frame count first. If frame count is correct, verify mask_audio=False is set correctly in LTXV Set Audio Video Mask By Time.
If mouth movement looks stiff or unnatural: Increase eta slightly (to 0.25). Regenerate.
If temporal artifacts appear (frame jumps, repeated frames): Increase norming threshold to 0.3–0.4. Regenerate.
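The diagnostic branches above can be sketched as a small classifier over a measured audio/video offset at the start and end of a clip. This is an illustrative helper, not part of the workflow — the function name, sign convention, and the 40 ms tolerance are all assumptions:

```python
def diagnose_sync(offset_start_ms: float, offset_end_ms: float,
                  drift_tolerance_ms: float = 40.0) -> str:
    """Classify a measured mouth/audio offset at the start and end of a clip.

    An offset that grows across the clip is progressive drift, whose most
    likely cause is an incorrect frame count. A roughly constant offset
    points at masking or audio-alignment issues instead.
    """
    if abs(offset_end_ms - offset_start_ms) > drift_tolerance_ms:
        return "progressive drift: recalculate the 8n+1 frame count first"
    if abs(offset_start_ms) > drift_tolerance_ms:
        return "constant offset: check mask_audio=False and the audio track alignment"
    return "in sync"
```

The ordering mirrors the troubleshooting priority in this guide: frame count first, masking configuration second.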
Practical Workflow: Text Prompt Strategy for Lip Sync Content
Your text prompt shapes everything from character appearance to lighting to background. But for lip sync content, your prompting strategy must prioritize visible mouth movement and facial expression over motion descriptions.
What to Include in Prompts
Character mouth and face: Describe mouth shape, teeth visibility, and lip texture. This helps the model generate detailed mouth regions that sync well with audio.
Example: "Close-up of a woman's face, lips slightly parted, natural skin texture, neutral expression ready to speak."
Dialogue context: Hint at the emotional tone or intensity of speech. Is dialogue urgent, calm, emotional, or informative?
Example: "Actor delivering intense, urgent dialogue with visible emotion in facial expression."
Visual style and lighting: Define how the scene looks—professional studio lighting, natural outdoor light, cinematic color grading.
Example: "Soft studio lighting, warm color grading, shallow depth of field on face."
Background and environment: Keep this minimal for dialogue-focused scenes. A cluttered background distracts from mouth movement.
Example: "Neutral blurred background, focus sharp on face."
Full example prompt for dialogue scene:
Close-up of man's face, mouth naturally positioned, professional studio lighting with soft key light, neutral background, warm color grading, ready to deliver dialogue with clear articulation and visible lip movement. Natural skin texture, minimal makeup, professional appearance.
What to Avoid in Prompts
Do not describe motion, camera movement, or temporal action in dialogue-heavy lip sync content. The temporal masking mechanism already defines temporal behavior. Prompt descriptions of motion conflict with audio guidance and degrade sync quality.
Avoid these:
- "A person speaking quickly with animated gestures"
- "Camera slowly zooms in as the speaker delivers the line"
- "The character nods while talking and then turns away"
- "Rapid speech with excited hand movements"
- "Slow pan across the speaker's face as they speak softly"
Why? These motion descriptions are text-based. The audio input defines actual timing. If your prompt says "speaks rapidly" but your audio contains slow, measured speech, the conflict degrades output quality.
Instead, let audio define the pacing and motion, and use prompts purely for visual appearance.
Concrete Example Prompts
Scenario: Professional voiceover / narrator
Professional male voice actor, close-up of face from shoulders up, professional broadcast studio setting, crisp studio lighting, neutral background, dark suit and tie, serious professional expression, clear articulation visible in mouth movement, high-definition broadcast quality, warm professional color grading.
Scenario: Emotional dialogue / character acting
Woman's face in close-up, natural theatrical lighting, expressive eyes and mouth ready for emotional performance, neutral background with warm amber glow, beautiful natural skin, soft lighting on face, professional film quality, moment of emotional intensity visible in facial readiness.
Scenario: Educational content / instruction
Instructor at desk or lectern, face clearly visible, friendly and approachable expression, bright natural studio lighting, professional educational setting background, crisp video quality, mouth articulation clear and visible for educational delivery.
Troubleshooting: Common Lip Sync Issues and Solutions
Even with correct configuration, edge cases occur. This section diagnoses common issues and provides targeted fixes.
Problem: Mouth Out of Sync with Dialogue
Diagnosis: Mouth movements visibly lag or lead the audio. Phonemes (mouth shapes) don't match speech sounds.
Causes (check in this order):
- Incorrect frame count — by far the most common cause. If your frame count doesn't match your audio length or isn't a valid 8n+1 value, sync will drift. Verify using the formula: total_frames = audio_duration_seconds × 25, rounded to the nearest 8n+1 value.
- mask_audio is not set to False in the LTXV Set Audio Video Mask By Time node — the audio latent won't be frozen and sync won't be enforced
- Audio contains significant background noise or music that competes with the speech signal
- Audio sample rate mismatch causing resampling artifacts
Solutions (in order of likelihood):
- Recalculate and correct your frame count first
- Verify mask_audio=False and mask_video=True in the mask node
- Clean your audio — isolate the dialogue track, remove background noise
- Resample audio to 24 kHz before loading into the workflow
- If using I2V, ensure the reference image shows the character with a neutral or slightly open mouth — a closed-mouth reference frame can bias the model toward less mouth movement
Problem: Mouth Artifacts or Frame Jumps
Diagnosis: Mouth position suddenly jumps, frames repeat, or temporal jitter appears.
Causes:
- Norming threshold too low, amplifying small audio variations into large visual changes
- Audio contains sharp transients, plosives, or clipping that create sudden conditioning spikes
- Insufficient denoising steps, causing incomplete refinement of mouth detail
Solutions:
- Increase norming threshold to 0.3–0.4 to dampen audio conditioning spikes
- Apply de-essing and plosive reduction to your audio track before loading
- Increase eta slightly (to 0.25) for more refinement
- Increase total sampling steps for finer temporal resolution
Problem: Audio Dropout or Clipping in Output
Diagnosis: Audio is missing, very quiet, or distorted in the generated video file.
Causes:
- Input audio levels too low or clipped
- Incorrect audio format
- Audio file corrupted or truncated
- AI-processed audio used in the final export instead of the original clean file
Solutions:
- Normalize audio to -3 dB to -6 dB peak using Audacity or ffmpeg
- Re-export audio as 16-bit or 24-bit WAV at 24 kHz mono
- Avoid MP3 or other lossy formats — use WAV or FLAC
- Verify you are connecting the original clean audio file directly to VHS Video Combine in the export step
Problem: Overly Stiff or Unnatural Mouth Movement
Diagnosis: Mouth shapes look mechanical, articulation is over-pronounced, or movements lack fluidity.
Causes:
- Eta too low (below 0.1), insufficient refinement to add natural detail to mouth movement
- Text prompt describes specific mouth positions or speech patterns, conflicting with audio conditioning
- CFG too high (above 5.0), over-constraining visual generation
Solutions:
- Increase eta to 0.2–0.25 for additional refinement detail
- Remove any motion or speech descriptions from your text prompt — let audio define all temporal behavior
- Keep CFG at 4 for audio-to-video workflows
Comparison: Image-to-Video vs Text-to-Video for Lip Sync
LTX-2.3 supports both image-to-video (I2V) and text-to-video (T2V) audio conditioning. Each has trade-offs for lip sync quality.
Practical recommendation: For critical dialogue where lip sync must be perfect, use I2V with a carefully generated first frame. The visual anchor reduces variation and improves sync consistency. For iteration and exploration, use T2V with clear prompts.
Optimization: Distilled vs Full Model for Mouth Sync
LTX-2.3 offers two model variants optimized for different stages of production:
Distilled Model: Smaller, faster, lower VRAM requirements. For audio-to-video lip sync specifically, the distilled model at CFG 4 and 40 steps is the recommended configuration — not just for iteration, but for final renders. Community testing specific to this workflow finds it outperforms the full model for prompt adherence and sync tightness.
Full Model: Larger, slower, higher VRAM. Superior detail in many workflows, but for audio-to-video the distilled model is preferred.
When to Use Distilled
For audio-to-video lip sync, use the distilled model as your primary production model. Generate low-resolution previews (480p, short duration) to test parameter configurations quickly, then scale to final resolution with the same model and parameters.
Typical iteration cycle:
- Set parameters: CFG 4, steps 40, eta 0.2
- Calculate frame count before rendering
- Render at 480p for a short clip duration
- Review sync quality — watch with audio, check that mouth shapes match phonemes
- Adjust eta or norming threshold if needed
- Re-render and compare. Repeat until sync is tight and natural
- Scale to final resolution keeping all parameters identical
Advanced: Fine-Tuning Audio Preprocessing
For production-grade lip sync, audio quality matters as much as model configuration. Professional audio preparation increases sync consistency and reduces artifacts.
Preparing Clean Audio Tracks
Noise reduction: Use Audacity or similar tools to remove background hum, room noise, and wind. The model learns to sync with clean vocal/dialogue content; background noise confuses the audio conditioning.
Normalization: Set peak audio levels to -3 dB to -6 dB. Avoid clipping (levels above 0 dB). Avoid extremely quiet passages (below -30 dB).
Silence trimming: Remove or shorten long silences between speech. The model can handle brief pauses (0.5–1 second), but gaps longer than 2 seconds can cause the model to lose sync coherence.
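If you want to locate long silences programmatically before editing, a pure-Python pass over decoded PCM samples is enough. This is a sketch under stated assumptions — the amplitude threshold is arbitrary and should be tuned per track, and the helper name is hypothetical:

```python
def long_silences(samples, sample_rate, threshold=500, min_gap_seconds=2.0):
    """Find runs of near-silence longer than `min_gap_seconds`.

    `samples` are signed PCM values; a sample counts as silent when its
    absolute amplitude is below `threshold` (an assumption - tune per track).
    Returns (start_s, end_s) tuples of gaps you may want to trim or shorten.
    """
    gaps = []
    run_start = None
    for i, s in enumerate(samples):
        if abs(s) < threshold:
            if run_start is None:
                run_start = i  # a silent run begins here
        else:
            if run_start is not None and (i - run_start) / sample_rate >= min_gap_seconds:
                gaps.append((run_start / sample_rate, i / sample_rate))
            run_start = None
    # Handle a silent run that continues to the end of the track.
    if run_start is not None and (len(samples) - run_start) / sample_rate >= min_gap_seconds:
        gaps.append((run_start / sample_rate, len(samples) / sample_rate))
    return gaps
```

Pauses shorter than the 2-second cutoff are left alone, matching the guidance above that brief pauses are handled fine by the model.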
Dialogue vs Music vs Ambient Sound: Mixing Strategy for Sync
If your audio includes dialogue + background music, mix them intentionally.
Conclusion and Next Steps
Achieving tight lip synchronization in LTX-2.3 requires understanding three interconnected pieces: the temporal masking mechanism that drives audio-video alignment, the frame count math that prevents sync drift, and a prompt strategy that prioritizes visual appearance over motion description.
The workflow is straightforward once you internalize the core mechanic: encode image and audio separately, concatenate latents, freeze the audio with mask_audio=False — this is what forces the video to follow the audio. Get the frame count right first, then tune from there.
Start here:
- Set up the ComfyUI audio-to-video workflow (Purzbeats or Benji's AI Playground) with clean, 24 kHz mono audio
- Calculate your frame count: audio_seconds × 25, rounded to nearest 8n+1 value
- Configure the mask node: mask_audio=False, mask_video=True
- Use the distilled model: CFG 4, 40 steps, eta 0.2
- Write prompts that describe visual appearance only — no motion or speech pacing
- Connect your original clean audio directly to VHS Video Combine for the final export
Lip synchronization in AI video is no longer an unsolved problem. With LTX-2.3's joint audio-video architecture, temporal masking, and systematic parameter tuning, you can generate dialogue-heavy content with mouth-audio alignment that rivals traditional video production methods—and do it in minutes on local hardware.
