- LTX-2.3 uses temporal masking via the LTXV Set Audio Video Mask By Time node to enforce lip sync — locking the audio latent forces generated video frames to physically align with the audio, driving mouth movement to match phonemes.
- Frame count must be calculated precisely before tuning any other parameter: use audio_duration_seconds × 25, then round to the nearest valid 8n+1 value. An incorrect frame count is the most common cause of sync drift.
- Keep text prompts focused on visual appearance, lighting, and character description — never describe motion or speech pacing, as these conflict with audio conditioning and degrade sync quality.
Understanding Lip Sync in AI Video Generation
Lip synchronization isn't just cosmetic. Dialogue-heavy content—voiceovers, interviews, character dialogue in narrative films—requires mouth movements that align with audio timing to within a frame or two. When audio and video drift, viewers immediately perceive the mismatch. The content feels artificial, and trust erodes.
Why is this hard? Video generation models traditionally operated in pixel space or visual latent space. Audio lived in a completely different domain—mel-spectrograms, waveforms, or audio embeddings. Conditioning one modality on the other required architectural bridges that earlier models either lacked or implemented weakly.
LTX-2.3 changed this. The model was trained on large-scale audio-video pairs with explicit conditioning on audio structure. When you feed audio into the generation pipeline, the model doesn't guess how mouths should move—it directly samples video that respects the acoustic timing of the input audio.
LTX-2.3's Lip Sync Mechanism: Temporal Masking
The core lip sync mechanism in LTX-2.3 is temporal masking via the LTXV Set Audio Video Mask By Time node. Understanding how this works is essential to achieving consistent, high-quality sync.
How Temporal Masking Works
The workflow encodes the reference image and audio into latent space separately — the image via LTXVImgToVideoInplace (or VAE Encode for image-to-video), and the audio via LTXV Audio VAE Encode. These latents are then merged using LTXVConcatAVLatent, and the combined latents are passed into the mask node with mask_audio=False and mask_video=True.
Setting mask_audio=False is the critical step: it freezes the audio latent, turning it into a direct physical constraint on the diffusion process. Because LTX-2 is a joint audio-video model, locking the audio forces the generated video frames to align with the audio — driving mouth movement to match phonemes frame by frame. This is the core mechanic that makes lip sync work.
The first sampling pass uses LTXVNormalizingSampler. For the final export step, decode the video latents and connect the original clean audio file directly to VHS Video Combine — bypassing the AI-processed audio preserves quality in the final MP4.
The MultimodalGuider's Role
The MultimodalGuider is present in the workflow and exposes per-modality control via cfg, stg, and modality_scale parameters. It is not, however, the primary lip sync control — that is temporal masking. Think of the MultimodalGuider as controlling the overall guidance balance, while LTXV Set Audio Video Mask By Time is what actually enforces audio-video alignment at the frame level.
Step-by-Step: Setting Up Audio-to-Video for Lip Sync in ComfyUI
LTX-2.3's audio-to-video workflow is built for ComfyUI. This section walks you through each step with specific node names and configurations.
Preparing Your Audio Input
Audio quality directly impacts sync quality. Before feeding audio into the model, prepare it correctly:
- Format: WAV or FLAC, 16-bit or 24-bit, mono channel. The model expects single-channel audio — stereo tracks should be mixed down to mono before input.
- Sample rate: 24 kHz is the native conditioning rate. Audio at other sample rates (44.1 kHz, 48 kHz) will be resampled automatically, but starting at 24 kHz avoids interpolation artifacts.
- Duration: Match your target video duration exactly. Mismatched duration forces the pipeline to stretch or truncate, degrading sync.
- Content: Clean dialogue or voiceover with minimal background noise. Remove room tone, breath artifacts, and long silences (>2 seconds) between phrases.
- Levels: Normalize to -3 dB to -6 dB peak. Avoid clipping (above 0 dB) or excessively quiet passages (below -30 dB).
For music or ambient audio, less pre-processing is needed. The model is more tolerant of variation in non-speech content.
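A quick way to verify this checklist before loading audio into ComfyUI is a small stdlib-only script. This is an illustrative sketch, not part of any LTX tooling — the function names and the exact level thresholds are assumptions drawn from the checklist above, and the peak check is only implemented for 16-bit PCM.

```python
import math
import struct
import wave


def peak_dbfs(samples, sample_width_bytes=2):
    """Peak level in dBFS for a sequence of signed integer PCM samples."""
    full_scale = float(2 ** (8 * sample_width_bytes - 1))
    peak = max((abs(s) for s in samples), default=0)
    return 20.0 * math.log10(peak / full_scale) if peak else float("-inf")


def check_wav(wav_file):
    """Return a list of problems; an empty list means the file matches the checklist."""
    problems = []
    with wave.open(wav_file, "rb") as w:
        channels, rate, width = w.getnchannels(), w.getframerate(), w.getsampwidth()
        raw = w.readframes(w.getnframes())
    if channels != 1:
        problems.append("not mono: mix stereo down to a single channel")
    if rate != 24000:
        problems.append(f"sample rate is {rate} Hz; 24 kHz avoids resampling artifacts")
    if width not in (2, 3):
        problems.append("use 16-bit or 24-bit PCM")
    if width == 2 and channels == 1:
        # Unpack little-endian signed 16-bit samples and check peak level.
        samples = struct.unpack(f"<{len(raw) // 2}h", raw)
        peak = peak_dbfs(samples, 2)
        if peak > -3.0:
            problems.append(f"peak {peak:.1f} dBFS is hot; normalize to -3 to -6 dB")
        elif peak < -30.0:
            problems.append(f"peak {peak:.1f} dBFS is too quiet; raise the level")
    return problems
```

Running `check_wav` on your prepared file before every render is cheap insurance against silent resampling or level issues downstream.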
Calculating Frame Count (Do This Before Anything Else)
This is the most important step in the entire workflow. An incorrect frame count is the most common cause of sync drift — more common than any parameter misconfiguration. Do not skip this.
LTX-2.3 processes audio at exactly 25 latents per second. Your frame count must be calculated to match your audio duration precisely, and it must satisfy the 8n+1 rule: valid frame counts are 1, 9, 17, 25, 33... 97, 121, 161, 257, etc. (values of the form 8n+1).
Formula: total_frames = audio_duration_seconds × 25, then round to the nearest valid 8n+1 value.
Example: 5 seconds of audio → 125 frames → round to 121 (the nearest valid 8n+1 value below 125).
Why this matters: If your frame count doesn't match the audio length, sync will drift progressively across the video — it starts aligned and slowly falls out of time. If your frame count is not a valid 8n+1 value, you'll encounter pipeline errors or unexpected output. Set this correctly before touching any other parameter.
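The rounding step is easy to get wrong by hand. A minimal helper (hypothetical, not part of the workflow files) that applies the formula and snaps to the nearest 8n+1 value, rounding down on ties to match the worked example (125 → 121):

```python
def frame_count(audio_seconds: float, latents_per_second: int = 25) -> int:
    """Nearest valid 8n+1 frame count for a given audio duration.

    LTX-2.3 processes audio at 25 latents per second, and frame counts
    must satisfy the 8n+1 rule (1, 9, 17, 25, ...). Ties round down,
    matching the worked example in the text (5 s -> 125 -> 121).
    """
    raw = audio_seconds * latents_per_second
    lower = max(int((raw - 1) // 8) * 8 + 1, 1)  # largest 8n+1 value <= raw
    upper = lower + 8                            # smallest 8n+1 value > raw
    # Pick whichever is closer; on a tie, prefer the lower value.
    return lower if (raw - lower) <= (upper - raw) else upper
```

For example, `frame_count(5.0)` gives 121, matching the 5-second example above, and `frame_count(4.0)` gives 97.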
Loading the Workflow in ComfyUI
There is no official LTX audio-to-video workflow JSON in the LTX-2 repository or the ComfyUI-LTXVideo repository. The community has filled this gap — the widely recommended starting points are Purzbeats' Audio-to-Video Extension workflow (ltx2-audio_to_video_extension_5x.json) and Benji's AI Playground tutorial, both of which are built around the correct temporal masking mechanism.
Download one of these community workflows and open it in ComfyUI.
The workflow includes pre-configured nodes for:
- Audio loading and preprocessing (resampling, mono conversion)
- Image and audio encoding into separate latent spaces
- Latent concatenation and temporal masking
- Normalized sampling for the first pass
- VAE decoding for final video output
- Clean audio connection to video combine for export
Critical nodes to identify:
- LTXVCheckpointLoader — Loads the LTX-2.3 checkpoint (Distilled or Full)
- LTXV Audio VAE Encode — Encodes raw audio into the latent space
- LTXVConcatAVLatent — Merges image and audio latents
- LTXV Set Audio Video Mask By Time — The core lip sync control (set mask_audio=False, mask_video=True)
- LTXVNormalizingSampler — The sampling node for the first pass
- VHS Video Combine — Final export node; connect your original clean audio here directly
Configuring the Audio Conditioning Pipeline
Once the workflow is open, configure these settings:
Audio and Image Encoding:
- Load your prepared WAV/FLAC file using the LoadAudio node
- Connect it to the LTXV Audio VAE Encode node
- Load your reference image and connect it to LTXVImgToVideoInplace (or VAE Encode for I2V)
- Merge both encoded latents using LTXVConcatAVLatent
Temporal Masking Configuration:
- Connect the combined latents to LTXV Set Audio Video Mask By Time
- Set mask_audio=False — this freezes the audio latent as a physical constraint
- Set mask_video=True
- This configuration is what drives lip sync
Frame Count:
- Calculate your frame count using the formula above before proceeding
- Set the frame count in the sampler node to your calculated 8n+1 value
- Verify this is correct before rendering
Frame Rate Configuration:
- Default frame rate is 24 fps, which matches the model's training data distribution
- 24 fps provides the best lip sync accuracy
- Set frame rate in the sampler or video output node — ensure it matches the rate you used to calculate total frames
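Before rendering, it is worth a quick sanity check that your frame count, export frame rate, and audio duration agree. The sketch below is illustrative — the function name and the two-frame tolerance are assumptions, not values from the workflow:

```python
def check_timing(total_frames: int, fps: float, audio_seconds: float,
                 tolerance_frames: float = 2.0) -> bool:
    """True if the video duration (frames / fps) stays within
    `tolerance_frames` of the audio track's length."""
    drift_seconds = total_frames / fps - audio_seconds
    return abs(drift_seconds) * fps <= tolerance_frames
```

For instance, 121 frames exported at 24 fps against 5 seconds of audio passes (about one frame of slack), while an oversized count like 161 frames fails immediately — catching the mistake before a long render.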
Final Export:
- In the VHS Video Combine node, connect your original clean audio file directly — do not use the AI-processed audio output. This preserves audio quality in the final MP4.
Tuning Parameters for Optimal Mouth Synchronization
Once your workflow is correctly configured with temporal masking and the right frame count, parameter tuning refines output quality.
Model and Sampling Configuration
For audio-to-video lip sync specifically, community testing has validated the following:
- Use the distilled model — not the full model. For this workflow, the distilled model at higher CFG and more steps outperforms what the general guidance would suggest.
- CFG: 4 — this is the recommended value for audio-to-video. It yields the best prompt adherence and tightest lip sync.
- Steps: 40 — the validated step count for audio-to-video with the distilled model.
Note: these are workflow-specific findings. The general guidance for LTX-2.3 (full model, lower CFG) does not apply here — the audio-to-video pipeline behaves differently.
Classifier-Free Guidance (CFG)
Recommended for lip sync content: 4
CFG affects how the model balances prompt instructions with audio conditioning. For audio-to-video workflows, CFG 4 with the distilled model produces the best results. Stay within the documented range: the ceiling for the full model is 5.0; anything above that is outside the documented operating range and should be avoided.
Eta (Denoising Strength): Refinement Control
Eta controls how much refinement happens during the sampling process.
Tested range for audio-video sync: 0.1–0.3. Community testing suggests 0.2 is a reliable starting point.
MultimodalGuider Parameters
The MultimodalGuider exposes cfg, stg, and modality_scale per modality. These are not lip sync controls — lip sync is controlled by the temporal masking step. Adjust these only if you have a specific reason to tune the guidance balance; leave them at defaults if sync quality is your primary concern.
Norming Threshold: Audio Influence Calibration
Norming threshold calibrates how strongly audio conditioning influences the generation process.
Typical range: 0.1–0.5
If you notice sudden jumps in mouth position or temporal artifacts, adjusting norming threshold can smooth them out. Start with default (typically 0.2–0.3) and only adjust if you observe issues.
Parameter Tuning Workflow
Start with these defaults (distilled model):
- CFG: 4
- Steps: 40
- Eta: 0.2
Generate a preview: Render at low resolution (480p) with a short clip duration and watch the output carefully.
If lip sync is tight and natural: You're done. Scale to final resolution.
If audio lags behind video (mouth movements come early): Verify your frame count first — this is the most likely cause. Recalculate and re-render before adjusting any other parameter.
If video lags behind audio (mouth movements come late): Again, check frame count first. If frame count is correct, verify mask_audio=False is set correctly in LTXV Set Audio Video Mask By Time.
If mouth movement looks stiff or unnatural: Increase eta slightly (to 0.25). Regenerate.
If temporal artifacts appear (frame jumps, repeated frames): Increase norming threshold to 0.3–0.4. Regenerate.
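The diagnostic branches above can be sketched as a small classifier over a measured audio/video offset at the start and end of a clip. This is an illustrative helper, not part of the workflow — the function name, sign convention, and the 40 ms tolerance are all assumptions:

```python
def diagnose_sync(offset_start_ms: float, offset_end_ms: float,
                  drift_tolerance_ms: float = 40.0) -> str:
    """Classify a measured mouth/audio offset at the start and end of a clip.

    An offset that grows across the clip is progressive drift, whose most
    likely cause is an incorrect frame count. A roughly constant offset
    points at masking or audio-alignment issues instead.
    """
    if abs(offset_end_ms - offset_start_ms) > drift_tolerance_ms:
        return "progressive drift: recalculate the 8n+1 frame count first"
    if abs(offset_start_ms) > drift_tolerance_ms:
        return "constant offset: check mask_audio=False and the audio track alignment"
    return "in sync"
```

The ordering mirrors the troubleshooting priority in this guide: frame count first, masking configuration second.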
Practical Workflow: Text Prompt Strategy for Lip Sync Content
Your text prompt shapes everything from character appearance to lighting to background. But for lip sync content, your prompting strategy must prioritize visible mouth movement and facial expression over motion descriptions.
What to Include in Prompts
Character mouth and face: Describe mouth shape, teeth visibility, and lip texture. This helps the model generate detailed mouth regions that sync well with audio.
Example: "Close-up of a woman's face, lips slightly parted, natural skin texture, neutral expression ready to speak."
Dialogue context: Hint at the emotional tone or intensity of speech. Is dialogue urgent, calm, emotional, or informative?
Example: "Actor delivering intense, urgent dialogue with visible emotion in facial expression."
Visual style and lighting: Define how the scene looks—professional studio lighting, natural outdoor light, cinematic color grading.
Example: "Soft studio lighting, warm color grading, shallow depth of field on face."
Background and environment: Keep this minimal for dialogue-focused scenes. A cluttered background distracts from mouth movement.
Example: "Neutral blurred background, focus sharp on face."
Full example prompt for dialogue scene:
Close-up of man's face, mouth naturally positioned, professional studio lighting with soft key light, neutral background, warm color grading, ready to deliver dialogue with clear articulation and visible lip movement. Natural skin texture, minimal makeup, professional appearance.
What to Avoid in Prompts
Do not describe motion, camera movement, or temporal action in dialogue-heavy lip sync content. The temporal masking mechanism already defines temporal behavior. Prompt descriptions of motion conflict with audio guidance and degrade sync quality.
Avoid these:
- "A person speaking quickly with animated gestures"
- "Camera slowly zooms in as the speaker delivers the line"
- "The character nods while talking and then turns away"
- "Rapid speech with excited hand movements"
- "Slow pan across the speaker's face as they speak softly"
Why? These motion descriptions are text-based. The audio input defines actual timing. If your prompt says "speaks rapidly" but your audio contains slow, measured speech, the conflict degrades output quality.
Instead, let audio define the pacing and motion, and use prompts purely for visual appearance.
Concrete Example Prompts
Scenario: Professional voiceover / narrator
Professional male voice actor, close-up of face from shoulders up, professional broadcast studio setting, crisp studio lighting, neutral background, dark suit and tie, serious professional expression, clear articulation visible in mouth movement, high-definition broadcast quality, warm professional color grading.
Scenario: Emotional dialogue / character acting
Woman's face in close-up, natural theatrical lighting, expressive eyes and mouth ready for emotional performance, neutral background with warm amber glow, beautiful natural skin, soft lighting on face, professional film quality, moment of emotional intensity visible in facial readiness.
Scenario: Educational content / instruction
Instructor at desk or lectern, face clearly visible, friendly and approachable expression, bright natural studio lighting, professional educational setting background, crisp video quality, mouth articulation clear and visible for educational delivery.
Troubleshooting: Common Lip Sync Issues and Solutions
Even with correct configuration, edge cases occur. This section diagnoses common issues and provides targeted fixes.
Problem: Mouth Out of Sync with Dialogue
Diagnosis: Mouth movements visibly lag or lead the audio. Phonemes (mouth shapes) don't match speech sounds.
Causes (check in this order):
- Incorrect frame count — by far the most common cause. If your frame count doesn't match your audio length or isn't a valid 8n+1 value, sync will drift. Verify using the formula: total_frames = audio_duration_seconds × 25, rounded to the nearest 8n+1 value.
- mask_audio is not set to False in the LTXV Set Audio Video Mask By Time node — the audio latent won't be frozen and sync won't be enforced
- Audio contains significant background noise or music that competes with the speech signal
- Audio sample rate mismatch causing resampling artifacts
Solutions (in order of likelihood):
- Recalculate and correct your frame count first
- Verify mask_audio=False and mask_video=True in the mask node
- Clean your audio — isolate the dialogue track, remove background noise
- Resample audio to 24 kHz before loading into the workflow
- If using I2V, ensure the reference image shows the character with a neutral or slightly open mouth — a closed-mouth reference frame can bias the model toward less mouth movement
Problem: Mouth Artifacts or Frame Jumps
Diagnosis: Mouth position suddenly jumps, frames repeat, or temporal jitter appears.
Causes:
- Norming threshold too low, amplifying small audio variations into large visual changes
- Audio contains sharp transients, plosives, or clipping that create sudden conditioning spikes
- Insufficient denoising steps, causing incomplete refinement of mouth detail
Solutions:
- Increase norming threshold to 0.3–0.4 to dampen audio conditioning spikes
- Apply de-essing and plosive reduction to your audio track before loading
- Increase eta slightly (to 0.25) for more refinement
- Increase total sampling steps for finer temporal resolution
Problem: Audio Dropout or Clipping in Output
Diagnosis: Audio is missing, very quiet, or distorted in the generated video file.
Causes:
- Input audio levels too low or clipped
- Incorrect audio format
- Audio file corrupted or truncated
- AI-processed audio used in the final export instead of the original clean file
Solutions:
- Normalize audio to -3 dB to -6 dB peak using Audacity or ffmpeg
- Re-export audio as 16-bit or 24-bit WAV at 24 kHz mono
- Avoid MP3 or other lossy formats — use WAV or FLAC
- Verify you are connecting the original clean audio file directly to VHS Video Combine in the export step
Problem: Overly Stiff or Unnatural Mouth Movement
Diagnosis: Mouth shapes look mechanical, articulation is over-pronounced, or movements lack fluidity.
Causes:
- Eta too low (below 0.1), insufficient refinement to add natural detail to mouth movement
- Text prompt describes specific mouth positions or speech patterns, conflicting with audio conditioning
- CFG too high (above 5.0), over-constraining visual generation
Solutions:
- Increase eta to 0.2–0.25 for additional refinement detail
- Remove any motion or speech descriptions from your text prompt — let audio define all temporal behavior
- Keep CFG at 4 for audio-to-video workflows
Comparison: Image-to-Video vs Text-to-Video for Lip Sync
LTX-2.3 supports both image-to-video (I2V) and text-to-video (T2V) audio conditioning. Each has trade-offs for lip sync quality.
Practical recommendation: For critical dialogue where lip sync must be perfect, use I2V with a carefully generated first frame. The visual anchor reduces variation and improves sync consistency. For iteration and exploration, use T2V with clear prompts.
Optimization: Distilled vs Full Model for Mouth Sync
LTX-2.3 offers two model variants optimized for different stages of production:
Distilled Model: Smaller, faster, lower VRAM requirements. For audio-to-video lip sync specifically, the distilled model at CFG 4 and 40 steps is the recommended configuration — not just for iteration, but for final renders. Community testing specific to this workflow finds it outperforms the full model for prompt adherence and sync tightness.
Full Model: Larger, slower, higher VRAM. Superior detail in many workflows, but for audio-to-video the distilled model is preferred.
When to Use Distilled
For audio-to-video lip sync, use the distilled model as your primary production model. Generate low-resolution previews (480p, short duration) to test parameter configurations quickly, then scale to final resolution with the same model and parameters.
Typical iteration cycle:
- Set parameters: CFG 4, steps 40, eta 0.2
- Calculate frame count before rendering
- Render at 480p for a short clip duration
- Review sync quality — watch with audio, check that mouth shapes match phonemes
- Adjust eta or norming threshold if needed
- Re-render and compare. Repeat until sync is tight and natural
- Scale to final resolution keeping all parameters identical
Advanced: Fine-Tuning Audio Preprocessing
For production-grade lip sync, audio quality matters as much as model configuration. Professional audio preparation increases sync consistency and reduces artifacts.
Preparing Clean Audio Tracks
Noise reduction: Use Audacity or similar tools to remove background hum, room noise, and wind. The model learns to sync with clean vocal/dialogue content; background noise confuses the audio conditioning.
Normalization: Set peak audio levels to -3 dB to -6 dB. Avoid clipping (levels above 0 dB). Avoid extremely quiet passages (below -30 dB).
Silence trimming: Remove or shorten long silences between speech. The model can handle brief pauses (0.5–1 second), but gaps longer than 2 seconds can cause the model to lose sync coherence.
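If you want to locate long silences programmatically before editing, a pure-Python pass over decoded PCM samples is enough. This is a sketch under stated assumptions — the amplitude threshold is arbitrary and should be tuned per track, and the helper name is hypothetical:

```python
def long_silences(samples, sample_rate, threshold=500, min_gap_seconds=2.0):
    """Find runs of near-silence longer than `min_gap_seconds`.

    `samples` are signed PCM values; a sample counts as silent when its
    absolute amplitude is below `threshold` (an assumption - tune per track).
    Returns (start_s, end_s) tuples of gaps you may want to trim or shorten.
    """
    gaps = []
    run_start = None
    for i, s in enumerate(samples):
        if abs(s) < threshold:
            if run_start is None:
                run_start = i  # a silent run begins here
        else:
            if run_start is not None and (i - run_start) / sample_rate >= min_gap_seconds:
                gaps.append((run_start / sample_rate, i / sample_rate))
            run_start = None
    # Handle a silent run that continues to the end of the track.
    if run_start is not None and (len(samples) - run_start) / sample_rate >= min_gap_seconds:
        gaps.append((run_start / sample_rate, len(samples) / sample_rate))
    return gaps
```

Pauses shorter than the 2-second cutoff are left alone, matching the guidance above that brief pauses are handled fine by the model.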
Dialogue vs Music vs Ambient Sound: Mixing Strategy for Sync
If your audio includes dialogue + background music, mix them intentionally.
Conclusion and Next Steps
Achieving tight lip synchronization in LTX-2.3 requires understanding three interconnected pieces: the temporal masking mechanism that drives audio-video alignment, the frame count math that prevents sync drift, and a prompt strategy that prioritizes visual appearance over motion description.
The workflow is straightforward once you internalize the core mechanic: encode image and audio separately, concatenate latents, freeze the audio with mask_audio=False — this is what forces the video to follow the audio. Get the frame count right first, then tune from there.
Start here:
- Set up the ComfyUI audio-to-video workflow (Purzbeats or Benji's AI Playground) with clean, 24 kHz mono audio
- Calculate your frame count: audio_seconds × 25, rounded to nearest 8n+1 value
- Configure the mask node: mask_audio=False, mask_video=True
- Use the distilled model: CFG 4, 40 steps, eta 0.2
- Write prompts that describe visual appearance only — no motion or speech pacing
- Connect your original clean audio directly to VHS Video Combine for the final export
Lip synchronization in AI video is no longer an unsolved problem. With LTX-2.3's joint audio-video architecture, temporal masking, and systematic parameter tuning, you can generate dialogue-heavy content with mouth-audio alignment that rivals traditional video production methods—and do it in minutes on local hardware.
