- ComfyUI works because it exposes the full inference pipeline — giving you granular control over guidance, resolution up to 2560x1440, 50fps output, and faster inference.
- Key workflow upgrades: use the Multimodal Guider to separate text guidance from cross-modal alignment, and save/load conditioning to reuse prompt encodings across batches without re-encoding.
- The most common quality killers are wrong aspect ratio, CFG scale too high, and overstuffed prompts — fix those first before tuning anything else.
5 million downloads and counting. That's how many times developers have pulled LTX-2 and LTX-2.3 from HuggingFace since we released the open weights.
But here's what surprised us: most of them are still using the desktop app for serious work, when the ComfyUI nodes—the official ones we built—actually produce better results in most cases. The difference isn't small. We're talking sharper details, more consistent motion, faster inference, and way more control over every parameter that matters.
The reason is simple: ComfyUI gives you granular access to the inference pipeline. You can tune text guidance separately from cross-modal alignment. You can reuse prompt encodings across batches. You can push resolution to 2560x1440 and framerate to 50fps without fighting a GUI. Desktop simplifies things, but it hides the levers that make the difference.
This guide walks you through every step: installation, optimal settings for text-to-video and image-to-video, the nodes that save you VRAM and time, and the exact mistakes that tank quality.
By the end, you'll have a reproducible workflow that outperforms the desktop app.
Why ComfyUI Outperforms Desktop for LTX-2.3 Video Generation
The question keeps coming up in the community: why does the same model produce noticeably different results in ComfyUI versus Desktop?
The answer is pipeline architecture. Desktop uses a Python pipeline that bundles decisions—it encodes text, computes guidance, runs inference, and decodes the video in a fixed sequence. Solid. Fast. But fixed. You get good defaults, not optimal settings for your specific use case.
ComfyUI exposes the actual pipeline as discrete nodes. That means you control encoding, guidance, and inference separately. You can experiment. You can see what each parameter does. You can iterate without re-encoding text every time.
Here's what this unlocks:
Better quality at higher resolution. The community has confirmed that pushing to 2560x1440 and 50fps dramatically improves detail and motion smoothness. Desktop's GUI isn't optimized for these settings. ComfyUI lets you dial in exactly what your hardware can handle.
Reusable conditioning. The LTXVSaveConditioning and LTXVLoadConditioning nodes let you encode a prompt once, then reuse that encoding across multiple inference runs. You save compute, reduce latency, and keep results consistent.
Granular guidance control. The Multimodal Guider node separates text guidance from cross-modal alignment. You can dial up motion fluidity without overfitting to the prompt. That's the difference between creepy over-constrained motion and natural, believable movement.
Faster inference. The January 2026 update brought significant speed improvements to the ComfyUI nodes, primarily because ComfyUI skips GUI overhead and lets you reuse conditioning across runs. Desktop hasn't caught up.
The official ComfyUI-LTXVideo nodes use the latest VAE and the correct inference pipeline. That's why they beat Desktop. It's not magic. It's access.
Setting Up LTX-2.3 in ComfyUI
Prerequisites and Installation
You need three things running:
- ComfyUI (latest version). If you don't have it, clone from the official repo and install dependencies.
- ComfyUI-LTXVideo nodes (the official Lightricks package). These don't come with ComfyUI by default, so you'll install them separately.
- The LTX-2.3 model weights from HuggingFace. You'll specify the model ID when you set up your nodes.
Technically you can run this on CPU, but you'll wait 10 minutes per frame. Seriously. The full bf16 model requires 32GB+ VRAM — an A100 or RTX 6000 Ada gives you comfortable headroom. If you're on a 16GB or 24GB card, you'll need a quantized variant (GGUF or fp8) to fit the model in memory.
Installing ComfyUI-LTXVideo Nodes
The easiest route: use the ComfyUI Manager. Search for "LTXVideo," click Install, restart ComfyUI.
If that doesn't work, clone the repository directly into your ComfyUI custom_nodes folder:
git clone https://github.com/Lightricks/ComfyUI-LTXVideo.git
Restart ComfyUI. You'll see the LTX nodes under a new "Video" category in your node browser.
The model loads automatically the first time you use an LTX node. It downloads from HuggingFace and caches locally; the first load takes a few minutes depending on your connection. Subsequent loads are instant.
That's it. You're ready.
The Best ComfyUI Workflows for Text-to-Video
A text-to-video workflow is simple in structure but powerful in what it outputs: video from a prompt.
Here's the minimal setup:
- Gemma API Text Encoding node — Encodes your prompt into embeddings the model understands.
- LTXVTextToVideoSampler node — Runs the actual inference. Takes the embeddings, noise, and settings, outputs video frames.
- VAE Decode node — Converts latent space back to pixel space (the video you actually see).
- Video Combine node — Stitches frames into a video file.
That's the skeleton. Quality comes from the settings you feed into the sampler.
Optimal Settings for Quality
Resolution: Start at 1280x720. You get good results in 5-10 minutes on solid hardware. Once you're comfortable, push to 1920x1080 or 2560x1440. The detail boost is real. The time cost is linear.
Framerate: 24fps is the cinema standard. 30fps feels more fluid. 50fps is the smoothest the model supports and catches subtle motion detail. The longer your video, the more the higher framerate matters. For 8-second clips, it's worth it.
Steps: 50 is the default and balances quality and speed. 40 runs noticeably faster with minimal quality loss. 80 pushes quality further, but step count scales runtime roughly linearly. Start at 50, then adjust based on what you see.
CFG Scale (Guidance): This controls how strictly the model follows your prompt. 7.0 is the default. 5.0 feels more creative and sometimes more natural. 9.0 makes motion feel constrained. Anything above 12 usually looks weird. Stick to 5-8 for video.
Seed: Set it to a specific number if you want reproducibility, or to -1 for randomness. Video generation benefits from slightly different seeds within a series. If one clip looks off, reseed and rerun.
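To keep runs comparable, it helps to pin these values in one place before you start tweaking. A minimal Python sketch, where the parameter names are illustrative and not the literal ComfyUI node socket names:

```python
import random

# Baseline text-to-video settings from this guide (illustrative names,
# not the actual ComfyUI node inputs).
BASELINE_T2V = {
    "width": 1280,
    "height": 720,
    "fps": 24,
    "steps": 50,
    "cfg_scale": 7.0,
}

def with_seed(settings, seed=-1):
    """Return a copy of the settings with a concrete seed.

    seed=-1 means 'randomize': we draw one and record it, so even a
    'random' run can be reproduced after the fact.
    """
    out = dict(settings)
    out["seed"] = seed if seed != -1 else random.randrange(2**32)
    return out

run = with_seed(BASELINE_T2V, seed=42)
```

The point of recording the drawn seed is that a run you liked by accident is still a run you can repeat.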
Using the Multimodal Guider
This is the secret weapon. The Multimodal Guider node lets you separate text guidance from cross-modal alignment.
Here's why this matters: sometimes your prompt pulls the video in one direction (fidelity to the words) and cross-modal alignment pulls it in another (consistency across frames and with the prompt's semantic meaning). If you crank text guidance without tuning cross-modal, you get jittery, overfitted motion. If you lean too hard on cross-modal, you lose prompt adherence.
The Multimodal Guider lets you dial them separately.
Text Guidance: How much the model should prioritize your exact words. Default is 7.0. Go lower (4-6) for more creative interpretation. Go higher (8-10) for stricter adherence.
Cross-Modal Alignment: How well the video should be internally consistent and semantically coherent with the prompt. Default is 1.0. Increasing this (1.5-2.0) smooths motion and reduces flicker. Going above 2.0 usually backfires.
If your video looks jittery or has temporal artifacts, increase cross-modal alignment. If it doesn't follow your prompt, increase text guidance. Don't max both.
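The underlying idea is a two-term extension of classifier-free guidance: each signal pushes the prediction away from the unconditional output independently, each with its own scale. Whether the Multimodal Guider combines its signals in exactly this form is an assumption on our part; this toy scalar sketch just shows why the two knobs can be tuned separately:

```python
def guided_prediction(uncond, text_cond, crossmodal_cond,
                      text_scale=7.0, crossmodal_scale=1.0):
    """Combine two guidance signals, classifier-free-guidance style.

    Each conditional prediction is compared to the unconditional one,
    and the differences are scaled independently. Scalars stand in for
    latent tensors to keep the sketch runnable.
    """
    return (uncond
            + text_scale * (text_cond - uncond)
            + crossmodal_scale * (crossmodal_cond - uncond))

# With the defaults, the text term dominates the output:
pred = guided_prediction(0.0, 1.0, 0.5, text_scale=7.0, crossmodal_scale=1.0)
```

Because the two difference terms are scaled independently, raising one never forces the other up with it, which is exactly the over-constrained-motion trap the single CFG knob falls into.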
ComfyUI Image-to-Video Workflow with LTX-2.3
Image-to-video is text-to-video's practical cousin. You start with a still image, add a motion prompt, and the model generates video that extends from that starting frame.
The workflow is nearly identical to text-to-video, except you feed the image into an image loader and the LTXVImageToVideoSampler instead of the text sampler.
- Image Loader node — Points to your starting image (PNG, JPG, 16:9 ratio works best).
- Resize Image node — Scales to your target resolution (1920x1080, 2560x1440, etc.).
- Gemma API Text Encoding node — Encodes your motion prompt. Keep it short: "camera pans left, water ripples, gentle motion."
- LTXVImageToVideoSampler node — Generates video from the image and motion prompt.
- VAE Decode and Video Combine — Same as text-to-video.
4 Steps to Better Motion
Step 1: Motion Prompts, Not Scene Prompts
Your prompt should describe motion, not the scene. Bad: "a beautiful mountain landscape." Good: "camera slowly zooms in, clouds drift across the sky, light rays through trees."
Image-to-video assumes you already have the scene. You're adding movement. Write accordingly.
Step 2: Dial Down CFG Scale
Start at 5.0, not 7.0. The model already has the image as ground truth. It doesn't need as much guidance. High CFG can distort the image into something unrecognizable.
Step 3: Increase Steps for Smooth Motion
Image-to-video benefits from more steps than text-to-video. Use 60-80 instead of 50. The extra cycles smooth motion and reduce jitter at the image-video boundary.
Step 4: Higher Frame Rate Helps
50fps is worth it for image-to-video. It makes the transition from the static image to video motion feel less jarring.
Resolution and Frame Rate Settings
Your image resolution should match your target video resolution. If your image is 1280x720 and you target 2560x1440, you'll get upsampling artifacts. Prep your image first.
The model works best at widescreen aspect ratios (16:9, 21:9). Portrait and square often produce distorted results.
Framerate: match whatever your downstream pipeline expects. YouTube likes 24fps. Social media and streaming apps prefer 30fps or 60fps. 48fps gives you the most flexibility in post: encode once at 48fps, then downconvert cleanly to 24fps later if needed.
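Downconverting by an integer factor is just dropping frames, which is why encoding once at a high rate and decimating later is cheap. A minimal sketch of that decimation:

```python
def downconvert(frames, src_fps, dst_fps):
    """Drop frames to reduce the frame rate by an integer factor.

    Only exact integer ratios are handled (e.g. 48 -> 24, 60 -> 30);
    anything else needs real temporal resampling, which is out of
    scope for this sketch.
    """
    if src_fps % dst_fps != 0:
        raise ValueError("only integer ratios are handled in this sketch")
    step = src_fps // dst_fps
    return frames[::step]

# 10 frames at 48fps -> every other frame survives at 24fps
half = downconvert(list(range(10)), 48, 24)
```

Non-integer conversions (say 50fps to 24fps) need frame blending or optical-flow interpolation in a video tool, which is why picking a source rate that divides cleanly saves you a step.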
Advanced Nodes That Save Time and VRAM
Once you understand the basics, these nodes unlock serious efficiency.
Gemma API Text Encoding
By default, ComfyUI-LTXVideo uses a local Gemma encoder. It works but ties up your GPU. The Gemma API Text Encoding node offloads encoding to Lightricks' free API instead.
Why use it:
- Encodes in under 1 second, regardless of GPU.
- Frees up your GPU for the expensive sampling step.
- Stays free indefinitely (no token limits, no soft limits).
- Perfect if you're iterating on settings and rerunning inference multiple times.
Drop the Gemma API Text Encoding node in place of the local encoder, pass your prompt, and it returns embeddings ready for the sampler. That's it.
Save and Load Conditioning
Here's a workflow superpower: encode your prompt once, save it, then reuse it across 50 different inference runs without re-encoding.
LTXVSaveConditioning node — Takes the encoded embeddings from your text encoder and saves them to disk as a .npy file.
LTXVLoadConditioning node — Loads that .npy file in future runs.
Why do this? Because encoding is expensive: a few seconds even on a GPU. If you're testing 10 different sampler settings with the same prompt, you encode once, then load the conditioning 10 times. You save seconds per run. Across a day of iteration, that adds up to hours.
It also makes your workflow reproducible. Same encoding, same random seed, same settings = identical output every time.
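The same caching idea can be sketched outside ComfyUI: key saved embeddings by a hash of the prompt and only encode on a cache miss. This is a conceptual stand-in, not the node implementation, and it uses pickle where the real nodes write .npy files:

```python
import hashlib
import pickle
from pathlib import Path

def cached_encode(prompt, encode_fn, cache_dir):
    """Encode a prompt once, then reuse the saved result on later runs.

    encode_fn stands in for the (expensive) text encoder; the cache key
    is a hash of the prompt, so the same prompt never encodes twice.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(prompt.encode()).hexdigest()
    path = cache_dir / f"{key}.pkl"
    if path.exists():
        return pickle.loads(path.read_bytes())
    embedding = encode_fn(prompt)          # the expensive step, runs once
    path.write_bytes(pickle.dumps(embedding))
    return embedding
```

One design note: hashing the prompt rather than using it as a filename sidesteps length limits and special characters, and it makes the cache hit deterministic, which is what keeps results consistent across runs.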
Common Mistakes and How to Fix Them
Mistake 1: Ignoring Aspect Ratio
You feed a 1:1 square image into LTX and expect cinematic output. You get distorted, weird motion. LTX is trained on 16:9 footage, so stick to that ratio and prep your images accordingly.
Fix: Crop or pad images to 16:9 before inference.
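Computing the crop is simple arithmetic. A sketch that returns a centered 16:9 crop box as (left, top, right, bottom), which you could feed to any image library's crop call:

```python
def center_crop_16_9(width, height):
    """Return a centered (left, top, right, bottom) crop box at 16:9.

    Crops whichever dimension is too large; images already at 16:9
    come back untouched. Integer division means the result can be a
    pixel narrower than exact 16:9, which is harmless in practice.
    """
    target_h = width * 9 // 16
    if target_h <= height:
        # Image is too tall (or already 16:9): trim top and bottom.
        top = (height - target_h) // 2
        return (0, top, width, top + target_h)
    # Image is too wide: trim left and right instead.
    target_w = height * 16 // 9
    left = (width - target_w) // 2
    return (left, 0, left + target_w, height)

box = center_crop_16_9(1000, 1000)  # square input gets letterbox-cropped
```

Padding instead of cropping follows the same arithmetic in reverse; cropping is usually the better default because padded bars tend to leak into the generated motion.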
Mistake 2: Text Guidance Too High
You set CFG scale to 15 because you want the model to "really follow your prompt." The video becomes stilted, over-constrained, and motion looks robotic.
Fix: Start at 7.0. Go lower (5-6) for creative freedom. Go higher (8-10) only if the model is completely ignoring your prompt.
Mistake 3: Cramming Too Much into Your Prompt
"A woman walks through a misty forest at sunset, camera zooms in and out, birds fly overhead, light rays pierce the trees, she stops and looks at the camera while smiling and raising her hand." That's 5 different things happening. LTX can't choreograph all of that.
Fix: Keep prompts to 1-2 actions max. Let the model handle the details. "Woman walks through misty forest, camera slowly zooms" is enough.
Mistake 4: Low Resolution, Then Upscaling
You generate at 1280x720, then run it through an upscaler, expecting 4K quality. You get artifacting and blur.
Fix: Generate at your target resolution from the start. 2560x1440 takes maybe 30% longer than 1080p. Worth it.
Mistake 5: Not Setting a Seed
You generate a great video, then regenerate with the same settings and get something completely different because the seed randomized. Now you can't reproduce it.
Fix: Always set a seed to a specific number if reproducibility matters. Set to -1 only when you want variation.
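The principle is the same anywhere random numbers appear: a fixed seed makes the generator's sequence, and therefore the output, deterministic. In plain Python, with a toy function standing in for the sampler:

```python
import random

def noisy_sample(seed):
    """Stand-in for a sampler: same seed in, same 'noise' out."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(4)]

fixed_a = noisy_sample(42)
fixed_b = noisy_sample(42)   # identical to fixed_a: reproducible
varied = noisy_sample(None)  # seeded from system entropy: varies per run
```

Passing None here plays the role of -1 in the node: you get variation, but you also lose the ability to regenerate that exact clip later.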
Mistake 6: Using Desktop When You Need Precision
You need repeatable, tweakable results. Desktop's GUI isn't designed for that. ComfyUI is.
Fix: Commit to the node workflow. Yes, it's a bit steeper to learn. The payoff in control and quality is enormous.
Conclusion
LTX-2.3 is one of the best open-weight video models available. But its quality ceiling is determined by the pipeline you run it through. ComfyUI-LTXVideo gives you that pipeline—the same one we use internally, with the same VAE and inference logic that powers our products.
Start with the text-to-video workflow. Get comfortable with resolution, framerate, and CFG scale. Then move to image-to-video and the advanced nodes. By your tenth workflow, you'll have a repeatable process that beats what most people get from the desktop app.
Your hardware will thank you. Your clients will notice. And the community will benefit when you share what you learned.
Ready to start? Install the ComfyUI-LTXVideo nodes today. Your first workflow is just a few nodes away.