- ComfyUI works because it exposes the full inference pipeline — giving you granular control over guidance, resolution up to 2560x1440, 50fps output, and faster inference.
- Key workflow upgrades: use the Multimodal Guider to separate text guidance from cross-modal alignment, and save/load conditioning to reuse prompt encodings across batches without re-encoding.
- The most common quality killers are wrong aspect ratio, CFG scale too high, and overstuffed prompts — fix those first before tuning anything else.
5 million downloads and counting. That's how many times developers have pulled LTX-2 and LTX-2.3 from HuggingFace since we released the open weights.
But here's what surprised us: most of them are still using the desktop app for serious work, when the ComfyUI nodes—the official ones we built—actually produce better results in most cases. The difference isn't small. We're talking sharper details, more consistent motion, faster inference, and way more control over every parameter that matters.
The reason is simple: ComfyUI gives you granular access to the inference pipeline. You can tune text guidance separately from cross-modal alignment. You can reuse prompt encodings across batches. You can push resolution to 2560x1440 and framerate to 50fps without fighting a GUI. Desktop simplifies things, but it hides the levers that make the difference.
This guide walks you through every step: installation, optimal settings for text-to-video and image-to-video, the nodes that save you VRAM and time, and the exact mistakes that tank quality.
By the end, you'll have a reproducible workflow that outperforms the desktop app.
Why ComfyUI Outperforms Desktop for LTX-2.3 Video Generation
The question keeps coming up in the community: why does the same model produce noticeably different results in ComfyUI versus Desktop?
The answer is pipeline architecture. Desktop uses a Python pipeline that bundles decisions—it encodes text, computes guidance, runs inference, and decodes the video in a fixed sequence. Solid. Fast. But fixed. You get good defaults, not optimal settings for your specific use case.
ComfyUI exposes the actual pipeline as discrete nodes. That means you control encoding, guidance, and inference separately. You can experiment. You can see what each parameter does. You can iterate without re-encoding text every time.
Here's what this unlocks:
Better quality at higher resolution. The community has confirmed that pushing to 2560x1440 and 50fps dramatically improves detail and motion smoothness. Desktop's GUI isn't optimized for these settings. ComfyUI lets you dial in exactly what your hardware can handle.
Reusable conditioning. The LTXVSaveConditioning and LTXVLoadConditioning nodes let you encode a prompt once, then reuse that encoding across multiple inference runs. You save compute, reduce latency, and keep results consistent.
Granular guidance control. The Multimodal Guider node separates text guidance from cross-modal alignment. You can dial up motion fluidity without overfitting to the prompt. That's the difference between creepy over-constrained motion and natural, believable movement.
Faster inference. The January 2026 update brought significant speed improvements to the ComfyUI nodes, primarily because ComfyUI skips GUI overhead and lets you reuse conditioning across runs. Desktop hasn't caught up.
The official ComfyUI-LTXVideo nodes use the latest VAE and the correct inference pipeline. That's why they beat Desktop. It's not magic. It's access.
Setting Up LTX-2.3 in ComfyUI
Prerequisites and Installation
You need three things running:
- ComfyUI (latest version). If you don't have it, clone from the official repo and install dependencies.
- ComfyUI-LTXVideo nodes (the official Lightricks package). These don't come with ComfyUI by default, so you'll install them separately.
- The LTX-2.3 model weights from HuggingFace. You'll specify the model ID when you set up your nodes.
Technically you can run this on CPU, but you'll wait 10 minutes per frame. Seriously. The full bf16 model requires 32GB+ VRAM — an A100 or RTX 6000 Ada gives you comfortable headroom. If you're on a 16GB or 24GB card, you'll need a quantized variant (GGUF or fp8) to fit the model in memory.
Installing ComfyUI-LTXVideo Nodes
The easiest route: use the ComfyUI Manager. Search for "LTXVideo," click Install, restart ComfyUI.
If that doesn't work, clone the repository directly into your ComfyUI custom_nodes folder:
git clone https://github.com/Lightricks/ComfyUI-LTXVideo.git
Restart ComfyUI. You'll see the LTX nodes under a new "Video" category in your node browser.
The model loads automatically the first time you use an LTX node. It downloads from HuggingFace and caches locally; the first load takes a few minutes depending on your connection. Subsequent loads are instant.
That's it. You're ready.
The Best ComfyUI Workflows for Text-to-Video
A text-to-video workflow is simple in structure but powerful in what it outputs: video from a prompt.
Here's the minimal setup:
- Gemma API Text Encoding node — Encodes your prompt into embeddings the model understands.
- LTXVTextToVideoSampler node — Runs the actual inference. Takes the embeddings, noise, and settings, outputs video frames.
- VAE Decode node — Converts latent space back to pixel space (the video you actually see).
- Video Combine node — Stitches frames into a video file.
That's the skeleton. Quality comes from the settings you feed into the sampler.
Optimal Settings for Quality
Resolution: Start at 1280x720. You get good results in 5-10 minutes on solid hardware. Once you're comfortable, push to 1920x1080 or 2560x1440. The detail boost is real. The time cost is linear.
Framerate: 24fps is the cinema standard. 30fps feels more fluid. 50fps is the smoothest the model supports and catches subtle motion detail. The longer your video, the more the higher framerate matters. For 8-second clips, it's worth it.
Steps: 50 is the default and balances quality and speed. 40 runs noticeably faster with minimal quality loss. 80 pushes quality further, but step count scales runtime roughly linearly. Start at 50, then adjust based on what you see.
CFG Scale (Guidance): This controls how strictly the model follows your prompt. 7.0 is the default. 5.0 feels more creative and sometimes more natural. 9.0 makes motion feel constrained. Anything above 12 usually looks weird. Stick to 5-8 for video.
Seed: Set it to a specific number if you want reproducibility, or to -1 for randomness. Video generation benefits from slightly different seeds within a series. If one clip looks off, reseed and rerun.
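To keep runs comparable, it helps to pin these values in one place before you start tweaking. A minimal Python sketch, where the parameter names are illustrative and not the literal ComfyUI node socket names:

```python
import random

# Baseline text-to-video settings from this guide (illustrative names,
# not the actual ComfyUI node inputs).
BASELINE_T2V = {
    "width": 1280,
    "height": 720,
    "fps": 24,
    "steps": 50,
    "cfg_scale": 7.0,
}

def with_seed(settings, seed=-1):
    """Return a copy of the settings with a concrete seed.

    seed=-1 means 'randomize': we draw one and record it, so even a
    'random' run can be reproduced after the fact.
    """
    out = dict(settings)
    out["seed"] = seed if seed != -1 else random.randrange(2**32)
    return out

run = with_seed(BASELINE_T2V, seed=42)
```

The point of recording the drawn seed is that a run you liked by accident is still a run you can repeat.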
Using the Multimodal Guider
This is the secret weapon. The Multimodal Guider node lets you separate text guidance from cross-modal alignment.
Here's why this matters: sometimes your prompt pulls the video in one direction (fidelity to the words) and cross-modal alignment pulls it in another (consistency across frames and with the prompt's semantic meaning). If you crank text guidance without tuning cross-modal, you get jittery, overfitted motion. If you lean too hard on cross-modal, you lose prompt adherence.
The Multimodal Guider lets you dial them separately.
Text Guidance: How much the model should prioritize your exact words. Default is 7.0. Go lower (4-6) for more creative interpretation. Go higher (8-10) for stricter adherence.
Cross-Modal Alignment: How well the video should be internally consistent and semantically coherent with the prompt. Default is 1.0. Increasing this (1.5-2.0) smooths motion and reduces flicker. Going above 2.0 usually backfires.
If your video looks jittery or has temporal artifacts, increase cross-modal alignment. If it doesn't follow your prompt, increase text guidance. Don't max both.
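The underlying idea is a two-term extension of classifier-free guidance: each signal pushes the prediction away from the unconditional output independently, each with its own scale. Whether the Multimodal Guider combines its signals in exactly this form is an assumption on our part; this toy scalar sketch just shows why the two knobs can be tuned separately:

```python
def guided_prediction(uncond, text_cond, crossmodal_cond,
                      text_scale=7.0, crossmodal_scale=1.0):
    """Combine two guidance signals, classifier-free-guidance style.

    Each conditional prediction is compared to the unconditional one,
    and the differences are scaled independently. Scalars stand in for
    latent tensors to keep the sketch runnable.
    """
    return (uncond
            + text_scale * (text_cond - uncond)
            + crossmodal_scale * (crossmodal_cond - uncond))

# With the defaults, the text term dominates the output:
pred = guided_prediction(0.0, 1.0, 0.5, text_scale=7.0, crossmodal_scale=1.0)
```

Because the two difference terms are scaled independently, raising one never forces the other up with it, which is exactly the over-constrained-motion trap the single CFG knob falls into.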
ComfyUI Image-to-Video Workflow with LTX-2.3
Image-to-video is text-to-video's practical cousin. You start with a still image, add a motion prompt, and the model generates video that extends from that starting frame.
The workflow is nearly identical to text-to-video, except you feed the image into an image loader and the LTXVImageToVideoSampler instead of the text sampler.
- Image Loader node — Points to your starting image (PNG, JPG, 16:9 ratio works best).
- Resize Image node — Scales to your target resolution (1920x1080, 2560x1440, etc.).
- Gemma API Text Encoding node — Encodes your motion prompt. Keep it short: "camera pans left, water ripples, gentle motion."
- LTXVImageToVideoSampler node — Generates video from the image and motion prompt.
- VAE Decode and Video Combine — Same as text-to-video.
4 Steps to Better Motion
Step 1: Motion Prompts, Not Scene Prompts
Your prompt should describe motion, not the scene. Bad: "a beautiful mountain landscape." Good: "camera slowly zooms in, clouds drift across the sky, light rays through trees."
Image-to-video assumes you already have the scene. You're adding movement. Write accordingly.
Step 2: Dial Down CFG Scale
Start at 5.0, not 7.0. The model already has the image as ground truth. It doesn't need as much guidance. High CFG can distort the image into something unrecognizable.
Step 3: Increase Steps for Smooth Motion
Image-to-video benefits from more steps than text-to-video. Use 60-80 instead of 50. The extra cycles smooth motion and reduce jitter at the image-video boundary.
Step 4: Higher Frame Rate Helps
50fps is worth it for image-to-video. It makes the transition from the static image to video motion feel less jarring.
Resolution and Frame Rate Settings
Your image resolution should match your target video resolution. If your image is 1280x720 and you target 2560x1440, you'll get upsampling artifacts. Prep your image first.
The model works best at widescreen aspect ratios (16:9, 21:9). Portrait and square often produce distorted results.
Framerate: match whatever your downstream pipeline expects. YouTube likes 24fps. Social media and streaming apps prefer 30fps or 60fps. 48fps gives you the most flexibility in post: encode once at 48fps, then downconvert cleanly to 24fps later if needed.
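Downconverting by an integer factor is just dropping frames, which is why encoding once at a high rate and decimating later is cheap. A minimal sketch of that decimation:

```python
def downconvert(frames, src_fps, dst_fps):
    """Drop frames to reduce the frame rate by an integer factor.

    Only exact integer ratios are handled (e.g. 48 -> 24, 60 -> 30);
    anything else needs real temporal resampling, which is out of
    scope for this sketch.
    """
    if src_fps % dst_fps != 0:
        raise ValueError("only integer ratios are handled in this sketch")
    step = src_fps // dst_fps
    return frames[::step]

# 10 frames at 48fps -> every other frame survives at 24fps
half = downconvert(list(range(10)), 48, 24)
```

Non-integer conversions (say 50fps to 24fps) need frame blending or optical-flow interpolation in a video tool, which is why picking a source rate that divides cleanly saves you a step.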
Advanced Nodes That Save Time and VRAM
Once you understand the basics, these nodes unlock serious efficiency.
Gemma API Text Encoding
By default, ComfyUI-LTXVideo uses a local Gemma encoder. It works but ties up your GPU. The Gemma API Text Encoding node offloads encoding to Lightricks' free API instead.
Why use it:
- Encodes in under 1 second, regardless of GPU.
- Frees up your GPU for the expensive sampling step.
- Stays free indefinitely (no token limits, no soft limits).
- Perfect if you're iterating on settings and rerunning inference multiple times.
Drop the Gemma API Text Encoding node in place of the local encoder, pass your prompt, and it returns embeddings ready for the sampler. That's it.
Save and Load Conditioning
Here's a workflow superpower: encode your prompt once, save it, then reuse it across 50 different inference runs without re-encoding.
LTXVSaveConditioning node — Takes the encoded embeddings from your text encoder and saves them to disk as a .npy file.
LTXVLoadConditioning node — Loads that .npy file in future runs.
Why do this? Because encoding is expensive: a few seconds even on a GPU. If you're testing 10 different sampler settings with the same prompt, you encode once, then load the conditioning 10 times. You save seconds per run. Across a day of iteration, that adds up to hours.
It also makes your workflow reproducible. Same encoding, same random seed, same settings = identical output every time.
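The same caching idea can be sketched outside ComfyUI: key saved embeddings by a hash of the prompt and only encode on a cache miss. This is a conceptual stand-in, not the node implementation, and it uses pickle where the real nodes write .npy files:

```python
import hashlib
import pickle
from pathlib import Path

def cached_encode(prompt, encode_fn, cache_dir):
    """Encode a prompt once, then reuse the saved result on later runs.

    encode_fn stands in for the (expensive) text encoder; the cache key
    is a hash of the prompt, so the same prompt never encodes twice.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(prompt.encode()).hexdigest()
    path = cache_dir / f"{key}.pkl"
    if path.exists():
        return pickle.loads(path.read_bytes())
    embedding = encode_fn(prompt)          # the expensive step, runs once
    path.write_bytes(pickle.dumps(embedding))
    return embedding
```

One design note: hashing the prompt rather than using it as a filename sidesteps length limits and special characters, and it makes the cache hit deterministic, which is what keeps results consistent across runs.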
Common Mistakes and How to Fix Them
Mistake 1: Ignoring Aspect Ratio
You feed a 1:1 square image into LTX and expect cinematic output. You get distorted, weird motion. LTX is trained on 16:9 footage, so stick to that ratio and prep your images accordingly.
Fix: Crop or pad images to 16:9 before inference.
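Computing the crop is simple arithmetic. A sketch that returns a centered 16:9 crop box as (left, top, right, bottom), which you could feed to any image library's crop call:

```python
def center_crop_16_9(width, height):
    """Return a centered (left, top, right, bottom) crop box at 16:9.

    Crops whichever dimension is too large; images already at 16:9
    come back untouched. Integer division means the result can be a
    pixel narrower than exact 16:9, which is harmless in practice.
    """
    target_h = width * 9 // 16
    if target_h <= height:
        # Image is too tall (or already 16:9): trim top and bottom.
        top = (height - target_h) // 2
        return (0, top, width, top + target_h)
    # Image is too wide: trim left and right instead.
    target_w = height * 16 // 9
    left = (width - target_w) // 2
    return (left, 0, left + target_w, height)

box = center_crop_16_9(1000, 1000)  # square input gets letterbox-cropped
```

Padding instead of cropping follows the same arithmetic in reverse; cropping is usually the better default because padded bars tend to leak into the generated motion.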
Mistake 2: Text Guidance Too High
You set CFG scale to 15 because you want the model to "really follow your prompt." The video becomes stilted, over-constrained, and motion looks robotic.
Fix: Start at 7.0. Go lower (5-6) for creative freedom. Go higher (8-10) only if the model is completely ignoring your prompt.
Mistake 3: Cramming Too Much into Your Prompt
"A woman walks through a misty forest at sunset, camera zooms in and out, birds fly overhead, light rays pierce the trees, she stops and looks at the camera while smiling and raising her hand." That's 5 different things happening. LTX can't choreograph all of that.
Fix: Keep prompts to 1-2 actions max. Let the model handle the details. "Woman walks through misty forest, camera slowly zooms" is enough.
Mistake 4: Low Resolution, Then Upscaling
You generate at 1280x720, then run it through an upscaler, expecting 4K quality. You get artifacting and blur.
Fix: Generate at your target resolution from the start. 2560x1440 takes maybe 30% longer than 1080p. Worth it.
Mistake 5: Not Setting a Seed
You generate a great video, then regenerate with the same settings and get something completely different because the seed randomized. Now you can't reproduce it.
Fix: Always set a seed to a specific number if reproducibility matters. Set to -1 only when you want variation.
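The principle is the same anywhere random numbers appear: a fixed seed makes the generator's sequence, and therefore the output, deterministic. In plain Python, with a toy function standing in for the sampler:

```python
import random

def noisy_sample(seed):
    """Stand-in for a sampler: same seed in, same 'noise' out."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(4)]

fixed_a = noisy_sample(42)
fixed_b = noisy_sample(42)   # identical to fixed_a: reproducible
varied = noisy_sample(None)  # seeded from system entropy: varies per run
```

Passing None here plays the role of -1 in the node: you get variation, but you also lose the ability to regenerate that exact clip later.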
Mistake 6: Using Desktop When You Need Precision
You need repeatable, tweakable results. Desktop's GUI isn't designed for that. ComfyUI is.
Fix: Commit to the node workflow. Yes, it's a bit steeper to learn. The payoff in control and quality is enormous.
Conclusion
LTX-2.3 is one of the best open-weight video models available. But its quality ceiling is determined by the pipeline you run it through. ComfyUI-LTXVideo gives you that pipeline—the same one we use internally, with the same VAE and inference logic that powers our products.
Start with the text-to-video workflow. Get comfortable with resolution, framerate, and CFG scale. Then move to image-to-video and the advanced nodes. By your tenth workflow, you'll have a repeatable process that beats what most people get from the desktop app.
Your hardware will thank you. Your clients will notice. And the community will benefit when you share what you learned.
Ready to start? Install the ComfyUI-LTXVideo nodes today. Your first workflow is just a few nodes away.