What’s shipping in this release
1. Better fine details: rebuilt latent space and updated VAE
We rebuilt our VAE architecture and trained it on higher-quality data with an improved recipe. The result is a new latent space that allows sharper output with better preservation of textures and edges.
Previous checkpoints had great motion and structure, but some fine textures, like hair and edge detail, were softer than we would have liked, especially at lower resolutions. The new architecture generates sharper details across all resolutions. If you’ve been upscaling or sharpening in post to compensate, you should need less of that now.
2. Better prompt understanding: larger and more capable text connector
We increased the capacity of the text connector, the component that bridges the prompt encoding and the generation model, and improved its architecture. The result is more accurate interpretation of complex prompts, with less drift from what you actually asked for, especially prompts with multiple subjects, spatial relationships, or specific stylistic instructions.
If you’ve been simplifying prompts to get consistent results, try being more specific. The model handles it now.
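If it helps to picture what "more specific" looks like in practice, here is a minimal text-to-video sketch with a deliberately detailed multi-subject prompt. It assumes a diffusers-style interface; the class name, repository id, and parameter values below are illustrative placeholders rather than confirmed 2.3 entry points, so check the model card for the real ones.

```python
import torch
from diffusers import LTXPipeline            # assumption: 2.3 loads through an LTX-style pipeline class
from diffusers.utils import export_to_video

# Placeholder repo id for illustration; use the id from the 2.3 model card.
pipe = LTXPipeline.from_pretrained("Lightricks/LTX-2.3", torch_dtype=torch.bfloat16).to("cuda")

# A specific prompt: multiple subjects, spatial relationships, and style instructions,
# rather than a simplified one-liner.
prompt = (
    "A red vintage bicycle leaning against a green door on the left, "
    "a tabby cat sitting on the doorstep to the right, soft morning light, "
    "shallow depth of field, slow push-in, 35mm film look"
)

video = pipe(prompt=prompt, num_inference_steps=30).frames[0]
export_to_video(video, "detailed_prompt.mp4", fps=24)
```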
3. Improved image-to-video: less freezing, more motion
This was one of the most reported issues. I2V outputs often froze or produced a slow pan instead of real motion. We reworked training to reduce the “Ken Burns” effect, eliminate static videos, reduce unexpected cuts, and improve visual consistency from the input frame.
If you’re using I2V in production pipelines, this should reduce the number of generations you throw away.
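For reference, an image-to-video call looks roughly like the sketch below, under the same assumptions as the earlier snippet (diffusers-style interface, placeholder repo id). The point is simply conditioning on an input frame plus a motion-oriented prompt, then checking how much real motion comes back.

```python
import torch
from diffusers import LTXImageToVideoPipeline   # assumption: an I2V pipeline class along the lines of earlier LTX releases
from diffusers.utils import export_to_video, load_image

# Placeholder repo id for illustration; take the real one from the 2.3 model card.
pipe = LTXImageToVideoPipeline.from_pretrained("Lightricks/LTX-2.3", torch_dtype=torch.bfloat16).to("cuda")

image = load_image("input_frame.png")  # the frame the video should start from
prompt = "The dog shakes off water in a burst of motion, droplets flying, camera holding steady"

video = pipe(image=image, prompt=prompt, num_inference_steps=30).frames[0]
export_to_video(video, "i2v_test.mp4", fps=24)
```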
4. Cleaner audio
We filtered the training set for silence, noise, and artifacts, and shipped a new vocoder. Audio is more reliable now - fewer random sounds, fewer unexpected drops, tighter alignment. This applies to both text-to-video and audio-to-video.
5. Portrait video support: native vertical up to 1080x1920
Native portrait video, up to 1080×1920. Trained on vertical data, not cropped from landscape. First time in LTX.
Vertical video is the default format for TikTok, Instagram Reels, YouTube Shorts, and most mobile-first content. Portrait mode is native in 2.3: set the resolution and generate.
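In a programmatic call, portrait output is just a matter of requesting a vertical resolution. A sketch under the same illustrative assumptions as the earlier snippets:

```python
import torch
from diffusers import LTXPipeline            # assumption: same illustrative pipeline class as above
from diffusers.utils import export_to_video

pipe = LTXPipeline.from_pretrained("Lightricks/LTX-2.3", torch_dtype=torch.bfloat16).to("cuda")

# Request a vertical resolution directly instead of cropping landscape output.
# 1080x1920 is the announced portrait ceiling; if your integration requires
# dimensions divisible by 32, round the width accordingly (e.g. 1088).
video = pipe(
    prompt="A street performer juggling under neon signs at night, handheld vertical framing",
    width=1080,
    height=1920,
    num_inference_steps=30,
).frames[0]
export_to_video(video, "portrait.mp4", fps=24)
```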
ComfyUI workflows: ready to go
We’re shipping simple, stable reference workflows as a starting point. These are designed to work out of the box so you can start testing immediately.
API support
The new 2.3 checkpoints will be available through the API and LTX Studio starting
What we built on LTX - and why it matters that we did
The engine is only as credible as what runs on top of it.
So we set out to show what it enables.
LTX Desktop is a professional video editor built entirely on the LTX-2.3 engine - the same weights we’re releasing today. No proprietary layer. No separate stack. We use it internally. Now it’s public.
Run it fully local on your own machine - no internet required after setup, no cost per generation, full access to the model weights. Or use the API as the backend if you prefer not to manage local infrastructure.
For companies over $10M in annual revenue, there's a commercial license. For everyone else, LTX Desktop is free and open source.
What the community built on LTX-2
Since the January release, nearly five million people have downloaded LTX-2. A large part of what happened next wasn't us - it was the community.
EasyCache hit a 2.3x inference speedup. Quantizations appeared for hardware we never tested. LoRAs appeared for styles, motions, and use cases we hadn't considered. The community also built and shared a range of ComfyUI nodes that extended what the model could do in ways we hadn't planned for.
To everyone in the community who used it, built on it, or told us what wasn't working: that's what moves this forward. And thank you to the Banodoco community, who keep finding ways to push this further than we expected.
Everything that ships today
- LTX-2.3 base checkpoint - live now on Hugging Face
- LTX-2.3 distilled checkpoint and LoRA - live now on Hugging Face
- LTX-2.3 latent upscalers - live now on Hugging Face
- LTX Desktop Beta - free download, open source
- Full weights, training framework, benchmarks, LoRAs, and the complete multimodal pipeline: text-to-video, image-to-video, video-to-video, audio-conditioned generation, depth conditioning
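To pull any of the checkpoints above onto your own machine, a standard Hugging Face download is enough. The repo id in this sketch is a placeholder, so substitute the one from the release page.

```python
from huggingface_hub import snapshot_download

# Placeholder repo id for illustration; use the actual LTX-2.3 repository from the release page.
local_dir = snapshot_download(repo_id="Lightricks/LTX-2.3")
print(f"Checkpoints downloaded to: {local_dir}")
```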
Try it, break it, tell us what's next. If you build something with it - or hit sharp edges - let us know in Discord. That's how this gets better.
