
How to Run a Video Generation Model Locally: A Low VRAM Guide

Set up local AI video generation on consumer GPUs. Covers VRAM tiers, pipeline selection, FP8 quantization, and ComfyUI workflows.

LTX Team
Key Takeaways:
  • Local LTX-2 video generation requires CUDA 13+ and an Nvidia GPU; the distilled model plus FP8 quantization enables 32GB VRAM workflows, while full two-stage production pipelines target 80GB+.
  • Pipeline selection is the most impactful decision: DistilledPipeline for fast low-VRAM iteration, TI2VidTwoStagesPipeline for production quality, and TI2VidOneStagePipeline for quick prototyping without upsampling.
  • Key optimization levers are FP8 quantization, xFormers attention, gradient estimation denoising (reducing steps from 40 to 20-30), and stage-to-stage memory cleanup — combine these to fit larger workflows into constrained VRAM budgets.

Running AI video generation locally gives you full control over your workflow: no API rate limits, no per-second costs, and complete data privacy. The trade-off is that video generation demands significantly more GPU memory than image generation. Features like multi-stage sampling, reference video preprocessing, and temporal upsampling all compete for VRAM.

The good news is that modern open source video models include optimization paths specifically designed for consumer hardware. With the right pipeline selection, quantization settings, and workflow configuration, you can run production-capable video generation on GPUs with as little as 32GB of VRAM, and experimental generation on even less.

This guide covers what you need to get started, which pipeline to choose for your VRAM tier, how to optimize memory usage, and how to set up ComfyUI for local video generation using LTX-2.

Prerequisites: LTX-2 requires CUDA 13+ and targets Nvidia GPUs. The officially documented minimum is 32GB of VRAM, which runs the distilled model with FP8 quantization (e.g. on an RTX 5090); full-fidelity two-stage pipelines target 80GB+. Community members have explored running LTX-2 on smaller cards (24GB GPUs like the RTX 4090) using aggressive quantization; check the Discord for community-developed approaches. Python environment management uses uv.

What You Need to Run AI Video Generation Locally

Minimum Hardware Requirements

Local AI video generation has a hard floor: you need an Nvidia GPU with CUDA support. The minimum practical VRAM depends on which pipeline and model variant you run.

VRAM tiers at a glance:

32GB (documented minimum): distilled model with FP8 quantization (low-VRAM config). Runs DistilledPipeline with 8-step inference and basic T2V and I2V. Limitations: no IC-LoRA, no full two-stage pipeline, limited resolution. Community work on sub-32GB setups (e.g. the 24GB RTX 4090) exists but is not officially supported; see the Discord for community resources.

48GB (estimated, not officially tested): dev or distilled model, optional FP8. Runs two-stage pipelines at moderate resolutions and some IC-LoRA workflows. Limitation: may need memory optimization for complex workflows.

80GB+ (recommended): dev model (full fidelity). Runs all pipelines, including IC-LoRA, A2V, retake, keyframe interpolation, and full audio-video generation. Limitation: hardware cost.
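To make the tier selection concrete, here is a minimal sketch that encodes the cutoffs from the table above as a lookup. The function name and return strings are illustrative only, not part of the LTX-2 API:

```python
def recommend_pipeline(vram_gb: float) -> str:
    """Map available VRAM to the tier described in the table above.

    Illustrative helper only; the cutoffs mirror the table and are
    not an official LTX-2 API.
    """
    if vram_gb >= 80:
        return "Dev model, any pipeline (full fidelity)"
    if vram_gb >= 48:
        return "TI2VidTwoStagesPipeline, dev or distilled, consider FP8"
    if vram_gb >= 32:
        return "DistilledPipeline with FP8 quantization"
    return "Below the documented minimum; see Discord for community setups"


print(recommend_pipeline(32))  # DistilledPipeline with FP8 quantization
```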

Additional hardware considerations: System RAM of 32GB or more is recommended for loading model weights. SSD storage is important because checkpoint files (the dev model is ltx-2.3-22b-dev.safetensors, the distilled variant is ltx-2.3-22b-distilled.safetensors) are large, and loading from spinning disk adds significant startup time.

Software Dependencies

LTX-2 requires CUDA 13+ and uses the uv package manager for its Python environment. The setup is straightforward:

1. Clone the repository: git clone https://github.com/Lightricks/LTX-2.git

2. Set up the environment: cd LTX-2 && uv sync --frozen && source .venv/bin/activate

3. For attention optimizations, install xFormers: uv sync --extra xformers

4. For FP8 scaled matrix multiplication on Hopper GPUs: uv sync --frozen --extra fp8-trtllm
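Before downloading any weights, it is worth confirming that the synced environment actually sees your GPU and CUDA runtime. The check below uses standard PyTorch calls, nothing LTX-specific:

```python
import torch

# Fail fast if no CUDA device is visible before pulling tens of GB of weights.
assert torch.cuda.is_available(), "No CUDA device visible to PyTorch"

props = torch.cuda.get_device_properties(0)
print(f"GPU:  {props.name}")
print(f"VRAM: {props.total_memory / 1024**3:.1f} GB")
print(f"CUDA: {torch.version.cuda}")
```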

Downloading Model Weights

Download the required models from the LTX-2.3 HuggingFace repository:

Model checkpoint (choose one): ltx-2.3-22b-dev.safetensors (full fidelity) or ltx-2.3-22b-distilled.safetensors (fast inference)

Spatial upscaler (required for two-stage pipelines): ltx-2.3-spatial-upscaler-x2-1.0.safetensors or the 1.5x variant

Distilled LoRA (required for two-stage pipelines except DistilledPipeline and ICLoraPipeline): ltx-2.3-22b-distilled-lora-384.safetensors

Gemma 3 text encoder: Download all assets from the Gemma 3-12B repository
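If you prefer scripted downloads over the browser, huggingface_hub can fetch individual files. The repo id below is a placeholder (substitute the actual LTX-2.3 repository id); the filenames are the ones listed above:

```python
from huggingface_hub import hf_hub_download

REPO_ID = "Lightricks/LTX-2"  # placeholder; use the actual LTX-2.3 repo id

# The distilled checkpoint plus the extras the two-stage pipelines expect.
FILES = [
    "ltx-2.3-22b-distilled.safetensors",
    "ltx-2.3-spatial-upscaler-x2-1.0.safetensors",
    "ltx-2.3-22b-distilled-lora-384.safetensors",
]

for filename in FILES:
    path = hf_hub_download(repo_id=REPO_ID, filename=filename)
    print(f"Cached at {path}")
```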

Choosing the Right Pipeline for Your VRAM

LTX-2 provides multiple pipelines optimized for different hardware tiers and use cases. Choosing the right one is the single most impactful decision for local deployment.

DistilledPipeline: Best for Low VRAM

The DistilledPipeline is the fastest inference option, using only 8 predefined sigmas (8 steps in stage 1, 4 steps in stage 2). It requires no guidance (no CFG computation), which significantly reduces memory overhead. This is the recommended starting point for anyone running on consumer hardware.

When to use: Fastest inference is critical, batch processing multiple videos, VRAM is constrained, initial prototyping before committing to full-fidelity renders.
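To see why skipping guidance saves memory, compare what one denoising step costs with and without CFG. This is a generic sketch of the classifier-free guidance pattern, not LTX-2 internals:

```python
def step_with_cfg(model, x, t, cond, uncond, scale=7.5):
    # CFG runs the transformer twice per step (conditional + unconditional)
    # and blends the predictions, doubling compute and activation memory.
    eps_c = model(x, t, cond)
    eps_u = model(x, t, uncond)
    return eps_u + scale * (eps_c - eps_u)


def step_distilled(model, x, t, cond):
    # The distilled model takes one forward pass per step and needs no
    # unconditional branch, which is why it fits constrained VRAM budgets.
    return model(x, t, cond)
```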

TI2VidTwoStagesPipeline: Production Quality

The two-stage production pipeline generates video in two passes. Stage 1 produces a lower-resolution base video with full multimodal guidance (CFG, STG). Stage 2 upsamples to 2x resolution using a spatial upscaler with distilled LoRA refinement. This produces the highest quality output but requires more VRAM.

When to use: Final renders, production-quality output, sufficient VRAM (80GB+ recommended). The HQ variant (TI2VidTwoStagesHQPipeline) uses the res_2s second-order sampler for potentially fewer steps at comparable quality.

TI2VidOneStagePipeline: Quick Prototyping

Single-stage generation without upsampling. Supports multimodal guidance and image conditioning but produces lower-resolution output (typically 512x768). This pipeline is primarily for educational purposes and quick prototyping.

When to use: Learning the pipeline, testing prompts, when resolution is not a priority.

VRAM Optimization Techniques

FP8 Quantization

FP8 quantization reduces the transformer's memory footprint by storing weights in 8-bit floating point format. LTX-2 supports two FP8 backends:

FP8 Cast: The simpler approach. Casts weights to FP8 for storage and upcasts during inference. Enable via CLI with --quantization fp8-cast or in Python with QuantizationPolicy.fp8_cast().

FP8 Scaled MM (TensorRT-LLM): Uses Nvidia TensorRT-LLM's scaled matrix multiplication for efficient FP8 computation. Requires Hopper GPUs. Enable with --quantization fp8-scaled-mm or QuantizationPolicy.fp8_scaled_mm(). Supports optional static input quantization with calibration data for further optimization.
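Back-of-the-envelope arithmetic shows why FP8 matters for a 22B-parameter transformer: weights alone drop from roughly 41GB in BF16 to roughly 20GB in FP8, which is the difference between overflowing and fitting a 32GB card.

```python
PARAMS = 22e9  # 22B-parameter LTX-2 transformer

for fmt, bytes_per_weight in [("BF16", 2), ("FP8", 1)]:
    gb = PARAMS * bytes_per_weight / 1024**3
    print(f"{fmt}: ~{gb:.0f} GB of weights")

# BF16: ~41 GB of weights (over a 32GB card before any activations)
# FP8:  ~20 GB of weights (leaves headroom for activations and the VAE)
```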

Attention Optimizations

Install xFormers (uv sync --extra xformers) for memory-efficient attention computation. For Hopper GPUs, Flash Attention 3 provides additional speedups. These optimizations reduce VRAM usage during the attention computation phase without affecting output quality.
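A quick way to verify xFormers is installed and functional is to call its attention kernel directly. This is a standalone smoke test, not how LTX-2 wires attention internally; the tensor shapes are (batch, sequence, heads, head_dim):

```python
import torch
import xformers.ops as xops

# Long token sequences are where memory-efficient attention pays off;
# a few seconds of video easily produces thousands of latent tokens.
q = torch.randn(1, 4096, 16, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 4096, 16, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 4096, 16, 64, device="cuda", dtype=torch.float16)

out = xops.memory_efficient_attention(q, k, v)
print(out.shape)  # torch.Size([1, 4096, 16, 64])
```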

Gradient Estimation Denoising

Gradient estimation reduces the number of inference steps required from 40 to 20-30 while maintaining quality, shortening the denoising loop and the time the GPU spends at peak load. See the pipeline documentation for configuration details.

Memory Cleanup Between Stages

Two-stage pipelines can automatically clean up GPU memory between stages. If VRAM is tight, enabling this cleanup frees the stage 1 model before loading the stage 2 upsampler. If you have sufficient VRAM, disabling cleanup avoids the overhead of reloading models and speeds up total generation time.
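The pipelines expose cleanup as a toggle; if you ever need to reclaim VRAM between stages in your own scripts, the plain-PyTorch equivalent looks like this (the Linear layer stands in for a stage-1 model):

```python
import gc
import torch

# Stand-in for a stage-1 model; in a real run this is a transformer
# occupying tens of GB rather than a single Linear layer.
stage1_model = torch.nn.Linear(4096, 4096).cuda()
print(f"Before: {torch.cuda.memory_allocated() / 1024**2:.0f} MB allocated")

# Drop every reference, collect, and return cached blocks to the driver
# so the stage-2 upsampler can claim the freed memory.
del stage1_model
gc.collect()
torch.cuda.empty_cache()
print(f"After:  {torch.cuda.memory_allocated() / 1024**2:.0f} MB allocated")
```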

Running Video Generation with ComfyUI

ComfyUI provides a visual node-based interface for building video generation workflows. LTX-2 integrates via the official ComfyUI-LTXVideo plugin.

Installing the ComfyUI Plugin

Follow the installation instructions at the ComfyUI-LTXVideo repository. The plugin adds nodes for all LTX-2 pipeline types, including text-to-video, image-to-video, IC-LoRA, and audio-to-video workflows.

Optimizing ComfyUI for Lower VRAM

When running ComfyUI on consumer hardware:

Use the distilled model and DistilledPipeline nodes for lowest memory usage

Enable FP8 quantization in the model loader node

Mute unused workflow branches to prevent ComfyUI from loading unnecessary models into VRAM

Start with shorter clips and lower resolutions for testing, then scale up once your workflow is stable

Using LTX Desktop for Local Video Generation

LTX Desktop provides a standalone desktop application for local AI video generation. It wraps the core LTX-2 pipelines in a graphical interface, handling model downloads, GPU detection, and workflow configuration automatically. For users who prefer a GUI over CLI or ComfyUI, LTX Desktop is the fastest path to local generation.

Troubleshooting Common Issues

Out of Memory (OOM) Errors

OOM errors during video generation typically mean your workflow exceeds available VRAM. The fix depends on where the crash occurs:

During model loading: Switch to the distilled variant or enable FP8 quantization

During denoising: Reduce resolution, reduce frame count, or switch from a two-stage to single-stage pipeline

During upsampling: Skip the upsampling stage temporarily, or reduce the target resolution

Frame count constraint: Video frame counts must satisfy (F-1) % 8 == 0. Valid frame counts include 9, 17, 25, 33, 41, 49, 57, 65, 73, 81, 89, 97. Using an invalid frame count may cause unexpected errors.
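A small helper makes the constraint concrete: check a requested frame count, or snap it to the nearest valid value (the function names are illustrative):

```python
def is_valid_frame_count(f: int) -> bool:
    # LTX-2 accepts frame counts where (F - 1) % 8 == 0: 9, 17, 25, ..., 97
    return f >= 9 and (f - 1) % 8 == 0


def snap_frame_count(f: int) -> int:
    """Round a requested frame count to the nearest valid value."""
    return max(9, round((f - 1) / 8) * 8 + 1)


print(is_valid_frame_count(50))  # False
print(snap_frame_count(50))      # 49
```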

Slow Generation Times

If generation is slower than expected, check the following:

Attention optimizations are installed (xFormers or Flash Attention 3)

FP8 quantization is enabled, if your hardware supports it

Memory cleanup between stages is disabled, if you have sufficient VRAM

Gradient estimation is enabled, to reduce the required step count

Quality Issues at Lower VRAM Settings

Running with quantization and reduced settings involves quality trade-offs. FP8 quantization maintains most of the visual quality but may introduce subtle differences. The distilled model prioritizes speed over maximum fidelity. Lower resolutions and fewer frames reduce quality proportionally. The recommended workflow is to prototype on constrained hardware and render final outputs on a machine with 80GB+ VRAM, or to use the hosted API as an alternative.
