Quantization Formats For Faster Local AI Video Inference: FP8, MXFP8 & NVFP4 Explained

Learn how FP8, MXFP8, and NVFP4 quantization formats reduce VRAM usage and speed up local AI video inference. Includes LTX-2 setup steps and CLI flags.

LTX Team
Key Takeaways:
  • Quantization reduces model weight precision to lower VRAM requirements for local inference — LTX-2 currently supports FP8 natively, cutting VRAM by ~40% with minimal quality loss, via either fp8-cast (any compatible GPU) or fp8-scaled-mm (Hopper GPUs with TensorRT-LLM).
  • MXFP8 improves on standard FP8 by applying block-level scaling to preserve more dynamic range, while NVFP4 halves memory again to ~10GB for a 20B model — both tied to NVIDIA Blackwell architecture, with an official NVFP4 checkpoint already available on HuggingFace.
  • For most local deployments today, start with fp8-cast combined with the distilled model and gradient estimation denoising to run LTX-2 on 32GB GPUs at production quality.

Running a 20.9-billion-parameter video model on a local GPU means every byte of memory counts. Full-precision weights demand more VRAM than most consumer and even many professional cards can offer, and that gap between model size and available memory is where quantization becomes essential. Choosing the right numeric format for inference determines whether you can run locally at all, how fast each generation completes, and how much visual quality you retain.

This guide explains three quantization formats relevant to AI video inference: FP8, MXFP8, and NVFP4. It covers what each format is, how they compare in precision and memory savings, and how to configure FP8 quantization in LTX-2 today.

What Is Quantization in AI Video Generation?

Quantization reduces the numeric precision of model weights from higher-bit formats (FP32 or FP16) to lower-bit representations. For video generation models, this is more consequential than for image or text models because video adds a temporal dimension: a single generation pass processes hundreds of spatial-temporal tokens across 48 transformer blocks, and activation memory grows with frame count and resolution on top of what the weights themselves occupy. Every gigabyte saved on weight storage becomes headroom for longer, higher-resolution generations.

How Numeric Precision Affects VRAM and Speed

A model weight stored in FP32 occupies 4 bytes. FP16 cuts that to 2 bytes. FP8 halves it again to 1 byte, and FP4 formats push down to 0.5 bytes per weight. For a model with billions of parameters, these reductions translate directly to gigabytes of freed VRAM. The trade-off is always precision: fewer bits mean a narrower range of representable values, which can introduce rounding errors that affect generation quality.
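To put numbers on that, here is a quick back-of-the-envelope calculation (plain Python, no dependencies) of the weight-only footprint at each precision for the 20.9B-parameter model discussed in this guide:

# Rough weight-only VRAM footprint for a 20.9B-parameter model at each precision.
# Activations, attention buffers, and the VAE/text encoder add to these numbers.
PARAMS = 20.9e9

BYTES_PER_WEIGHT = {
    "FP32": 4.0,
    "FP16": 2.0,
    "FP8": 1.0,
    "FP4": 0.5,
}

for fmt, nbytes in BYTES_PER_WEIGHT.items():
    gib = PARAMS * nbytes / 1024**3
    print(f"{fmt:>5}: {gib:6.1f} GiB for weights alone")

# Prints roughly: FP32 ~77.9 GiB, FP16 ~38.9 GiB, FP8 ~19.5 GiB, FP4 ~9.7 GiB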

In practice, the quality impact depends on which layers are quantized and how. Quantizing transformer linear layers (where most parameters live) while keeping critical components like the VAE and text encoder at full precision is the standard approach for preserving output quality while reducing the memory footprint substantially.
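As a rough illustration of that layer selection, the sketch below walks any PyTorch model and tallies how many parameters would move to FP8 if only nn.Linear layers outside the VAE and text encoder were quantized. The prefix names are placeholders for illustration, not LTX-2's actual module tree.

from torch import nn

def plan_fp8_quantization(model: nn.Module, skip_prefixes=("vae", "text_encoder")):
    """Tally parameters that would be cast to FP8 versus kept at full precision,
    assuming only nn.Linear layers outside the skipped submodules are quantized."""
    to_fp8, kept = 0, 0
    for name, module in model.named_modules():
        if not isinstance(module, nn.Linear):
            continue
        n = module.weight.numel() + (module.bias.numel() if module.bias is not None else 0)
        if name.startswith(skip_prefixes):
            kept += n     # critical components (VAE, text encoder) stay in FP16/FP32
        else:
            to_fp8 += n   # transformer linear layers are the main quantization target
    print(f"to FP8: {to_fp8 / 1e9:.2f}B params | kept at full precision: {kept / 1e9:.2f}B")
    return to_fp8, kept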

Why Video Models Are Different from LLMs

Large language models process 1D token sequences. Video diffusion models process 3D spatial-temporal latents across dual streams (a 14B-parameter video stream and a 5B-parameter audio stream in the case of LTX-2). The memory pressure is higher because the denoising loop runs a full forward pass through the transformer at every step, and two-stage pipelines add further compute by first generating at low resolution and then upsampling. Quantization is not optional for local deployment. It is a prerequisite.

FP8 Quantization Explained

FP8 (8-bit floating point) is the most mature and widely supported reduced-precision format for AI inference. It stores each weight in a single byte using two standard variants: E4M3 (4 exponent bits, 3 mantissa bits) optimized for weights and activations, and E5M2 (5 exponent bits, 2 mantissa bits) optimized for gradients. For inference, E4M3 is the relevant variant because it preserves more precision in the value range that matters for model weights. In LTX-2, enabling FP8 quantization reduces VRAM usage by approximately 40% with minimal quality loss.
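If you want to see the rounding behavior for yourself, recent PyTorch builds expose E4M3 as torch.float8_e4m3fn. The quick experiment below (not part of the LTX-2 pipeline) round-trips a weight tensor through FP8 and reports the error and the per-element memory:

import torch

# Cast a tensor of FP16 weights to FP8 E4M3 and back, then measure the rounding error.
w = torch.randn(4096, 4096, dtype=torch.float16) * 0.02   # typical small weight magnitudes
w_fp8 = w.to(torch.float8_e4m3fn)                         # 1 byte per weight
w_back = w_fp8.to(torch.float16)

err = (w - w_back).abs().float().mean() / w.abs().float().mean()
print(f"mean abs error relative to mean |w|: {err.item():.3%}")
print(f"bytes per element: {w.element_size()} (FP16) -> {w_fp8.element_size()} (FP8)")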

FP8 Cast vs FP8 Scaled Matrix Multiplication

LTX-2 supports two FP8 quantization policies, each suited to different hardware and use cases.

FP8 Cast downcasts transformer linear weights to FP8 during model loading and upcasts them back to higher precision on the fly during each inference step. This approach requires no extra dependencies and works on any GPU with FP8 support. It reduces the memory footprint for weight storage while keeping computation in higher precision.
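A minimal PyTorch sketch of that idea, purely illustrative rather than the LTX-2 implementation: wrap an existing nn.Linear, keep its weight stored as FP8, and upcast just before the matmul.

import torch
from torch import nn

class FP8CastLinear(nn.Module):
    """Sketch of the fp8-cast idea: store the weight in FP8 E4M3 (1 byte per parameter)
    and upcast it on the fly inside forward(), so the matmul itself still runs in the
    activation's precision."""
    def __init__(self, linear: nn.Linear):
        super().__init__()
        self.register_buffer("weight_fp8", linear.weight.detach().to(torch.float8_e4m3fn))
        self.bias = linear.bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight_fp8.to(x.dtype)  # transient upcast; freed after the call
        return nn.functional.linear(x, w, self.bias)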

FP8 Scaled Matrix Multiplication uses NVIDIA TensorRT-LLM's cublas_scaled_mm for efficient FP8 matrix multiplication. Weights are stored in FP8 format with per-tensor scaling, and inputs are quantized dynamically (or statically with a calibration file). This backend is optimized for Hopper-architecture GPUs (H100, H200) and delivers both memory savings and faster computation by running the actual math in FP8.

| Policy | FP8 Cast | FP8 Scaled MM |
| --- | --- | --- |
| CLI Flag | --quantization fp8-cast | --quantization fp8-scaled-mm |
| Dependencies | None (built-in) | TensorRT-LLM required |
| GPU Requirement | Any GPU with FP8 support | Hopper GPUs (H100, H200) |
| Memory Savings | Weight storage only (~40% VRAM reduction) | Weight storage + compute (~40% VRAM reduction) |
| Speed Benefit | Moderate (less VRAM pressure) | Higher (native FP8 math) |
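To make the per-tensor scaling behind fp8-scaled-mm concrete, here is a small sketch of the bookkeeping in plain PyTorch. The real backend hands the FP8 matmul to cublas_scaled_mm on Hopper; the upcast matmul below is only for readability.

import torch

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def quantize_per_tensor(t: torch.Tensor):
    """Return an FP8 copy of t plus the scale needed to recover the original range."""
    scale = t.abs().max().clamp_min(1e-12) / E4M3_MAX
    return (t / scale).to(torch.float8_e4m3fn), scale

# Weights are quantized once at load time; activations are quantized dynamically per call
# (or with a fixed scale taken from a calibration file).
w_fp8, w_scale = quantize_per_tensor(torch.randn(4096, 4096) * 0.02)
x_fp8, x_scale = quantize_per_tensor(torch.randn(8, 4096))

# Upcast here for clarity; a Hopper kernel keeps the matmul in FP8 and folds both
# scales into its epilogue.
y = (x_fp8.to(torch.float32) @ w_fp8.to(torch.float32).T) * (x_scale * w_scale)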

How to Enable FP8 in LTX-2

Both policies require setting the PYTORCH_CUDA_ALLOC_CONF environment variable to enable expandable memory segments. Here is the CLI configuration for each:

FP8 Cast (works on any GPU with FP8 support):

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python -m ltx_pipelines.ti2vid_two_stages --quantization fp8-cast --checkpoint-path /path/to/ltx-2.3-22b-dev.safetensors

FP8 Scaled MM (requires TensorRT-LLM, best on Hopper GPUs):

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python -m ltx_pipelines.ti2vid_two_stages --quantization fp8-scaled-mm --checkpoint-path /path/to/ltx-2.3-22b-dev.safetensors

For programmatic usage in Python scripts, pass a QuantizationPolicy to the pipeline constructor:

from ltx_core.quantization import QuantizationPolicy
# Import path assumed from the CLI module name (python -m ltx_pipelines.ti2vid_two_stages);
# adjust to match your installed package layout.
from ltx_pipelines.ti2vid_two_stages import TI2VidTwoStagesPipeline

pipeline = TI2VidTwoStagesPipeline(
    checkpoint_path=ltx_model_path,              # e.g. /path/to/ltx-2.3-22b-dev.safetensors
    distilled_lora=distilled_lora,
    spatial_upsampler_path=upsampler_path,
    gemma_root=gemma_root_path,                  # Gemma text-encoder directory
    loras=[],
    quantization=QuantizationPolicy.fp8_cast(),  # enable the FP8 cast policy
)

The FP8 Scaled MM backend also supports static input quantization with a calibration file for more consistent quality: QuantizationPolicy.fp8_scaled_mm(calibration_amax_path="/path/to/amax.json").

MXFP8: Microscaling for Better Accuracy

MXFP8 (Microscaling FP8) is an industry-standard format defined by the Open Compute Project's Microscaling Formats Specification. While standard FP8 applies a single scale factor per tensor or per channel, MXFP8 applies scale factors at the block level, typically to groups of 32 elements. This finer-grained scaling preserves more dynamic range within each block, reducing the quantization error that accumulates in layers with highly variable weight distributions.

MXFP8 vs FP8: When Microscaling Matters

The practical difference between MXFP8 and standard FP8 shows up in layers where weight magnitudes vary significantly within a single tensor. Standard per-tensor scaling forces all values to share one scale factor, which can clip outliers or waste precision on small values. Block-level microscaling lets each group of 32 elements use its own scale, preserving both outliers and fine detail. For video diffusion models with 48 transformer blocks and distinct video and audio streams, this granularity can reduce quality degradation in the audio-visual cross-attention layers where weight distributions differ most between modalities.
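A quick experiment makes the difference tangible: quantize the same vector once with a single shared scale and once with per-block scales over groups of 32, with one extreme outlier present. This is a schematic of the scaling scheme using plain float scales; the MX specification stores power-of-two scale factors per block.

import torch

E4M3_MAX = 448.0   # largest finite FP8 E4M3 value
BLOCK = 32         # MX block size

def fp8_roundtrip(t: torch.Tensor) -> torch.Tensor:
    # Quantize to FP8 E4M3 and immediately dequantize back to FP32.
    return t.to(torch.float8_e4m3fn).to(torch.float32)

# Small-magnitude weights with one extreme outlier: the worst case for a shared scale.
x = torch.randn(4096) * 1e-4
x[17] = 50.0

# Per-tensor scaling: the outlier sets the scale, pushing everything else toward
# (or below) FP8's smallest representable magnitudes.
s = x.abs().max() / E4M3_MAX
per_tensor = fp8_roundtrip(x / s) * s

# Block scaling: each group of 32 elements gets its own scale, so only the
# outlier's block pays the price.
blocks = x.view(-1, BLOCK)
s_blk = blocks.abs().amax(dim=1, keepdim=True).clamp_min(1e-12) / E4M3_MAX
per_block = (fp8_roundtrip(blocks / s_blk) * s_blk).view(-1)

print(f"per-tensor mean abs error: {(x - per_tensor).abs().mean().item():.2e}")
print(f"per-block  mean abs error: {(x - per_block).abs().mean().item():.2e}")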

MXFP8 hardware support is available on NVIDIA Blackwell-architecture GPUs (B100, B200, GB200). As MXFP8 support expands in the PyTorch ecosystem, it represents a path toward better quality at the same 8-bit memory footprint.

NVFP4: NVIDIA's 4-Bit Floating Point Format

NVFP4 is NVIDIA's proprietary 4-bit floating-point format, introduced with the Blackwell GPU architecture. At half a byte per weight, NVFP4 roughly halves the memory footprint compared to FP8, enabling models that were previously GPU-memory-bound to fit on smaller hardware or run larger batch sizes.

NVFP4 Inference Speed and VRAM Reduction

The memory savings with NVFP4 are significant. A 20B-parameter model stored in FP8 requires approximately 20 GB for weights alone. NVFP4 reduces that to approximately 10 GB. Combined with 4-bit Tensor Core operations on Blackwell GPUs, both memory bandwidth and compute throughput improve. For local inference workflows where VRAM is the primary constraint, this opens the possibility of running large video models on hardware that previously required FP8 or FP16 to function.

Trade-offs: When 4-Bit Is and Isn't Enough

Four bits provide only 16 discrete values per weight. Even with microscaling (NVFP4 uses block-level scaling similar to MXFP8), the reduced precision introduces more quantization noise than 8-bit formats. For video generation, where temporal consistency depends on precise weight values across frames, 4-bit quantization may produce visible artifacts in smooth motion sequences or fine texture details. The suitability of NVFP4 depends on the specific pipeline, resolution, and quality requirements.
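To build intuition for how coarse 16 levels are, the sketch below rounds weights onto the value grid of an E2M1-style 4-bit float, the layout generally described for NVFP4 (treat the exact grid as an illustrative assumption), and measures the error after applying a single scale:

import torch

# Magnitudes representable by a 4-bit E2M1 float; with signs that is 16 code points.
# The production format layers block scales on top of this grid.
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_GRID = torch.cat([-FP4_GRID.flip(0), FP4_GRID])

def round_to_grid(t: torch.Tensor, grid: torch.Tensor) -> torch.Tensor:
    # Snap every element to its nearest representable value.
    idx = (t.unsqueeze(-1) - grid).abs().argmin(dim=-1)
    return grid[idx]

w = torch.randn(4096) * 0.02          # typical small-magnitude weights
scale = w.abs().max() / 6.0           # map the largest weight onto the grid's maximum
q = round_to_grid(w / scale, FP4_GRID) * scale
print(f"mean abs error at 4 bits: {(w - q).abs().mean().item():.6f}")
print(f"distinct levels used: {q.unique().numel()}")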

An official Lightricks NVFP4 checkpoint is available on HuggingFace at Lightricks/LTX-2.3-nvfp4, built using quantization-aware distillation. Running it requires CUDA 12.7 or newer on a compatible GPU.

Comparing FP8, MXFP8, and NVFP4 for AI Video

| Format | Bits | Scaling | GPU Requirement |
| --- | --- | --- | --- |
| FP8 (E4M3) | 8 | Per-tensor or per-channel | Ada Lovelace, Hopper, Blackwell |
| MXFP8 | 8 | Block-level (32 elements) | Blackwell |
| NVFP4 | 4 | Block-level microscaling | Blackwell |

Which Format Should You Use?

If you are running LTX-2 locally today, FP8 is the practical choice and the only quantization format currently supported natively in the LTX-2 codebase. The fp8-cast policy works on any GPU with FP8 support (RTX 40-series and newer) and requires no additional dependencies. If you have access to a Hopper GPU and need maximum throughput, fp8-scaled-mm with TensorRT-LLM delivers native FP8 computation. MXFP8 and NVFP4 are forward-looking formats tied to Blackwell hardware; an official NVFP4 model is already published as a standalone HuggingFace checkpoint, while MXFP8 support continues to expand across the PyTorch ecosystem and other diffusion model deployments.

Combining Quantization with Other Optimizations

FP8 quantization is most effective when combined with other LTX-2 optimization techniques. The gradient estimation denoising loop reduces inference steps from 40 to 20-30 while maintaining quality, and attention optimizations through xFormers or Flash Attention 3 (for Hopper GPUs) reduce the compute cost of each step. For local inference, combining FP8 quantization with the distilled model variant (ltx-2.3-22b-distilled.safetensors with 8 predefined sigmas) and gradient estimation creates a pipeline that runs on 32 GB GPUs while maintaining production-level quality.

Getting Started

If you are setting up LTX-2 for local inference, start with FP8 Cast quantization. It requires no additional dependencies, works across GPU generations with FP8 support, and delivers a meaningful VRAM reduction with minimal quality impact. As your hardware and workflow evolve, keep an eye on the finer-grained MXFP8 format and the more aggressive compression of NVFP4.

Before starting, ensure your system has CUDA 13+ installed; this is a hard prerequisite for running LTX-2.

  1. Clone the LTX-2 repository and install dependencies: git clone https://github.com/Lightricks/LTX-2.git && cd LTX-2 && uv sync --frozen
  2. Download the model checkpoint and spatial upscaler from the LTX-2.3 HuggingFace repository
  3. Run your first quantized generation with the --quantization fp8-cast flag
  4. Monitor VRAM usage and adjust resolution and frame count to fit your hardware. Frame counts must follow the (F-1) % 8 == 0 constraint (valid examples: 57, 65, 97, 121).
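For step 4, a small helper saves a failed run: this plain-Python check (shown purely as a convenience, not part of the LTX-2 CLI) validates a frame count against the (F - 1) % 8 == 0 constraint and suggests the nearest valid value.

def nearest_valid_frame_count(frames: int) -> int:
    """LTX-2 requires (F - 1) % 8 == 0; snap a requested count to the nearest valid one."""
    if (frames - 1) % 8 == 0:
        return frames
    lower = ((frames - 1) // 8) * 8 + 1
    upper = lower + 8
    return lower if frames - lower <= upper - frames else upper

for f in (57, 60, 97, 100, 121):
    print(f, "->", nearest_valid_frame_count(f))
# 57 -> 57, 60 -> 57, 97 -> 97, 100 -> 97, 121 -> 121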

The quantization format you choose is one piece of a larger optimization stack. Combine it with the right pipeline selection, sampling configuration, and hardware-aware settings to get the most from local AI video generation.
