What Is A VAE (Variational Autoencoder)? Meaning & How They Work


What is a VAE?

Before a diffusion model generates anything, it needs a way to work in compressed form. The VAE is how that compression happens, and how it gets undone.

Definition

A VAE (Variational Autoencoder) is a neural network architecture with two components: an encoder that compresses high-dimensional input data (images or video frames) into a compact latent representation, and a decoder that reconstructs the original data from that representation.

In the context of generative video models, the VAE handles the translation between pixel space (the full-resolution video you see) and latent space (the compressed representation the diffusion or generation model actually works in).

The generation model never touches pixels directly. It operates on latent codes, and the VAE decodes those codes into the final video output.

Why generation models use VAEs

Running a diffusion model directly on raw pixels is computationally prohibitive for high-resolution video. A single frame at 4K resolution contains over 8 million pixels, each with three color channels. Generating video at that resolution by running diffusion over raw pixels would require orders of magnitude more compute than is practical.

A VAE encoder compresses each frame to a fraction of its original size, typically with a spatial compression factor of 8x or 16x. A 4K frame becomes a small latent tensor. The diffusion model runs in this compressed space, which is much cheaper. After generation is complete, the VAE decoder expands the latent back to full resolution.
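To make the savings concrete, here is a rough shape calculation. The 8x spatial factor matches the range mentioned above; the 16-channel latent is an illustrative assumption, not a specific model's configuration:

```python
# Illustrative latent-shape arithmetic for a VAE with 8x spatial compression.
# The latent channel count (16) is an assumed value for illustration.
H, W, C = 2160, 3840, 3              # a 4K frame in pixel space
f = 8                                 # spatial compression factor per side
latent_c = 16                         # latent channels (assumed)

latent_h, latent_w = H // f, W // f
pixel_values = H * W * C              # values per frame in pixel space
latent_values = latent_h * latent_w * latent_c

print(latent_h, latent_w)             # 270 480
print(pixel_values / latent_values)   # 12.0 -> 12x fewer values per frame
```

Even though each side shrinks 8x, the latent usually has more channels than the three color channels, so the net reduction in values per frame is smaller than 64x but still substantial.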

The quality of the VAE determines the quality ceiling of the overall system. A VAE that loses fine details in encoding will produce outputs that look soft or lack high-frequency texture, no matter how good the generation model is.

How VAEs work

The encoder takes an input image or frame and produces a distribution in latent space: a mean and a variance for each dimension of the latent code. Encoding to a distribution rather than a single fixed point is what allows sampling during training.

This stochasticity is what distinguishes a variational autoencoder from a standard autoencoder.

At training time, a sample is drawn from this distribution, and the decoder tries to reconstruct the original input from it.
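The sampling step is usually implemented with the reparameterization trick, which rewrites the random draw as a deterministic function of the mean, the variance, and an independent noise term, so gradients can flow back to the encoder. A minimal numpy sketch, where the stand-in encoder and all shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x):
    # Stand-in encoder: in a real VAE, mu and log_var come from a neural network.
    mu = np.full(4, x.mean())        # per-dimension means of the latent distribution
    log_var = np.zeros(4)            # per-dimension log-variances
    return mu, log_var

def sample_latent(mu, log_var):
    # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I).
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

mu, log_var = encode(np.ones(8))
z = sample_latent(mu, log_var)       # a sample the decoder would reconstruct from
print(z.shape)                       # (4,)
```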

The training loss has two components: reconstruction loss (how different is the output from the input?) and KL divergence (how far is the encoded distribution from a standard Gaussian?). The second term is what makes the latent space well-structured and smooth, enabling new samples to decode cleanly.
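Both terms can be written out directly. For a Gaussian encoder with mean `mu` and log-variance `log_var`, the KL divergence against a standard Gaussian has a closed form; the reconstruction term below uses mean squared error, a common choice, though the exact form varies by model:

```python
import numpy as np

def vae_loss(x, x_recon, mu, log_var, beta=1.0):
    # Reconstruction term: how different the decoded output is from the input.
    recon = np.mean((x - x_recon) ** 2)
    # Closed-form KL divergence between N(mu, sigma^2) and N(0, I).
    kl = -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))
    # beta weights how strongly the latent is pulled toward a standard Gaussian.
    return recon + beta * kl

x = np.zeros(4)
loss = vae_loss(x, x_recon=np.zeros(4), mu=np.zeros(4), log_var=np.zeros(4))
print(loss)   # 0.0: perfect reconstruction and a latent that already matches N(0, I)
```

Pushing the encoded distributions toward a standard Gaussian is what keeps nearby latent codes decoding to similar outputs, which is the smoothness property the paragraph above refers to.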

At inference in a generative pipeline, the encoder is used once at the start (to encode any conditioning frames) and the decoder is used once at the end (to decode the generated latent).

The diffusion process runs entirely in latent space between these two calls.
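Schematically, that pipeline looks like the sketch below. Every function body is a placeholder (a real pipeline would call the model's actual encoder, denoiser, and decoder), and the 8x factor and 16-channel latent are assumed values:

```python
import numpy as np

def vae_encode(frames):
    # Placeholder encoder: compress 8x per spatial side into 16 latent channels.
    t, h, w, c = frames.shape
    return np.zeros((t, h // 8, w // 8, 16))

def run_diffusion(latent, steps=4):
    # Placeholder for the iterative denoising loop; operates only on latents.
    for _ in range(steps):
        latent = latent * 0.9          # stand-in for one denoising step
    return latent

def vae_decode(latent):
    # Placeholder decoder: expand latents back to full-resolution pixels.
    t, h, w, _ = latent.shape
    return np.zeros((t, h * 8, w * 8, 3))

cond = np.zeros((1, 256, 256, 3))      # one conditioning frame, pixel space
latent = vae_encode(cond)              # encoder: called once, at the start
latent = run_diffusion(latent)         # diffusion runs entirely in latent space
video = vae_decode(latent)             # decoder: called once, at the end
print(video.shape)                     # (1, 256, 256, 3)
```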

VAE quality and video generation

Video VAEs face challenges beyond those of image VAEs: they must encode temporal as well as spatial information. A video VAE must compress not just what is in each frame but how frames relate to each other over time.

Poor temporal compression in the VAE produces videos where fine motion, such as subtle facial expressions or texture animation, is lost or inconsistent. The VAE effectively becomes the resolution and motion fidelity ceiling for the entire system.
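The shape arithmetic from earlier extends naturally to time. Assuming an 8x spatial and 4x temporal factor, and a causal scheme that keeps the first frame uncompressed (all of these are common design choices used here for illustration, not a specific model's documented configuration):

```python
# Spatiotemporal compression: illustrative factors, not a specific model's config.
T, H, W = 121, 1080, 1920    # ~5 seconds of 1080p video at 24 fps
fs, ft = 8, 4                # assumed spatial and temporal compression factors

latent_t = 1 + (T - 1) // ft # causal scheme: first frame kept, rest compressed 4x
latent_h, latent_w = H // fs, W // fs

print(latent_t, latent_h, latent_w)   # 31 135 240
```

The temporal factor `ft` is exactly where fine motion can be lost: four pixel-space frames collapse into one latent step, so motion finer than that window must survive inside the latent channels or it is gone.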

How LTX-2 uses its VAE

LTX-2.3 introduced a new VAE with improved detail preservation compared to earlier versions. The update specifically addressed fine detail retention and better fidelity to the conditioning input for image-to-video generation.

The LTX-2 VAE is part of the open-weight model release, and can be used independently of the full generation pipeline for compression and reconstruction tasks.