What Is A Diffusion Model? Definition & Types


What is a diffusion model?

In 2020, diffusion models were a niche research direction. Three years later, they were generating photorealistic images, synthetic speech, and molecular structures, having displaced GANs as the dominant generative architecture across nearly every modality.

Today, every major video generation system is built on a diffusion model. LTX-2 is no exception.

Diffusion model definition

A diffusion model is a generative AI model that learns to create data by reversing a gradual noising process. During training, it learns to undo noise, step by step.

At inference, it starts from random noise and progressively refines it into a coherent output: an image, a video clip, or an audio track.

The name comes from physics. Diffusion describes how particles spread from areas of high concentration to low concentration over time.

In the original formulation, training data is "diffused" by adding noise across many timesteps until it becomes indistinguishable from pure Gaussian noise. The model learns to reverse this.

How diffusion models work

The process has two phases: forward and reverse.

In the forward process, a training sample has noise added incrementally across T timesteps. At each step, a small amount of Gaussian noise is introduced until the original sample is completely destroyed.

The schedule governing how noise is added at each step is called the noise schedule.
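The forward process has a convenient closed form: you can jump straight to any timestep t without simulating every intermediate step. A minimal sketch, assuming a DDPM-style linear schedule (the specific beta values and function names here are illustrative, not from any particular implementation):

```python
import numpy as np

def make_linear_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """A common DDPM-style linear noise schedule; exact values are illustrative."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)  # cumulative signal retention at each step
    return alpha_bars

def forward_noise(x0, t, alpha_bars, rng=np.random.default_rng()):
    """Sample the noised x_t directly from the clean x_0 in closed form:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps
```

By the final timestep, alpha_bar is nearly zero, so almost none of the original signal remains; this is the "indistinguishable from pure Gaussian noise" endpoint described above.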

In the reverse process, the model learns to predict and remove noise at each timestep. Given a noisy sample at step t, it estimates what the sample looked like at step t-1.

By chaining these predictions from t=T back to t=0, it reconstructs a clean sample from pure noise.

At inference time, the forward process is skipped entirely. You sample random noise and run the reverse process. That is what makes generation possible: random noise in, coherent output out.
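The reverse process can be sketched as a loop that repeatedly applies the noise predictor. This is a simplified deterministic (DDIM-style) sampler, not any specific system's implementation; `predict_noise` is a placeholder for the trained denoising network:

```python
import numpy as np

def sample(predict_noise, shape, alpha_bars, steps=50,
           rng=np.random.default_rng()):
    """Run the reverse process: start from pure noise, repeatedly
    estimate and remove noise until a clean sample remains."""
    T = len(alpha_bars)
    timesteps = np.linspace(T - 1, 0, steps).astype(int)
    x = rng.standard_normal(shape)  # random noise in
    for i, t in enumerate(timesteps):
        eps = predict_noise(x, t)
        ab_t = alpha_bars[t]
        # Clean-sample estimate implied by the predicted noise
        x0_hat = (x - np.sqrt(1.0 - ab_t) * eps) / np.sqrt(ab_t)
        if i + 1 < len(timesteps):
            ab_prev = alpha_bars[timesteps[i + 1]]
            # Re-noise the estimate to the next (less noisy) timestep
            x = np.sqrt(ab_prev) * x0_hat + np.sqrt(1.0 - ab_prev) * eps
        else:
            x = x0_hat  # coherent output out
    return x
```

Note that `steps` can be far smaller than T: the sampler skips through the schedule, which is the basic idea behind fast samplers discussed later in this article.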

The core component is the denoising network, typically a neural network that takes the noisy input and the current timestep as inputs, then predicts the noise to subtract.

In modern systems this network is almost always a transformer or a U-Net variant.
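The training objective for such a network is surprisingly simple. A minimal sketch, assuming the standard noise-prediction loss (`denoise_net` is a stand-in for any network, not a real API):

```python
import numpy as np

def ddpm_training_loss(denoise_net, x0, alpha_bars,
                       rng=np.random.default_rng()):
    """One training step's loss: noise a clean sample at a random
    timestep, then score the network on how well it recovers that
    exact noise (mean squared error)."""
    t = int(rng.integers(len(alpha_bars)))
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    eps_pred = denoise_net(xt, t)  # network sees noisy input + timestep
    return np.mean((eps_pred - eps) ** 2)
```

The timestep input matters: the network must behave differently when the sample is almost clean versus almost pure noise, so t is fed in alongside the noisy data.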

Types of diffusion models

Not all diffusion models are identical. Several major variants have emerged:

Denoising Diffusion Probabilistic Models (DDPMs) are the original formulation, introduced by Ho et al. in 2020.

They define the noising process as Markovian (each step depends only on the previous) and train the model to predict the noise added at each step.

Score-based generative models take a different mathematical angle. Instead of predicting noise directly, they estimate the gradient of the log-probability of the data (the "score"), which points toward higher-density regions and guides denoising.

Score-based models and DDPMs are mathematically equivalent under certain parameterizations.
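The equivalence can be stated compactly. Under the usual DDPM parameterization (same alpha-bar as in the forward process above), the noise predictor is just a rescaled score estimate:

```latex
% Score function: gradient of the log-density of the noised data at step t
s_\theta(x_t, t) \approx \nabla_{x_t} \log p_t(x_t)

% Relation to DDPM's noise prediction
\epsilon_\theta(x_t, t) = -\sqrt{1 - \bar{\alpha}_t}\, s_\theta(x_t, t)
```

So a model trained to predict noise implicitly learns the score, and vice versa, which is why the two families can share samplers and analysis.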

Latent diffusion models (LDMs) run the diffusion process not on raw pixels but in a compressed latent space learned by a variational autoencoder (VAE).

This is the architecture behind Stable Diffusion and LTX-2. Operating in latent space reduces the computational cost of training and inference substantially without sacrificing output quality.
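Structurally, latent diffusion just wraps the reverse process between a sample in latent space and a single VAE decode. A sketch of the inference pipeline, where `decode` and `run_reverse_diffusion` are placeholders rather than a real API:

```python
import numpy as np

def generate_with_ldm(decode, run_reverse_diffusion, latent_shape,
                      rng=np.random.default_rng()):
    """Latent diffusion inference: all denoising happens in the compact
    latent space; the VAE decoder runs exactly once at the end."""
    z = rng.standard_normal(latent_shape)  # noise in latent space, not pixels
    z = run_reverse_diffusion(z)           # the expensive iterative part
    return decode(z)                       # map latents back to pixels/frames
```

Because the latent tensor is far smaller than the pixel tensor it represents, every one of the many denoising steps is proportionally cheaper, which is where the compute savings come from.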

Diffusion Transformers (DiTs) replace the U-Net denoising network with a transformer architecture. Introduced by Peebles and Xie in 2022, DiTs scale better with model size and have become the dominant backbone for state-of-the-art video and image models. LTX-2 is a DiT operating in latent space.

Flow Matching reframes generative modeling as learning a flow field that transforms a noise distribution into the data distribution.

It produces straighter probability paths than traditional diffusion, which means fewer sampling steps and faster inference. LTX-2 uses Flow Matching as its training objective, contributing to generation at 1/5 to 1/10 the compute cost of earlier models.
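The Flow Matching objective can be sketched with the common straight-line (linear interpolation) path; this is a generic illustration of the technique, not LTX-2's actual training code:

```python
import numpy as np

def flow_matching_loss(velocity_net, x1, rng=np.random.default_rng()):
    """Conditional flow matching with a straight-line path.
    x0 ~ N(0, I) is noise, x1 is data; along the path
    x_t = (1 - t) * x0 + t * x1, the target velocity is simply x1 - x0."""
    x0 = rng.standard_normal(x1.shape)
    t = rng.random()
    xt = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0               # constant along the whole path
    v_pred = velocity_net(xt, t)
    return np.mean((v_pred - v_target) ** 2)
```

Because the target path is a straight line, the learned flow tends to be nearly straight too, so an ODE solver can traverse it in just a handful of steps at inference time.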

A brief history of diffusion models

The probabilistic foundations trace to Sohl-Dickstein et al. (2015), but the modern era started in 2020. Ho et al.'s DDPM paper demonstrated that diffusion models could generate high-quality images competitive with GANs.

Song and Ermon's score-based modeling work, together with its later continuous-time (SDE) extension, provided a generalization that unified several threads of research.

The field moved fast after that. DALL-E 2, Stable Diffusion, and Imagen applied diffusion to text-conditional generation at scale. Rombach et al.'s latent diffusion paper (2022) made high-resolution image generation computationally tractable.

By 2023, the architecture had expanded to video, audio, 3D, and molecular biology. LTX-2 is a direct continuation of this line: a video-native diffusion transformer trained with Flow Matching, designed for production deployment.

Why diffusion models power video generation

Generating video is harder than generating images. A 5-second clip at 720p contains hundreds of individual frames that must be not only individually coherent but temporally consistent.

Objects cannot flicker or drift between frames, lighting should stay consistent across the clip, and motion needs to look physically plausible.

Diffusion models handle this well for two reasons. First, iterative refinement lets the model gradually enforce both spatial (per-frame) and temporal (across-frame) constraints.

Second, operating in latent space makes the dimensionality of video tractable without prohibitive compute.

LTX-2 extends this with spatiotemporal attention, which allows the model to reason about spatial content within frames and temporal relationships across frames simultaneously, within a single transformer architecture.

This is what produces the motion quality and temporal consistency that frame-by-frame approaches struggle to match.

Diffusion models vs. other generative approaches

Before diffusion, generative AI was dominated by GANs and VAEs. GANs produce sharp outputs but are unstable to train and prone to mode collapse, where the model generates only a limited range of outputs.

VAEs are stable but tend to produce blurry results. Autoregressive models generate token by token and scale well for text but are computationally expensive for high-dimensional data like video.

Diffusion models sit at a different tradeoff: slower sampling than GANs (requiring multiple forward passes), but better coverage of the data distribution, more stable training, and more controllable outputs through techniques like classifier-free guidance (CFG).
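The controllability point is concrete: classifier-free guidance is a one-line combination of two noise predictions, one with the conditioning (e.g. the text prompt) and one without. A minimal sketch (`cfg_combine` and the scale value are illustrative):

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, guidance_scale=7.5):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one. A scale of 1 recovers plain
    conditional sampling; larger values trade diversity for prompt
    adherence."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

At each denoising step the network is run twice (with and without the condition), and this combined prediction is used in place of the raw conditional output.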

The speed gap has closed significantly through work on reducing sampling steps (DDIM, Flow Matching, Consistency Models). LTX-2's use of Flow Matching is a direct result of this research direction.

How LTX-2 uses diffusion

LTX-2 is a 20.9-billion-parameter multimodal diffusion transformer. It operates in a compressed latent space encoded by LTX-2's custom VAE, processes text, image, audio, and video as unified conditioning signals, and uses Flow Matching as its core training objective.

The outputs: native 4K, up to 50 fps, synchronized audio-video generation, and inference at 1/5 to 1/10 the compute cost of comparable models. It runs locally on consumer-grade GPUs with no cloud dependency required, via LTX Desktop.

For developers integrating via the LTX-2 API or running the open-weight model directly, the diffusion process is what runs every time you call a generation endpoint.

The sampling steps parameter controls how many reverse-diffusion steps the model takes. More steps produce higher-quality outputs at the cost of additional compute.

LTX-2's Flow Matching architecture means you get strong results at step counts that would produce poor outputs from a traditional DDPM.