What is transformer architecture?
In 2017, Google researchers published "Attention Is All You Need" and displaced the dominant design pattern in deep learning. The recurrent neural network era ended; the transformer era began. Seven years later, almost every high-performance AI model in existence is built on some variant of the architecture that paper introduced.
Definition
Transformer architecture is a neural network design that processes input sequences by computing attention: a mechanism that determines, for each element in the sequence, how much it should be influenced by every other element. Unlike recurrent networks (LSTMs, GRUs) that process sequences step by step, transformers process the entire sequence in parallel, enabling much faster training on modern hardware.
The original transformer was designed for text. An attention head in a language model computes relationships between words in a sentence. The same principle applies to images (relationships between patches), audio (relationships between frames of a spectrogram), and video (relationships between regions across space and time).
Core components
Self-attention is the mechanism that computes relationships between all elements in a sequence. For each element, a query, a key, and a value vector are computed via learned projections. The attention score between two elements is the dot product of their query and key vectors, scaled by the square root of the key dimension and passed through a softmax, then used to weight the values. The output for each position is a weighted sum of all values.
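The mechanism described above can be sketched in a few lines of NumPy. This is an illustrative implementation of scaled dot-product attention, not any particular model's code; the function and variable names are our own.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k), V: (seq_len, d_v). Returns (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # pairwise query-key similarities
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of all values
```

Because every position's output depends on every other position's value, one matrix multiply relates the whole sequence at once — this is the operation the rest of the architecture is built around.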
Multi-head attention runs multiple attention operations in parallel with different learned projections, allowing the model to attend to different types of relationships simultaneously.
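A minimal sketch of the multi-head variant, again illustrative rather than any production implementation: the model dimension is split across heads, each head runs attention with its own projection of the inputs, and the results are concatenated and projected back.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """X: (seq, d_model); Wq/Wk/Wv/Wo: (d_model, d_model) learned projections."""
    seq, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv

    def split(M):  # (seq, d_model) -> (n_heads, seq, d_head)
        return M.reshape(seq, n_heads, d_head).transpose(1, 0, 2)

    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    heads = softmax(scores) @ Vh                      # per-head attention
    concat = heads.transpose(1, 0, 2).reshape(seq, d_model)
    return concat @ Wo                                # merge heads
```

Each head sees a different learned projection of the same sequence, which is what lets one head track, say, syntactic relationships while another tracks positional ones.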
Feed-forward layers follow each attention block, applying a two-layer MLP to each position independently. These layers are where most of the model's parameter count lives.
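"Applied to each position independently" means the same two-layer MLP transforms every row of the sequence separately, with no mixing across positions. A hedged sketch (the hidden dimension is typically several times the model dimension, which is why these layers dominate the parameter count):

```python
import numpy as np

def feed_forward(X, W1, b1, W2, b2):
    """Position-wise MLP. X: (seq, d_model); W1: (d_model, d_ff);
    W2: (d_ff, d_model). Each row of X is transformed independently."""
    return np.maximum(0, X @ W1 + b1) @ W2 + b2  # ReLU between the two layers
```

Position independence has a testable consequence: permuting the input rows permutes the output rows identically, with no other change.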
Layer normalization stabilizes training by normalizing the inputs to each sub-layer.
Positional encoding injects information about the position of each element in the sequence, since attention is position-agnostic without it.
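The original paper's sinusoidal scheme is one concrete way to do this (many modern models use learned or rotary embeddings instead): each position gets a fixed pattern of sines and cosines at different frequencies, added to the token embeddings.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dims: sine
    pe[:, 1::2] = np.cos(angles)                 # odd dims: cosine
    return pe
```

Without some encoding like this, attention would produce the same output for any reordering of the input tokens.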
Why transformers dominate
Recurrent networks process sequences one step at a time, which creates two problems: they cannot be parallelized easily during training, and they struggle to learn dependencies between elements that are far apart in the sequence.
Transformers solve both. Every element attends to every other element in a single operation, which parallelizes across GPU cores. Long-range dependencies are as easy to learn as short-range ones. This combination made transformers the architecture of choice for language models (GPT, BERT), image models (ViT), audio models (Whisper), and eventually video models.
Diffusion Transformers (DiT)
The Diffusion Transformer (DiT), introduced by Peebles and Xie in 2022, applies transformer architecture to the denoising network inside diffusion models, replacing the earlier U-Net backbone.
The key advantage is scalability: DiTs follow neural scaling laws more cleanly than U-Nets. More parameters, more compute, and more data consistently improve quality. This makes DiT a better foundation for scaling video generation to higher resolutions and longer clips.
How LTX-2 uses transformer architecture
LTX-2 is a Diffusion Transformer (DiT) with 20.9 billion parameters, operating in compressed latent space. It uses spatiotemporal attention: a single unified operation that attends across both spatial dimensions within frames and the temporal dimension across frames.
This joint spatiotemporal attention is what enables LTX-2 to reason about how objects move over time, rather than treating motion as a post-hoc consideration applied to independently generated frames.
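The idea behind joint spatiotemporal attention can be illustrated schematically (this is our own sketch, not LTX-2's actual implementation): a latent video of shape (frames, height, width, channels) is flattened into one token sequence, so a single attention pass relates every patch to every other patch, across frames as well as within them.

```python
import numpy as np

def flatten_spatiotemporal(latent):
    """latent: (T, H, W, C) -> tokens: (T*H*W, C).
    Running attention over this sequence couples space and time in one
    operation, instead of attending within each frame separately."""
    T, H, W, C = latent.shape
    return latent.reshape(T * H * W, C)
```

Contrast this with frame-by-frame generation, where attention would only ever see H*W tokens at a time and motion would have to be reconciled afterward.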
The architecture was designed with research extensibility in mind. The full technical report is available through the LTX academic programs page, and the open weights support research extensions and fine-tuning through standard PyTorch and Diffusers interfaces.