- LoRA fine-tuning adds small trainable adapter layers to a frozen LTX-2 base model, enabling custom styles, effects, and domain-specific behaviors with a fraction of the memory and compute required for full fine-tuning.
- The training workflow follows four steps: organize and split videos into scenes, generate captions, preprocess into cached latents and embeddings, then configure and run training via a YAML file on a single or multi-GPU setup.
- Trained LoRA adapters are small portable files that load alongside the base model at inference time, and multiple adapters can be stacked for combined effects — with an 80GB+ GPU recommended, or 32GB with the low-VRAM INT8 configuration.
LoRA fine-tuning has transformed how practitioners customize generative models without retraining them from scratch. In the context of video generation, LoRA (Low-Rank Adaptation) lets you train lightweight adapter layers that modify a frozen base model's behavior — learning new styles, effects, or domain-specific content while keeping the original model weights intact.
The result is a small, portable weights file (typically a few hundred megabytes) that can be loaded alongside the base model at inference time.
This tutorial walks through the complete LoRA training workflow using the LTX-2 open-source trainer, covering dataset preparation, configuration, training execution, and running inference with your custom adapter.
What Is LoRA Fine-Tuning for Video Models?
LoRA fine-tuning adds small, trainable adapter layers to specific modules of a pre-trained model while keeping the base model frozen. Instead of updating all 14 billion parameters in LTX-2's video stream (or the 5 billion in its audio stream), LoRA trains only a fraction of additional parameters — dramatically reducing both memory requirements and training time.
For video generation specifically, LoRA training must account for temporal consistency across frames. LTX-2's architecture processes video through a diffusion transformer with 48 shared transformer blocks, where 3D RoPE positional encoding handles spatial and temporal dimensions simultaneously. This means LoRA adapters trained on LTX-2 can learn motion patterns, visual styles, and temporal behaviors — not just static visual features.
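To make the mechanism concrete, here is a minimal, illustrative sketch of the LoRA pattern applied to a single linear layer. This is not the LTX-2 trainer's actual implementation, just the general idea it builds on: a frozen base weight plus a trainable low-rank update, scaled by alpha / rank.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer with a trainable low-rank adapter added on top."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # base model stays frozen
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)   # down-projection
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)  # up-projection
        nn.init.zeros_(self.lora_b.weight)  # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Only the two small adapter matrices receive gradients.
layer = LoRALinear(nn.Linear(4096, 4096), rank=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"Trainable adapter parameters: {trainable:,}")  # 131,072 vs. ~16.8M in the base layer

Only the adapter matrices are saved at the end of training, which is why the resulting weights file stays small even when the base model has billions of parameters.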
When to Use LoRA vs Full Fine-Tuning
Standard LoRA is ideal for learning specific styles, effects, or concepts. It requires significantly less memory and compute than full fine-tuning, produces small portable weight files, and trained adapters can be easily combined with other LoRAs during inference. Full model fine-tuning, by contrast, updates all parameters and offers maximum flexibility — but it requires distributed training across multiple GPUs using FSDP, and produces checkpoint files that are tens of gigabytes in size.
For most practitioners customizing a video generation model for a specific domain, LoRA is the recommended starting point.
Prerequisites and Hardware Requirements
Before starting, make sure your environment meets the requirements documented for the LTX-2 trainer:
• GPU: An NVIDIA GPU with 80GB+ VRAM is recommended for the standard configuration. For GPUs with 32GB VRAM (such as the RTX 5090), a low-VRAM configuration is available that enables INT8 quantization and other memory optimizations.
• Operating system: Linux with CUDA 13+ (the trainer requires triton, which is Linux-only).
• Model checkpoint: A local .safetensors file containing the LTX-2 model weights, downloaded from HuggingFace.
• Gemma text encoder: The Gemma 3 model directory, downloaded from HuggingFace.
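To confirm which of these tiers your GPU falls into before committing to a run, a few lines of PyTorch (assuming it is installed in your environment) are enough:

import torch

if not torch.cuda.is_available():
    raise SystemExit("No CUDA device found - the LTX-2 trainer needs an NVIDIA GPU on Linux.")

props = torch.cuda.get_device_properties(0)
vram_gb = props.total_memory / 1024**3
print(f"GPU: {props.name}, VRAM: {vram_gb:.0f} GB")
if vram_gb >= 80:
    print("The standard LoRA configuration should fit.")
elif vram_gb >= 32:
    print("Use the low-VRAM configuration with INT8 quantization.")
else:
    print("Below the documented minimum for local training.")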
If these hardware requirements exceed your current setup, the LTX-2 hosted API provides a lower-barrier alternative for video generation without local GPU resources.
Installation
Clone the LTX-2 repository and install dependencies from the repository root:
git clone https://github.com/Lightricks/LTX-2
cd LTX-2
uv sync
cd packages/ltx-trainer
The trainer package depends on ltx-core and ltx-pipelines, which are automatically installed from the monorepo.
Preparing Your Training Dataset
Dataset preparation follows a structured workflow: optionally split long videos into scenes, optionally generate captions, then preprocess everything into cached latents and embeddings.
Step 1: Organize Your Videos
If you are starting with long-form footage, split it into shorter, coherent scenes:
uv run python scripts/split_scenes.py input.mp4 scenes_output_dir/ --filter-shorter-than 5s
Step 2: Generate Captions
If your dataset does not include captions, generate them automatically:
uv run python scripts/caption_videos.py scenes_output_dir/ --output scenes_output_dir/dataset.json
The captioner supports two modes: qwen_omni (local, default) and gemini_flash (API-based). For lower VRAM usage, add the --use-8bit flag.
Step 3: Preprocess the Dataset
This step computes and caches video latents and text embeddings. If you are training a video-only LoRA, run:
uv run python scripts/process_dataset.py dataset.json --resolution-buckets "960x544x49" --model-path /path/to/ltx-2-model.safetensors --text-encoder-path /path/to/gemma-model
If you are training an audio-video LoRA (using ltx2_av_lora.yaml), add the --with-audio flag so audio latents are generated:
uv run python scripts/process_dataset.py dataset.json --resolution-buckets "960x544x49" --model-path /path/to/ltx-2-model.safetensors --text-encoder-path /path/to/gemma-model --with-audio
To add a trigger word that activates your LoRA during inference, include --lora-trigger "MYTRIGGER" in the preprocessing command. This prepends the token to all captions, so you can activate the adapter at inference time by including the trigger in your prompt.
Resolution buckets define the target width, height, and frame count, formatted as WxHxF. Spatial dimensions must be multiples of 32, and the frame count F must satisfy (F - 1) % 8 == 0. The bucket 960x544x49 used above is valid because 960 and 544 are multiples of 32 and 48 is divisible by 8.
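If you want to verify a bucket string before kicking off preprocessing, a tiny helper (not part of the trainer, purely illustrative) captures both rules:

def validate_bucket(bucket: str) -> None:
    """Check a WxHxF resolution bucket against the constraints above."""
    width, height, frames = (int(v) for v in bucket.split("x"))
    if width % 32 or height % 32:
        raise ValueError(f"{bucket}: width and height must be multiples of 32")
    if (frames - 1) % 8:
        raise ValueError(f"{bucket}: frame count must satisfy (F - 1) % 8 == 0")

validate_bucket("960x544x49")    # passes: multiples of 32, and 48 % 8 == 0
# validate_bucket("960x544x50")  # would raise: 49 is not divisible by 8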
Configuring Your LoRA Training
Training is driven by YAML configuration files. The LTX-2 trainer includes ready-made examples for audio-video LoRA, low-VRAM LoRA, and IC-LoRA training. Before running training, open your chosen config and update at minimum these four fields:
model:
  model_path: /path/to/ltx-2-model.safetensors
  text_encoder_path: /path/to/gemma-model
data:
  preprocessed_data_root: /path/to/preprocessed-dataset
output_dir: /path/to/save-lora-weights
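Before launching a long run, a quick pre-flight check that these input paths actually exist on disk can save a wasted job. A minimal sketch, assuming PyYAML is installed and that the fields are nested as shown above:

from pathlib import Path
import yaml

# Illustrative pre-flight check; adjust the keys if your config nests them differently.
config = yaml.safe_load(Path("configs/ltx2_av_lora.yaml").read_text())
paths_to_check = {
    "model.model_path": config["model"]["model_path"],
    "model.text_encoder_path": config["model"]["text_encoder_path"],
    "data.preprocessed_data_root": config["data"]["preprocessed_data_root"],
}
for name, value in paths_to_check.items():
    status = "OK" if Path(value).exists() else "MISSING"
    print(f"{status:7s} {name} -> {value}")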
Running the Training Job
Single-GPU Training
Start training with a single command:
uv run python scripts/train.py configs/ltx2_av_lora.yaml
Multi-GPU Training
For distributed training, the LTX-2 trainer uses Hugging Face Accelerate with DDP and FSDP support. The basic multi-GPU command is:
uv run accelerate launch scripts/train.py configs/ltx2_av_lora.yaml
For a specific Accelerate configuration, use the pre-built configs in configs/accelerate/:
uv run accelerate launch --config_file configs/accelerate/ddp.yaml scripts/train.py configs/ltx2_av_lora.yaml
Ready-to-use configs include ddp.yaml, ddp_compile.yaml, fsdp.yaml, and fsdp_compile.yaml.
Using Your Trained LoRA for Inference
After training completes, load the trained LoRA using SingleGPUModelBuilder from ltx_core.loader. Multiple LoRA adapters can be stacked by chaining additional .lora() calls:
from ltx_core.loader import SingleGPUModelBuilder
import torch

builder = SingleGPUModelBuilder(
    model_class_configurator=...,
    model_path="/path/to/ltx-2-model.safetensors",
).lora("path/to/your_lora.safetensors", strength=0.8)

model = builder.build(device=torch.device("cuda"))
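Stacking works by chaining further .lora() calls on the builder. The adapter paths and strength values below are placeholders for illustration:

builder = (
    SingleGPUModelBuilder(
        model_class_configurator=...,
        model_path="/path/to/ltx-2-model.safetensors",
    )
    .lora("path/to/style_lora.safetensors", strength=0.8)   # hypothetical style adapter
    .lora("path/to/effect_lora.safetensors", strength=0.5)  # hypothetical effect adapter
)
model = builder.build(device=torch.device("cuda"))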
Conclusion
LoRA fine-tuning makes it practical to customize a state-of-the-art video generation model for your specific needs without massive compute budgets. The LTX-2 open-source trainer provides a complete, configuration-driven workflow — from scene splitting and captioning through preprocessing, training, and inference.
Ready to start training? Clone the LTX-2 repository, prepare your dataset, and run your first LoRA training job. Join the LTX community on Discord to share your results and connect with other practitioners.