- LTX-2 generates video natively at 25fps, but the KeyframeInterpolationPipeline can synthesize intermediate frames between keyframes to effectively double the frame rate for 50/60fps delivery without the VRAM cost of generating every frame from scratch.
- Diffusion-based interpolation outperforms traditional optical flow methods (RIFE, FILM) in complex scenarios involving camera movement, occlusion, and style-transferred content because the model understands 3D scene structure rather than just pixel displacement.
- The core workflow is: generate base video at 25fps, extract keyframes, run interpolation to fill gaps, then merge and export — using FP8 quantization and consistent prompts across both stages for best results.
AI video generation models produce output at fixed frame rates, and that rate is often lower than what delivery platforms require. A 25fps generation looks smooth in isolation but falls short when the target is 60fps for social media, gaming content, or broadcast compositing. Traditional frame rate conversion duplicates or blends existing frames, producing motion blur and judder. Frame interpolation generates entirely new intermediate frames, and when it is driven by a diffusion model that understands the scene, the results are significantly better.
LTX-2 includes two pipelines that handle frame interpolation natively: the KeyframeInterpolationPipeline (the primary tool for frame-rate upscaling between defined image keyframes) and the ICLoraPipeline (a complementary tool for video-to-video transformations). This guide explains both approaches, when to use each, and how to build a workflow that takes a generated video from its native frame rate to 60fps.
30fps vs 60fps: Why Frame Rate Matters for AI Video
The difference between 30fps and 60fps is perceptual, not just numerical. At 30fps, fast-moving objects exhibit noticeable motion blur and stuttering during camera pans. At 60fps, motion appears continuous and natural. For AI-generated video, frame rate also affects how viewers evaluate quality. Lower frame rates make temporal artifacts (flickering, warping) more visible because each frame is displayed longer.
When You Need 60fps
Social media platforms like TikTok and Instagram Reels play back most smoothly at 60fps. Gaming and interactive content require 60fps as a baseline. VFX compositing pipelines expect source material at the delivery frame rate to avoid conversion artifacts when layering AI-generated elements over live-action footage. Marketing teams producing product demos and tutorials benefit from the perceived quality lift that higher frame rates provide.
Why AI Video Models Generate at Lower Frame Rates
Frame count directly multiplies VRAM consumption and compute time. The Video VAE in LTX-2 compresses frames temporally by a factor of 8, so the frame count must satisfy (F-1) % 8 == 0 (valid counts: 9, 17, 25, 33, 41, 49, 57, 65, 73, 81, 89, 97). Generating at 60fps for even a few seconds requires significantly more frames, pushing VRAM requirements beyond what most hardware can handle in a single pass. Frame interpolation is the practical solution: generate at a lower frame rate, then synthesize the intermediate frames.
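As a back-of-the-envelope sketch (the four-second duration here is illustrative, not from the LTX-2 docs), the arithmetic below shows how quickly frame counts, and with them VRAM, grow at 60fps:

```python
# Rough cost arithmetic: frame count scales with duration x fps, and
# VRAM and compute scale with frame count. Counts must satisfy
# (F - 1) % 8 == 0, so round down to the largest valid count that fits.
seconds = 4
for fps in (25, 60):
    raw = seconds * fps
    valid = raw - ((raw - 1) % 8)  # largest count with (F - 1) % 8 == 0
    print(f"{seconds}s at {fps}fps: {raw} raw -> {valid} usable frames")
# 4s at 25fps: 100 raw -> 97 usable frames  (the count used in Step 1 below)
# 4s at 60fps: 240 raw -> 233 usable frames (~2.4x the frames, and the VRAM)
```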
What Is Frame Interpolation?
Frame interpolation generates new frames between existing ones by predicting what the scene looks like at intermediate points in time. Unlike frame rate conversion (which resamples or duplicates), interpolation creates genuinely new visual information.
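To make that concrete, here is a minimal sketch of the time grid when doubling 25fps: duplication reuses existing timestamps, while interpolation targets the midpoints between them (illustrative Python only):

```python
# Doubling 25fps: original frames sit at multiples of 1/25 s; interpolation
# synthesizes new frames at the midpoints, timestamps that duplication or
# blending never actually sample.
src_fps = 25
dt = 1.0 / src_fps
original = [i * dt for i in range(5)]            # t = 0.00, 0.04, 0.08, ...
midpoints = [t + dt / 2 for t in original[:-1]]  # new frames at t + 0.02

doubled = sorted(original + midpoints)           # effective 50fps timeline
print([round(t, 2) for t in doubled])
# [0.0, 0.02, 0.04, 0.06, 0.08, 0.1, 0.12, 0.14, 0.16]
```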
Traditional Methods vs Diffusion-Based Interpolation
Traditional optical flow methods (RIFE, FILM) estimate pixel motion between adjacent frames and warp them to create intermediates. These work well for smooth, predictable motion but struggle with occlusions, complex camera movements, and scenes where new content enters the frame. Diffusion-based interpolation, as implemented in LTX-2's pipelines, conditions the generation on actual keyframes and uses the full diffusion model to synthesize intermediate frames. This produces more coherent results for complex motion because the model understands scene structure, not just pixel displacement.
How LTX-2 Handles Frame Interpolation
LTX-2 provides two pipelines for frame interpolation, each designed for different use cases. The KeyframeInterpolationPipeline is the dedicated interpolation tool; the ICLoraPipeline is a video-to-video transformation tool that complements interpolation workflows.
KeyframeInterpolationPipeline
The KeyframeInterpolationPipeline generates video by interpolating between keyframe images. It uses guiding latents (additive conditioning) instead of replacing latents for smoother transitions. This pipeline runs as a two-stage process: Stage 1 generates at half resolution with multimodal guidance, Stage 2 upsamples to 2x resolution. It supports image conditioning at multiple keyframe positions.
Best for: Interpolating between keyframe images, creating smooth transitions, animation and motion interpolation tasks where you define the start and end frames (and optionally intermediate keyframes).
ICLoraPipeline
The ICLoraPipeline performs video-to-video transformations using IC-LoRA (In-Context LoRA). It conditions on reference videos and images at specific frames, using CFG guidance in Stage 1 and upsampling in Stage 2. IC-LoRA can only be used with the distilled model.
Best for: Video-to-video transformations where you have a reference video to guide generation, or image-to-video with strong temporal control. It preserves the motion structure of the reference while applying the generation model's understanding of the scene.
Step-by-Step: Upscale AI Video from 30fps to 60fps
The workflow for frame rate upscaling with LTX-2 involves generating a base video, extracting keyframes, running interpolation, and merging the results.
Prerequisites: LTX-2 requires CUDA 13+ as a hard prerequisite. Native generation rates documented for LTX-2 are 24, 25, 48, and 50fps; 60fps is not a native generation rate. The "30fps to 60fps" framing in the title is industry shorthand for a doubled effective playback rate. The workflow below generates natively at 25fps and doubles the output, which you can deliver at 50fps to stay within native LTX-2 rates or re-render to 60fps in your delivery encoder for platforms that expect it.
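If your platform requires 60fps specifically, one delivery-stage option is a re-render with ffmpeg's motion-interpolation filter, as in the sketch below (file names are placeholders):

```python
import subprocess

# Delivery-stage option: re-render the 50fps merged output to 60fps with
# ffmpeg's minterpolate filter. Codec and quality settings should match
# your platform's requirements.
subprocess.run([
    "ffmpeg", "-i", "merged_50fps.mp4",
    "-vf", "minterpolate=fps=60:mi_mode=mci",
    "-c:v", "libx264", "-crf", "18",
    "delivery_60fps.mp4",
], check=True)
```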
Step 1: Generate Your Base Video
Generate your video using the TI2VidTwoStagesPipeline (recommended for production quality) or the DistilledPipeline (for fastest inference). Set the frame rate to 25fps (the documented LTX-2 native rate) and a valid frame count like 97 frames (approximately 3.9 seconds at 25fps).
python -m ltx_pipelines.ti2vid_two_stages --checkpoint-path /path/to/ltx-2.3-22b-dev.safetensors --distilled-lora /path/to/distilled_lora.safetensors 0.8 --spatial-upsampler-path /path/to/upsampler.safetensors --gemma-root /path/to/gemma --prompt "A person walking through a sunlit forest, camera slowly tracking forward" --output-path base_video.mp4 --num-frames 97 --frame-rate 25
Step 2: Extract Keyframes for Interpolation
From the base video, extract frames to serve as keyframe anchors. For doubling the frame rate, adjacent extracted frames form the keyframe pairs that interpolation will fill. Use ffmpeg or a Python script to extract and prepare the frame images, as in the sketch below.
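One way to do the extraction (a sketch; file names and directory layout are placeholders):

```python
import subprocess
from pathlib import Path

# Dump every frame of the base video as a numbered PNG, then pair up
# adjacent frames as interpolation keyframes.
out_dir = Path("keyframes")
out_dir.mkdir(exist_ok=True)

subprocess.run([
    "ffmpeg", "-i", "base_video.mp4",
    str(out_dir / "frame_%04d.png"),
], check=True)

frames = sorted(out_dir.glob("frame_*.png"))
pairs = list(zip(frames, frames[1:]))  # adjacent frames -> keyframe pairs
print(f"{len(frames)} frames -> {len(pairs)} keyframe pairs")
```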
Step 3: Run Keyframe Interpolation
Use the KeyframeInterpolationPipeline to generate intermediate frames between each keyframe pair. This pipeline uses guiding latents rather than replacing latents, which produces smoother transitions. The pipeline takes keyframe images as conditioning input alongside the standard checkpoint and prompt arguments.
python -m ltx_pipelines.keyframe_interpolation --checkpoint-path /path/to/ltx-2.3-22b-dev.safetensors --distilled-lora /path/to/distilled_lora.safetensors 0.8 --spatial-upsampler-path /path/to/upsampler.safetensors --gemma-root /path/to/gemma --keyframe-paths /path/to/keyframe_start.png /path/to/keyframe_end.png --prompt "A person walking through a sunlit forest" --output-path interpolated_segment.mp4
The exact name of the keyframe input flag may vary across releases. Run python -m ltx_pipelines.keyframe_interpolation --help to confirm the keyframe argument names for your build, then substitute the placeholder image paths above with your extracted keyframes.
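Building on that caveat, a hypothetical batch driver might loop over the keyframe pairs from Step 2 and invoke the CLI once per pair. The --keyframe-paths flag mirrors the example above and must be confirmed against --help; the LoRA, upsampler, and Gemma flags from the earlier command are omitted here for brevity but still required:

```python
import subprocess

# Hypothetical batch driver over the keyframe pairs built in Step 2.
CHECKPOINT = "/path/to/ltx-2.3-22b-dev.safetensors"  # placeholder path
PROMPT = "A person walking through a sunlit forest"

for i, (start, end) in enumerate(pairs):
    subprocess.run([
        "python", "-m", "ltx_pipelines.keyframe_interpolation",
        "--checkpoint-path", CHECKPOINT,
        "--keyframe-paths", str(start), str(end),
        "--prompt", PROMPT,
        "--output-path", f"interpolated_{i:04d}.mp4",
    ], check=True)
```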
Step 4: Merge and Export at 60fps
Interleave the original frames with the generated intermediate frames and export the doubled sequence at 50fps, re-rendering to 60fps in your delivery encoder if the platform requires it (see the prerequisites note above). The original frames maintain temporal anchoring while the interpolated frames fill the gaps. Quality assessment at this stage should focus on motion smoothness in areas of fast movement and on consistency in fine details like facial features and text.
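A minimal merge sketch, assuming the original frames from Step 2 live in keyframes/ and the generated intermediates in intermediates/ (both names are placeholders):

```python
import shutil
import subprocess
from pathlib import Path

# Interleave originals with generated intermediates into one numbered
# sequence, then encode at 50fps (double the 25fps base).
merged = Path("merged")
merged.mkdir(exist_ok=True)
originals = sorted(Path("keyframes").glob("frame_*.png"))
intermediates = sorted(Path("intermediates").glob("mid_*.png"))

idx = 0
for i, frame in enumerate(originals):
    shutil.copy(frame, merged / f"out_{idx:05d}.png")
    idx += 1
    if i < len(intermediates):  # one new frame after each original
        shutil.copy(intermediates[i], merged / f"out_{idx:05d}.png")
        idx += 1

subprocess.run([
    "ffmpeg", "-framerate", "50", "-i", str(merged / "out_%05d.png"),
    "-c:v", "libx264", "-pix_fmt", "yuv420p", "final_50fps.mp4",
], check=True)
```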
Configuration and Parameter Tuning
Multimodal Guidance Settings for Interpolation
The multimodal guidance parameters control how strongly the generation follows the text prompt and how temporally coherent the output is. For interpolation, start with these values:
- cfg_scale: 3.0 for the video guider (moderate prompt adherence without overriding the keyframe conditioning)
- stg_scale: 1.0 for spatio-temporal guidance (improves temporal coherence between interpolated and original frames)
- stg_blocks: [29] (perturbs the last transformer block for STG)
- rescale_scale: 0.7 (prevents over-saturation in interpolated frames)
Higher cfg_scale values increase prompt adherence but can reduce the naturalness of motion in interpolated frames. For pure frame interpolation where the prompt is secondary to the visual content, values between 2.0 and 3.0 work best.
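One way to keep these settings together is a plain configuration dict; how the keys map onto CLI flags or pipeline arguments depends on your LTX-2 build, so treat this as a sketch rather than a literal API:

```python
# The keys mirror the parameter names above; mapping them onto CLI flags
# or pipeline kwargs depends on your LTX-2 build.
INTERPOLATION_GUIDANCE = {
    "cfg_scale": 3.0,      # moderate prompt adherence
    "stg_scale": 1.0,      # spatio-temporal guidance strength
    "stg_blocks": [29],    # perturb the last transformer block
    "rescale_scale": 0.7,  # tame over-saturation
}
# Drop cfg_scale toward 2.0 when the keyframes should dominate the prompt.
```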
Frame Count Constraints
The Video VAE requires frame counts that satisfy (F-1) % 8 == 0. When planning interpolation segments, ensure each segment uses a valid frame count. Valid examples: 9, 17, 25, 33, 41, 49, 57, 65, 73, 81, 89, 97. Segments that do not meet this constraint will fail at the encoding stage.
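A small fail-fast check (illustrative, not part of the LTX-2 codebase) can catch invalid segment lengths before you pay for a generation run:

```python
# Reject invalid segment lengths before spending compute on generation.
def check_segment(num_frames: int) -> None:
    if (num_frames - 1) % 8 != 0:
        below = num_frames - ((num_frames - 1) % 8)
        raise ValueError(
            f"{num_frames} frames violates (F - 1) % 8 == 0; "
            f"nearest valid counts are {below} and {below + 8}"
        )

check_segment(25)  # ok
check_segment(30)  # raises: nearest valid counts are 25 and 33
```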
When Diffusion-Based Interpolation Wins
Optical flow methods like RIFE are faster for simple scenes with predictable motion. But diffusion-based interpolation through LTX-2's pipelines produces better results in three specific scenarios: complex camera motion (dolly, jib, orbital) where optical flow struggles with parallax, scenes with significant occlusion and disocclusion as objects move, and scenes requiring temporal consistency across style-transferred or generated content. The model's understanding of 3D scene structure through its 3D RoPE positional encoding gives it an advantage over purely 2D pixel-based methods.
Practical Tips
- Keep the prompt consistent between base generation and interpolation. Mismatched prompts cause visual discontinuity between original and interpolated frames
- Use FP8 quantization to reduce VRAM when running interpolation. The --quantization fp8-cast flag enables a lower memory footprint without additional dependencies. Set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True before running with FP8. The full command pattern: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python -m ltx_pipelines.keyframe_interpolation ... --quantization fp8-cast
- Start with short segments (17 or 25 frames) to test quality before committing to full-length interpolation
- Assess quality on motion-heavy sections first. Smooth static scenes will look good regardless. Fast motion and camera movement are where interpolation quality matters most
Frame interpolation bridges the gap between what AI video models generate natively and what delivery platforms demand. By combining LTX-2's open-source pipelines with a targeted interpolation workflow, you can produce 60fps output without the VRAM cost of generating every frame from scratch.
