What is a rate limit?
An API without limits is an invitation to abuse. Rate limits are how providers keep their infrastructure stable, ensure fair access across all users, and protect against runaway costs from bugs or attacks.
Definition
A rate limit is a constraint on how many API requests a client can make within a defined time window. Requests that exceed the limit receive an error response (typically HTTP 429 Too Many Requests) rather than being processed.
Rate limits are enforced at the API key level, the account level, or both, and they apply across dimensions including total requests per minute, total compute consumed per hour, and concurrent active jobs.
Why rate limits exist
Infrastructure stability: A single client making thousands of requests per second can saturate server resources, degrading performance for all other clients. Rate limits distribute load.
Cost fairness: Compute-intensive operations like video generation consume significant resources per request. Limits ensure that one client cannot consume a disproportionate share of shared infrastructure.
Abuse prevention: Rate limits contain the blast radius of compromised API keys, buggy client code that sends duplicate requests, and deliberate abuse.
Billing alignment: Limits are often set to align with billing tiers, with higher limits available at higher price points.
Types of rate limits
Requests per minute (RPM): The number of API calls allowed in a 60-second rolling window. Common for lightweight endpoints like status checks.
Tokens or compute per minute (TPM/CPM): For compute-intensive generation, providers often limit based on the amount of compute consumed rather than the raw number of requests. A 30-second 4K video generation consumes far more compute than a 5-second 720p generation, even though both are a single request.
Concurrent jobs: The number of generation jobs that can be in progress simultaneously. Relevant for pipeline architectures that submit many jobs at once.
Daily or monthly quotas: Hard caps on total usage within a billing period, separate from per-minute limits.
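An RPM-style limit can also be enforced client-side so you never send a request you know will be rejected. A minimal sketch of a rolling-window limiter (the class name and window size are illustrative, not part of any provider SDK):

```python
import time
from collections import deque

class RollingWindowLimiter:
    """Client-side tracker for an RPM-style quota (illustrative sketch)."""

    def __init__(self, max_requests: int, window_seconds: float = 60.0):
        self.max_requests = max_requests
        self.window = window_seconds
        self.timestamps = deque()  # send times of recent requests

    def acquire(self) -> float:
        """Return 0.0 if a request may be sent now, else seconds to wait."""
        now = time.monotonic()
        # Drop send times that have aged out of the rolling window.
        while self.timestamps and now - self.timestamps[0] >= self.window:
            self.timestamps.popleft()
        if len(self.timestamps) < self.max_requests:
            self.timestamps.append(now)
            return 0.0
        # A slot frees up when the oldest request ages out of the window.
        return self.window - (now - self.timestamps[0])
```

Calling `acquire()` before each request and sleeping for the returned duration keeps the client inside the window instead of discovering the limit via 429s.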
Handling rate limits in production
Implement exponential backoff with jitter: When a 429 is received, wait before retrying. Use exponential backoff (doubling wait time on each retry) with added randomness (jitter) to prevent synchronized retry storms from multiple clients.
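The backoff-with-jitter pattern above can be sketched as follows; `RateLimitError` is a hypothetical exception a client would raise on HTTP 429, and the delay parameters are illustrative defaults:

```python
import random
import time

class RateLimitError(Exception):
    """Hypothetical exception raised when the API returns HTTP 429."""

def retry_with_backoff(call, max_retries=5, base_delay=1.0, max_delay=60.0):
    """Retry `call` with full-jitter exponential backoff on rate-limit errors."""
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # retries exhausted; surface the error to the caller
            # The cap doubles on each attempt; the random draw (jitter)
            # desynchronizes retries across many clients.
            cap = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, cap))
```

Full jitter (sleeping a uniform random time up to the exponential cap) spreads retries more evenly than adding a small random offset to a fixed schedule.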
Queue requests: Rather than sending all requests simultaneously and hitting limits, use a job queue that controls submission rate and stays within the allowed window.
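A minimal sketch of paced submission from a queue, assuming a caller-supplied `submit` function (the function and pacing values are illustrative):

```python
import queue
import time

def run_paced_queue(jobs, submit, max_per_second: float):
    """Drain a job queue at a fixed pace instead of submitting all at once."""
    q = queue.Queue()
    for job in jobs:
        q.put(job)
    interval = 1.0 / max_per_second  # minimum gap between submissions
    results = []
    while not q.empty():
        results.append(submit(q.get()))
        time.sleep(interval)  # pace the next submission
    return results
```

Production queues usually add persistence, retries, and concurrency caps on top of this, but the core idea is the same: the queue, not the caller, decides when each request goes out.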
Monitor usage proactively: Track consumption against limits before hitting them. The LTX-2 API exposes usage data that you can monitor to stay within bounds.
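Many APIs report remaining quota in response headers; a sketch of proactive throttling based on them (the header names here are common conventions, not confirmed LTX-2 header names; check the provider's docs):

```python
def remaining_budget(headers: dict):
    """Fraction of the window's quota still available, or None if unknown.

    Header names are assumptions; substitute the provider's actual names.
    """
    remaining = int(headers.get("X-RateLimit-Remaining", -1))
    limit = int(headers.get("X-RateLimit-Limit", -1))
    if remaining < 0 or limit <= 0:
        return None  # headers absent; fall back to local accounting
    return remaining / limit

def should_throttle(headers: dict, threshold: float = 0.1) -> bool:
    """Signal a slowdown when under `threshold` of the budget remains."""
    budget = remaining_budget(headers)
    return budget is not None and budget < threshold
```

Checking the budget after every response lets a client slow down before the first 429 arrives, rather than reacting to it.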
Request limit increases: For production workloads that regularly approach limits, most providers offer higher tiers or custom enterprise agreements.
LTX-2 rate limits
Rate limits for the LTX-2 API are defined per plan, with higher limits available at higher pricing tiers. Current limits and pricing are on the API pricing page. For enterprise workloads requiring dedicated capacity without shared rate limits, the enterprise deployment option provides isolated infrastructure.