Fireworks AI · Rate Limits
Fireworks AI publishes high serverless rate limits that scale with paid spend, expressed primarily as RPM (requests per minute) per model, with separate limits for batch jobs and fine-tuning. On-demand dedicated deployments are bounded by provisioned GPU capacity rather than shared serverless limits.
Throttle response: HTTP 429 (Too Many Requests)
Limits
| Limit | Scope | Value | Notes |
|---|---|---|---|
| Requests Per Minute (RPM) | account | see provider documentation | Per-model RPM; varies by tier and model. |
| Concurrent Requests | account | see provider documentation | Concurrency cap per model on serverless. |
| Tokens Per Minute (TPM) | account | see provider documentation | Per-model TPM; varies by tier and model. |
| Batch Inference | account | separate from sync limits | Batch runs at a 50% discount and does not directly consume sync RPM/TPM. |
| Fine-Tuning Jobs | account | see provider documentation | Concurrency cap on parallel fine-tuning jobs. |
| On-Demand Deployments | deployment | bounded by provisioned GPU capacity | Throughput determined by GPU sizing and autoscaling configuration. |
Policies
Tiered Limits
Limits scale up automatically with paid (postpaid) usage and can be raised further under Enterprise agreements.
Backoff Strategy
Clients should implement exponential backoff with jitter and honor Retry-After.