Fireworks AI · Rate Limits
Fireworks AI publishes high serverless rate limits that scale with paid spend, expressed primarily as RPM (requests per minute) per model, with separate limits for batch jobs and fine-tuning. On-demand dedicated deployments are bounded by provisioned GPU capacity rather than shared serverless limits.
Throttle response: HTTP 429 (Too Many Requests)
Limits
| Limit | Scope | Value | Notes |
|---|---|---|---|
| Requests Per Minute (RPM) | account | see provider documentation | Per-model RPM; varies by tier and model. |
| Concurrent Requests | account | see provider documentation | Concurrency cap per model on serverless. |
| Tokens Per Minute (TPM) | account | see provider documentation | Per-model TPM; varies by tier and model. |
| Batch Inference | account | separate from sync limits | Batch runs at a 50% discount and does not directly consume sync RPM/TPM. |
| Fine-Tuning Jobs | account | see provider documentation | Concurrency cap on parallel fine-tuning jobs. |
| On-Demand Deployments | deployment | bounded by provisioned GPU capacity | Throughput determined by GPU sizing and autoscaling configuration. |
Policies
Tiered Limits
Limits scale up automatically with paid (postpaid) usage and can be raised further under Enterprise agreements.
Backoff Strategy
Clients should implement exponential backoff with jitter and honor Retry-After.