Together AI · Rate Limits
Together AI enforces per-account rate limits on serverless inference that vary by model and account tier (Build, Scale, or Enterprise, with limits rising as account spend or credit balance grows). Limits include requests per minute (RPM) and tokens per minute (TPM) per model. Specific per-model values are not reconciled in this artifact; see the Together console for the active limits on your account.
Limits: 5
Throttle response: HTTP 429
Tags: AI · LLM · Inference · Open Source · Fine-tuning · Rate Limiting · Quotas · Throttling
Limits
Limit | Scope | Value | Notes
Requests Per Minute (RPM) | account | see provider documentation | Per-model RPM; varies by tier and model. Pending reconciliation.
Tokens Per Minute (TPM) | account | see provider documentation | Per-model TPM; varies by tier and model. Pending reconciliation.
Concurrent Fine-Tuning Jobs | account | see provider documentation | Concurrency cap on parallel fine-tuning jobs.
Batch Job Size / Concurrency | account | see provider documentation | Batch jobs are queued and do not consume serverless RPM/TPM directly.
Dedicated Endpoints | endpoint | bounded by provisioned GPU capacity | Throughput is determined by the dedicated hardware sizing.
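
Because the binding serverless limits above are per-minute request and token budgets, a client can smooth its own traffic before it ever hits a 429. Below is a minimal client-side sketch of a sliding-window budgeter; the 600 RPM and 180,000 TPM figures are placeholders for illustration, not Together's actual limits, so substitute the values shown in your console.

```python
import time
from collections import deque

class MinuteBudget:
    """Sliding 60-second window over one per-minute budget (RPM or TPM)."""

    def __init__(self, limit_per_minute):
        self.limit = limit_per_minute
        self.events = deque()  # (monotonic timestamp, cost) pairs

    def acquire(self, cost=1):
        """Block until `cost` units fit in the current window, then record them."""
        if cost > self.limit:
            raise ValueError("single request exceeds the per-minute budget")
        while True:
            now = time.monotonic()
            # Drop events that have aged out of the 60-second window.
            while self.events and now - self.events[0][0] >= 60:
                self.events.popleft()
            if sum(c for _, c in self.events) + cost <= self.limit:
                self.events.append((now, cost))
                return
            # Wait for the oldest event to age out, then re-check.
            time.sleep(60 - (now - self.events[0][0]) + 0.01)

# Placeholder budgets for illustration only; use your console's values.
rpm_budget = MinuteBudget(600)      # requests per minute
tpm_budget = MinuteBudget(180_000)  # tokens per minute

def reserve(estimated_tokens):
    """Call before each request with an estimate of prompt + completion tokens."""
    rpm_budget.acquire(1)
    tpm_budget.acquire(estimated_tokens)
```

Estimating tokens conservatively (for example, prompt length plus the request's max output tokens) keeps the local budgeter safely under the real TPM limit.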
Policies
Tiered Limits
Limits scale up automatically as account spend and credit balance grow, with higher tiers available through Enterprise agreements.
Backoff Strategy
Clients should implement exponential backoff with jitter and honor any Retry-After header.
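
A minimal retry sketch in Python, assuming the OpenAI-compatible chat completions route at api.together.xyz/v1 (verify the path and payload shape against Together's documentation). It retries only on HTTP 429, prefers the server's Retry-After value when present, and otherwise applies capped exponential backoff with full jitter.

```python
import random
import time

import requests

# Assumed endpoint for illustration; confirm against Together's docs.
URL = "https://api.together.xyz/v1/chat/completions"

def post_with_backoff(payload, api_key, max_retries=5):
    """POST with exponential backoff plus jitter, honoring Retry-After on 429."""
    headers = {"Authorization": f"Bearer {api_key}"}
    for attempt in range(max_retries):
        resp = requests.post(URL, json=payload, headers=headers, timeout=60)
        if resp.status_code != 429:
            resp.raise_for_status()  # surface non-throttle errors immediately
            return resp.json()
        retry_after = resp.headers.get("Retry-After")
        if retry_after is not None:
            # Assumes Retry-After is given in seconds (it may also be a date).
            delay = float(retry_after)
        else:
            delay = min(2 ** attempt, 30)  # capped exponential backoff
        # Full jitter: sleep a random fraction of the computed delay.
        time.sleep(random.uniform(0, delay))
    raise RuntimeError(f"Rate limited after {max_retries} retries")

# Example usage (model id is a placeholder):
# result = post_with_backoff(
#     {"model": "<model-id>", "messages": [{"role": "user", "content": "hi"}]},
#     api_key="...",
# )
```

Full jitter spreads retries from many concurrent clients across the window, which avoids synchronized retry bursts against the same per-model limit.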