Together AI · Rate Limits
Together AI enforces per-account rate limits on serverless inference that vary by model and account tier (Build, Scale, or Enterprise, with limits rising as account spend or credit balance grows). Limits include requests per minute (RPM) and tokens per minute (TPM) per model. Specific per-model values are not reconciled in this artifact; see the Together console for the active limits on your account.
Limits: 5
Throttle response: HTTP 429
Tags: AI · LLM · Inference · Open Source · Fine-tuning · Rate Limiting · Quotas · Throttling
Limits
Limit | Scope | Value | Notes
Requests Per Minute (RPM) | account | see provider documentation | Per-model RPM; varies by tier and model. Pending reconciliation.
Tokens Per Minute (TPM) | account | see provider documentation | Per-model TPM; varies by tier and model. Pending reconciliation.
Concurrent Fine-Tuning Jobs | account | see provider documentation | Concurrency cap on parallel fine-tuning jobs.
Batch Job Size / Concurrency | account | see provider documentation | Batch jobs are queued and do not consume serverless RPM/TPM directly.
Dedicated Endpoints | endpoint | bounded by provisioned GPU capacity | Throughput is determined by the dedicated hardware sizing.
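
Because the binding serverless limits above are per-minute request and token budgets, a client can smooth its own traffic before it ever hits a 429. Below is a minimal client-side sketch of a sliding-window budgeter; the 600 RPM and 180,000 TPM figures are placeholders for illustration, not Together's actual limits, so substitute the values shown in your console.

```python
import time
from collections import deque

class MinuteBudget:
    """Sliding 60-second window over one per-minute budget (RPM or TPM)."""

    def __init__(self, limit_per_minute):
        self.limit = limit_per_minute
        self.events = deque()  # (monotonic timestamp, cost) pairs

    def acquire(self, cost=1):
        """Block until `cost` units fit in the current window, then record them."""
        if cost > self.limit:
            raise ValueError("single request exceeds the per-minute budget")
        while True:
            now = time.monotonic()
            # Drop events that have aged out of the 60-second window.
            while self.events and now - self.events[0][0] >= 60:
                self.events.popleft()
            if sum(c for _, c in self.events) + cost <= self.limit:
                self.events.append((now, cost))
                return
            # Wait for the oldest event to age out, then re-check.
            time.sleep(60 - (now - self.events[0][0]) + 0.01)

# Placeholder budgets for illustration only; use your console's values.
rpm_budget = MinuteBudget(600)      # requests per minute
tpm_budget = MinuteBudget(180_000)  # tokens per minute

def reserve(estimated_tokens):
    """Call before each request with an estimate of prompt + completion tokens."""
    rpm_budget.acquire(1)
    tpm_budget.acquire(estimated_tokens)
```

Estimating tokens conservatively (for example, prompt length plus the request's max output tokens) keeps the local budgeter safely under the real TPM limit.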
Policies
Tiered Limits
Limits scale up automatically as account spend and credit balance grow, with higher tiers available through Enterprise agreements.
Backoff Strategy
Clients should implement exponential backoff with jitter and honor any Retry-After header.
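
A minimal retry sketch in Python, assuming the OpenAI-compatible chat completions route at api.together.xyz/v1 (verify the path and payload shape against Together's documentation). It retries only on HTTP 429, prefers the server's Retry-After value when present, and otherwise applies capped exponential backoff with full jitter.

```python
import random
import time

import requests

# Assumed endpoint for illustration; confirm against Together's docs.
URL = "https://api.together.xyz/v1/chat/completions"

def post_with_backoff(payload, api_key, max_retries=5):
    """POST with exponential backoff plus jitter, honoring Retry-After on 429."""
    headers = {"Authorization": f"Bearer {api_key}"}
    for attempt in range(max_retries):
        resp = requests.post(URL, json=payload, headers=headers, timeout=60)
        if resp.status_code != 429:
            resp.raise_for_status()  # surface non-throttle errors immediately
            return resp.json()
        retry_after = resp.headers.get("Retry-After")
        if retry_after is not None:
            # Assumes Retry-After is given in seconds (it may also be a date).
            delay = float(retry_after)
        else:
            delay = min(2 ** attempt, 30)  # capped exponential backoff
        # Full jitter: sleep a random fraction of the computed delay.
        time.sleep(random.uniform(0, delay))
    raise RuntimeError(f"Rate limited after {max_retries} retries")

# Example usage (model id is a placeholder):
# result = post_with_backoff(
#     {"model": "<model-id>", "messages": [{"role": "user", "content": "hi"}]},
#     api_key="...",
# )
```

Full jitter spreads retries from many concurrent clients across the window, which avoids synchronized retry bursts against the same per-model limit.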