DeepSeek · Rate Limits

DeepSeek does not publish fixed numerical rate limits. Instead, the API dynamically caps each caller's concurrency based on current server load and returns HTTP 429 once that ceiling is reached. No hard requests-per-minute or tokens-per-minute cap appears in the public docs, so sustained throughput is best-effort and varies in real time. Inference connections that have not begun streaming within ten minutes of being accepted are closed by the server.


Limits

| Limit | Scope | Metric | Value | Notes |
|---|---|---|---|---|
| Dynamic concurrency cap | api-key | concurrent_requests | dynamic (varies with current server load) | DeepSeek throttles by concurrent in-flight requests rather than RPS; the exact ceiling is not published and shifts with load. |
| Inference connection timeout | connection | seconds_to_first_token (seconds) | 600 | If inference does not start within 10 minutes after the connection is accepted, the server closes it. Affects very long queues during load spikes. |

Policies

Backoff Strategy
On HTTP 429, callers should back off and retry with exponential delay and jitter. DeepSeek does not document a Retry-After header, so clients must use their own retry budget.
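A minimal retry loop following this policy might look like the sketch below. The `send_request` callable, the attempt count, and the base/cap delay values are illustrative assumptions, not DeepSeek-documented parameters; full jitter spreads retries so that throttled callers do not retry in lockstep.

```python
import random
import time

def backoff_delays(max_retries=5, base=1.0, cap=60.0):
    """Yield exponentially growing delays with full jitter.

    base/cap are self-imposed defaults; DeepSeek documents no
    Retry-After header, so the retry budget is the client's own.
    """
    for attempt in range(max_retries):
        yield random.uniform(0, min(cap, base * 2 ** attempt))

def call_with_retry(send_request):
    """send_request() is a hypothetical callable returning a response
    object with a .status_code attribute."""
    for delay in backoff_delays():
        resp = send_request()
        if resp.status_code != 429:
            return resp
        time.sleep(delay)  # back off before re-entering the queue
    raise RuntimeError("retry budget exhausted after repeated 429s")
```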
Keep-Alive Signaling
While a request is queued the server returns empty lines (non-streaming) or SSE keep-alive comments (streaming). Clients must tolerate these placeholder frames.
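Skipping those placeholder frames can be sketched as below. Per the SSE format, lines beginning with `:` are comments; the simplified line-based parsing here is an assumption and ignores multi-line `data:` events.

```python
def iter_sse_data(lines):
    """Yield data payloads from an SSE line stream, skipping
    keep-alive comment lines (': ...') and empty heartbeat lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith(":"):
            continue  # keep-alive frame emitted while the request is queued
        if line.startswith("data:"):
            yield line[len("data:"):].strip()
```

A client that treats these comment frames as malformed output would fail precisely during load spikes, when they are most likely to appear.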
Connection Lifetime
Drop and re-establish connections that have been idle for more than 10 minutes without any token output, since the server will close them.
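One way to track that window client-side is a small idle watchdog keyed to a monotonic clock; the class name and API below are illustrative, not part of any DeepSeek SDK.

```python
import time

class IdleWatchdog:
    """Track time since the last token and flag connections that have
    been idle past the server's 600 s close window."""

    def __init__(self, limit_s=600):
        self.limit_s = limit_s
        self.last_activity = time.monotonic()

    def touch(self):
        """Call whenever any bytes (tokens or keep-alive frames) arrive."""
        self.last_activity = time.monotonic()

    def expired(self):
        """True once the connection should be dropped and re-established."""
        return time.monotonic() - self.last_activity > self.limit_s
```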
Concurrency Sizing
Because limits are dynamic, callers should target a configurable concurrency budget per API key and reduce parallelism when 429 responses spike.
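An additive-increase/multiplicative-decrease (AIMD) budget is one simple way to implement this; the starting limit and bounds below are illustrative assumptions, since DeepSeek publishes no fixed ceiling.

```python
class ConcurrencyBudget:
    """AIMD concurrency budget per API key: grow slowly while requests
    succeed, halve when 429 responses spike."""

    def __init__(self, start=8, floor=1, ceiling=64):
        self.limit = start      # current target for in-flight requests
        self.floor = floor      # never drop parallelism below this
        self.ceiling = ceiling  # never probe above this

    def on_success(self):
        # Additive increase: probe for headroom one slot at a time.
        self.limit = min(self.ceiling, self.limit + 1)

    def on_throttle(self):
        # Multiplicative decrease on HTTP 429: back off sharply.
        self.limit = max(self.floor, self.limit // 2)
```

A dispatcher would cap its worker pool or semaphore at `budget.limit`, calling `on_success` / `on_throttle` as responses come back, which converges toward whatever ceiling the server is currently enforcing.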
