DeepSeek · Rate Limits

DeepSeek does not publish fixed numerical rate limits. Instead, the API dynamically caps each caller's concurrency based on current server load and returns HTTP 429 once that ceiling is reached. No hard requests-per-minute or tokens-per-minute cap appears in the public docs, so sustained throughput is best-effort and varies in real time. Inference connections that have not begun streaming within ten minutes of being accepted are closed by the server.


Limits

| Limit | Scope | Metric | Value | Notes |
|---|---|---|---|---|
| Dynamic concurrency cap | api-key | concurrent_requests | dynamic (varies with current server load) | DeepSeek throttles by concurrent in-flight requests rather than RPS; the exact ceiling is not published and shifts with load. |
| Inference connection timeout | connection | seconds_to_first_token (seconds) | 600 | If inference does not start within 10 minutes after the connection is accepted, the server closes it. Affects very long queues during load spikes. |

Policies

Backoff Strategy
On HTTP 429, callers should back off and retry with exponential delay and jitter. DeepSeek does not document a Retry-After header, so clients must use their own retry budget.
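A minimal retry loop following this policy might look like the sketch below. The `send_request` callable, the attempt count, and the base/cap delay values are illustrative assumptions, not DeepSeek-documented parameters; full jitter spreads retries so that throttled callers do not retry in lockstep.

```python
import random
import time

def backoff_delays(max_retries=5, base=1.0, cap=60.0):
    """Yield exponentially growing delays with full jitter.

    base/cap are self-imposed defaults; DeepSeek documents no
    Retry-After header, so the retry budget is the client's own.
    """
    for attempt in range(max_retries):
        yield random.uniform(0, min(cap, base * 2 ** attempt))

def call_with_retry(send_request):
    """send_request() is a hypothetical callable returning a response
    object with a .status_code attribute."""
    for delay in backoff_delays():
        resp = send_request()
        if resp.status_code != 429:
            return resp
        time.sleep(delay)  # back off before re-entering the queue
    raise RuntimeError("retry budget exhausted after repeated 429s")
```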
Keep-Alive Signaling
While a request is queued the server returns empty lines (non-streaming) or SSE keep-alive comments (streaming). Clients must tolerate these placeholder frames.
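Skipping those placeholder frames can be sketched as below. Per the SSE format, lines beginning with `:` are comments; the simplified line-based parsing here is an assumption and ignores multi-line `data:` events.

```python
def iter_sse_data(lines):
    """Yield data payloads from an SSE line stream, skipping
    keep-alive comment lines (': ...') and empty heartbeat lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith(":"):
            continue  # keep-alive frame emitted while the request is queued
        if line.startswith("data:"):
            yield line[len("data:"):].strip()
```

A client that treats these comment frames as malformed output would fail precisely during load spikes, when they are most likely to appear.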
Connection Lifetime
Drop and re-establish connections that have been idle for more than 10 minutes without any token output, since the server will close them.
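One way to track that window client-side is a small idle watchdog keyed to a monotonic clock; the class name and API below are illustrative, not part of any DeepSeek SDK.

```python
import time

class IdleWatchdog:
    """Track time since the last token and flag connections that have
    been idle past the server's 600 s close window."""

    def __init__(self, limit_s=600):
        self.limit_s = limit_s
        self.last_activity = time.monotonic()

    def touch(self):
        """Call whenever any bytes (tokens or keep-alive frames) arrive."""
        self.last_activity = time.monotonic()

    def expired(self):
        """True once the connection should be dropped and re-established."""
        return time.monotonic() - self.last_activity > self.limit_s
```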
Concurrency Sizing
Because limits are dynamic, callers should target a configurable concurrency budget per API key and reduce parallelism when 429 responses spike.
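An additive-increase/multiplicative-decrease (AIMD) budget is one simple way to implement this; the starting limit and bounds below are illustrative assumptions, since DeepSeek publishes no fixed ceiling.

```python
class ConcurrencyBudget:
    """AIMD concurrency budget per API key: grow slowly while requests
    succeed, halve when 429 responses spike."""

    def __init__(self, start=8, floor=1, ceiling=64):
        self.limit = start      # current target for in-flight requests
        self.floor = floor      # never drop parallelism below this
        self.ceiling = ceiling  # never probe above this

    def on_success(self):
        # Additive increase: probe for headroom one slot at a time.
        self.limit = min(self.ceiling, self.limit + 1)

    def on_throttle(self):
        # Multiplicative decrease on HTTP 429: back off sharply.
        self.limit = max(self.floor, self.limit // 2)
```

A dispatcher would cap its worker pool or semaphore at `budget.limit`, calling `on_success` / `on_throttle` as responses come back, which converges toward whatever ceiling the server is currently enforcing.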
