vLLM · Rate Limits
vLLM does not impose project-level API rate limits. Throughput is bounded by GPU memory, model size, batch settings (--max-num-seqs, --max-model-len), and tensor/pipeline parallelism. Optionally, set --api-key to require auth and put a reverse proxy (Nginx, Envoy) in front to enforce per-client throttles.
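A minimal launch sketch using the vllm serve entrypoint follows; the model name, key variable, and flag values are placeholders, not tuned recommendations.

```bash
# Placeholder model name, key, and sizes; tune to the deployment's hardware.
# --api-key requires a bearer token; --max-num-seqs caps concurrent sequences;
# --max-model-len caps context length; --tensor-parallel-size shards across GPUs.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --api-key "$VLLM_API_KEY" \
  --max-num-seqs 128 \
  --max-model-len 8192 \
  --tensor-parallel-size 2
```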
Limits
Project-level: n/a (no built-in cap). Throughput is bounded by GPU and batching.
Per-deployment (operator-set): configured via --max-num-seqs. Operator tunes max concurrency at server start.
Policies
Reverse-Proxy Throttling
Front vLLM with Nginx/Envoy to enforce per-API-key or per-IP rate limits; throttled requests receive HTTP 429.
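A minimal Nginx sketch of this policy, assuming the proxy and vLLM share a host and the upstream listens on port 8000; the zone name, rate, and burst values are illustrative, and Envoy or another proxy can enforce the same limits.

```nginx
# Illustrative per-API-key throttling in front of a local vLLM server.
# limit_req_zone must sit in the http {} block; clients without an
# Authorization header share a single bucket.
limit_req_zone $http_authorization zone=per_key:10m rate=5r/s;

server {
    listen 80;

    location /v1/ {
        limit_req zone=per_key burst=20 nodelay;
        limit_req_status 429;              # reject excess requests with HTTP 429
        proxy_pass http://127.0.0.1:8000;  # vLLM OpenAI-compatible server
    }
}
```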
Batch Tuning
Tune --max-num-seqs, --max-model-len, and tensor parallelism for throughput vs. latency trade-offs.
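With the server and proxy up, a quick curl against the OpenAI-compatible endpoint shows auth and throttling in action; the host, key, and model name below are placeholders, and clients that see HTTP 429 should back off before retrying.

```bash
# Prints the HTTP status only; expect 200 normally, 401 without a valid key,
# and 429 once the proxy's per-key limit is exceeded. Values are placeholders.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost/v1/chat/completions \
  -H "Authorization: Bearer $VLLM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct",
       "messages": [{"role": "user", "content": "ping"}],
       "max_tokens": 8}'
```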