vLLM · Rate Limits
vLLM does not impose project-level API rate limits. Throughput is bounded by GPU memory, model size, batch settings (--max-num-seqs, --max-model-len), and tensor/pipeline parallelism. Optionally, set --api-key to require auth and put a reverse proxy (Nginx, Envoy) in front to enforce per-client throttles.
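A minimal launch sketch using the vllm serve entrypoint follows; the model name, key variable, and flag values are placeholders, not tuned recommendations.

```bash
# Placeholder model name, key, and sizes; tune to the deployment's hardware.
# --api-key requires a bearer token; --max-num-seqs caps concurrent sequences;
# --max-model-len caps context length; --tensor-parallel-size shards across GPUs.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --api-key "$VLLM_API_KEY" \
  --max-num-seqs 128 \
  --max-model-len 8192 \
  --tensor-parallel-size 2
```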
Limits
Project-level: n/a (no built-in cap). Throughput is bounded by GPU and batching.
Per-deployment (operator-set): configured via --max-num-seqs. Operator tunes max concurrency at server start.
Policies
Reverse-Proxy Throttling
Front vLLM with Nginx/Envoy to enforce per-API-key or per-IP rate limits; throttled requests receive HTTP 429.
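A minimal Nginx sketch of this policy, assuming the proxy and vLLM share a host and the upstream listens on port 8000; the zone name, rate, and burst values are illustrative, and Envoy or another proxy can enforce the same limits.

```nginx
# Illustrative per-API-key throttling in front of a local vLLM server.
# limit_req_zone must sit in the http {} block; clients without an
# Authorization header share a single bucket.
limit_req_zone $http_authorization zone=per_key:10m rate=5r/s;

server {
    listen 80;

    location /v1/ {
        limit_req zone=per_key burst=20 nodelay;
        limit_req_status 429;              # reject excess requests with HTTP 429
        proxy_pass http://127.0.0.1:8000;  # vLLM OpenAI-compatible server
    }
}
```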
Batch Tuning
Tune --max-num-seqs, --max-model-len, and tensor parallelism for throughput vs. latency trade-offs.
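With the server and proxy up, a quick curl against the OpenAI-compatible endpoint shows auth and throttling in action; the host, key, and model name below are placeholders, and clients that see HTTP 429 should back off before retrying.

```bash
# Prints the HTTP status only; expect 200 normally, 401 without a valid key,
# and 429 once the proxy's per-key limit is exceeded. Values are placeholders.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost/v1/chat/completions \
  -H "Authorization: Bearer $VLLM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct",
       "messages": [{"role": "user", "content": "ping"}],
       "max_tokens": 8}'
```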