Ollama Rate Limits
The local Ollama server (http://localhost:11434) has no rate limits and no authentication. Ollama Cloud enforces tier-based concurrency caps (Free = 1, Pro = 3, Max = 10 concurrent cloud models) and weekly GPU-time quotas rather than per-second request ceilings; quotas reset on 5-hour session and 7-day weekly cycles. Specific TPS/RPM ceilings are not publicly documented.
Limits listed: 5 · Throttle status: HTTP 429
Tags: Artificial Intelligence · Large Language Models · Models · Rate Limiting
Limits
Local server (localhost): unlimited (bounded by local hardware). No auth, no rate limiting on http://localhost:11434.
Cloud — concurrent models, Free tier (per account): 1. The Free tier permits 1 cloud model at a time.
Cloud — concurrent models, Pro tier (per account): 3. The Pro tier permits 3 cloud models at a time.
Cloud — concurrent models, Max tier (per account): 10. The Max tier permits 10 cloud models at a time.
Cloud — weekly GPU usage (per account): tier-dependent (Free baseline, Pro = 50x Free, Max = 5x Pro). Quotas reset on 5-hour session and 7-day weekly windows.
Policies
Authentication
The local server requires no auth. Cloud requires an ollama.com account; pass the API key via the OLLAMA_API_KEY environment variable or an Authorization: Bearer header.
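A minimal sketch of building the cloud auth header from the environment variable (the key value here is a placeholder, and https://ollama.com as the cloud host is taken from Ollama's cloud documentation):

```python
import os

# Placeholder key so the snippet runs standalone; in practice the variable
# is set in your shell, e.g. export OLLAMA_API_KEY=...
os.environ.setdefault("OLLAMA_API_KEY", "example-key")

api_key = os.environ["OLLAMA_API_KEY"]
headers = {"Authorization": f"Bearer {api_key}"}  # required for cloud requests
host = "https://ollama.com"  # cloud endpoint; http://localhost:11434 needs no auth
```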
Backoff
On a 429 from the cloud, back off and retry; honor the Retry-After header if present.
Privacy
Ollama states that cloud prompt/response data is never logged or trained on; cloud inference runs primarily in US data centers, with EU/Singapore routing used for overflow capacity.
Hybrid scheduling
Cloud models can be invoked transparently from a local Ollama instance via signed-in cloud access; each request is routed to localhost or to ollama.com based on the model name.
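The per-request routing rule can be sketched as a pure function of the model name. The "-cloud" suffix convention below is an assumption for illustration (Ollama's cloud model names carry a cloud tag); the actual rule is whatever the signed-in daemon implements:

```python
LOCAL = "http://localhost:11434"
CLOUD = "https://ollama.com"

def route(model: str) -> str:
    # Assumed convention: models tagged "-cloud" go to ollama.com,
    # everything else runs on the local server.
    return CLOUD if model.endswith("-cloud") else LOCAL
```

So `route("llama3.2")` stays local while `route("gpt-oss:120b-cloud")` goes to the cloud, with no change to how the client issues the request.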