Anyscale Rate Limits

Anyscale provides a control-plane API for managing Ray compute. Throughput limits come primarily from the underlying cloud quotas: per-region instance and GPU quotas in the customer's AWS or GCP account, or in Anyscale's hosted account. Control-plane API call rates are not publicly documented and are pending reconciliation. Service-level rate limits on Ray Serve services are set by user code and autoscaling configuration.

4 limits. Throttle status: HTTP 429.
Tags: AI, Distributed Computing, Ray, ML Platform, Inference, Rate Limiting, Quotas, Throttling

Limits

Control-Plane API
Scope: organization
Unit: requests
Limit: see provider documentation
Notes: Pending reconciliation.

Concurrent Workspaces / Jobs / Services
Scope: organization
Unit: concurrent
Limit: bounded by cloud quotas and org limits
Notes: Practical concurrency is bounded by AWS / GCP instance and GPU quotas.

Cluster Node Counts
Scope: cluster
Unit: nodes
Limit: bounded by autoscaling and cloud quotas
Notes: Configured per compute config and bounded by cloud GPU quotas.

Service Endpoint
Scope: service
Unit: requests
Limit: user-configured
Notes: Throughput on deployed Ray Serve services is controlled by application autoscaling.
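
The interaction between an autoscaling ceiling and a cloud GPU quota can be illustrated with a little arithmetic. The sketch below is not an Anyscale API: the function name and all parameters are hypothetical, and it assumes the quota is expressed in GPUs. Some providers count quota in other units, in which case convert to GPUs first.

```python
def effective_max_nodes(autoscaling_max_nodes: int,
                        gpus_per_node: int,
                        gpu_quota: int,
                        gpus_in_use: int = 0) -> int:
    """Return the node count a cluster can actually reach.

    Hypothetical helper: the ceiling is whichever is smaller, the
    autoscaling maximum or the number of nodes the remaining GPU
    quota can pay for.
    """
    if gpus_per_node <= 0:
        raise ValueError("gpus_per_node must be positive")
    # GPUs still available under the regional quota (never negative).
    available_gpus = max(0, gpu_quota - gpus_in_use)
    # Whole nodes that fit in the remaining quota.
    quota_bound = available_gpus // gpus_per_node
    return min(autoscaling_max_nodes, quota_bound)


# Autoscaling allows 16 nodes of 8 GPUs each, but the quota is 64 GPUs
# and 16 are already in use: only 48 // 8 = 6 nodes can be provisioned.
print(effective_max_nodes(16, 8, gpu_quota=64, gpus_in_use=16))
```

When the quota bound is the smaller of the two, raising the autoscaling maximum has no effect until the cloud quota is increased.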

Policies

Backoff Strategy
On a 429 response, clients should retry with exponential backoff and jitter, and honor the Retry-After header when it is present.
Cloud Quota Management
Request AWS / GCP quota increases ahead of large training or inference rollouts.
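
The backoff policy above could be sketched as follows. This is a minimal illustration, not Anyscale client code: `backoff_delay`, `call_with_retries`, the `send` callable, and the default base and cap values are all assumptions made for the example. It uses the "full jitter" variant and only handles the delta-seconds form of Retry-After.

```python
import random
import time
from typing import Callable, Optional, Tuple


def backoff_delay(attempt: int,
                  retry_after: Optional[str] = None,
                  base: float = 0.5,
                  cap: float = 30.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Seconds to wait before the next retry (attempt is 0-based)."""
    if retry_after is not None:
        try:
            # Server guidance takes priority over the computed delay.
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # HTTP-date form is not parsed here; fall back to backoff.
    # Full jitter: uniform in [0, min(cap, base * 2**attempt)).
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retries(send: Callable[[], Tuple[int, dict]],
                      max_attempts: int = 5,
                      sleep: Callable[[float], None] = time.sleep
                      ) -> Tuple[int, dict]:
    """Call `send` until it returns a non-429 status or attempts run out.

    `send` is a hypothetical callable returning (status_code, headers).
    Header lookup here is case-sensitive; real HTTP clients usually
    provide case-insensitive header mappings.
    """
    status, headers = send()
    for attempt in range(max_attempts - 1):
        if status != 429:
            break
        sleep(backoff_delay(attempt, headers.get("Retry-After")))
        status, headers = send()
    return status, headers
```

Passing `sleep` and `rng` as parameters keeps the retry logic testable without real waiting or randomness.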

Sources