Amazon SageMaker · Rate Limits


Amazon SageMaker exposes a control-plane API (CreateTrainingJob, CreateEndpoint, etc.) subject to standard AWS API throttling per account and region, plus a runtime InvokeEndpoint surface whose throughput scales with the instance count and instance type behind the endpoint. Endpoint-specific limits (concurrent invocations, payload size, timeout) are configurable. Service Quotas governs the maximum number and type of ML instances per account.


Limits

| Limit | Scope | Unit | Value | Notes |
|---|---|---|---|---|
| SageMaker control-plane API | account/region | varies | see Service Quotas console for SageMaker | Standard AWS API throttling envelope. |
| InvokeEndpoint (real-time) | endpoint | requests per second | scales with instance count and type | Default soft limit per endpoint; configure auto-scaling on the production variant. Payload up to 6 MB synchronous, 1 GB asynchronous. |
| InvokeEndpoint payload size | endpoint | bytes | 6291456 | 6 MB max synchronous payload; use AsynchronousInferenceConfig for larger payloads (up to 1 GB). |
| Synchronous invocation timeout | endpoint | seconds | 60 | Default 60 s; can be raised on async endpoints up to 1 hour. |
| ML instances per type per region | account/region | count | see Service Quotas console for SageMaker | Soft limits; raise via Service Quotas before training/deploying at scale. |
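The payload limits above imply a simple routing rule: payloads at or under the 6 MB (6291456-byte) synchronous cap can go through InvokeEndpoint, larger ones (up to 1 GB) must go through asynchronous inference. A minimal sketch of that check; the helper name is illustrative, not an SDK API:

```python
# Route a request by payload size, per the caps in the table above.
SYNC_MAX_BYTES = 6 * 1024 * 1024   # 6291456: InvokeEndpoint synchronous cap
ASYNC_MAX_BYTES = 1024 ** 3        # 1 GB: asynchronous inference cap

def choose_invocation_path(payload: bytes) -> str:
    """Return which SageMaker invocation API can accept this payload."""
    size = len(payload)
    if size <= SYNC_MAX_BYTES:
        return "InvokeEndpoint"        # real-time, inline body
    if size <= ASYNC_MAX_BYTES:
        return "InvokeEndpointAsync"   # payload staged in S3
    raise ValueError(f"payload of {size} bytes exceeds the 1 GB async cap")
```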

Policies

Backoff with jitter
AWS SDKs default to standard retry mode (truncated exponential backoff with jitter, max 20s, 3 attempts).
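The delay schedule behind standard retry mode can be sketched as truncated exponential backoff with full jitter, capped at 20 s. This is an illustrative model of the SDK behavior, not botocore's internal code; the function name and 1 s base are assumptions:

```python
import random

MAX_BACKOFF_S = 20.0  # standard mode truncates backoff at 20 seconds

def retry_delay(attempt: int, base: float = 1.0, rng=random.random) -> float:
    """Delay before retry `attempt` (1-based): uniform in [0, min(20, base * 2^attempt))."""
    ceiling = min(MAX_BACKOFF_S, base * (2 ** attempt))
    return rng() * ceiling
```

With the AWS SDKs themselves, the equivalent is configuration rather than code, e.g. boto3's `Config(retries={"mode": "standard", "max_attempts": 3})`.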
Auto-scaling
Configure target-tracking scaling on production variants (InvocationsPerInstance) to absorb load.
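Target-tracking scaling is set up through Application Auto Scaling with two calls: RegisterScalableTarget and PutScalingPolicy. A sketch of the parameter dicts those calls expect, assuming a 70-invocations-per-instance target and placeholder endpoint/variant names and capacities:

```python
def scaling_requests(endpoint: str, variant: str,
                     target_invocations: float = 70.0,
                     min_instances: int = 1, max_instances: int = 4):
    """Build RegisterScalableTarget and PutScalingPolicy parameters for a variant."""
    resource_id = f"endpoint/{endpoint}/variant/{variant}"
    register = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": min_instances,
        "MaxCapacity": max_instances,
    }
    policy = {
        "PolicyName": f"{variant}-invocations-tracking",
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_invocations,  # InvocationsPerInstance per minute
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    }
    return register, policy
```

These dicts would be passed to the `application-autoscaling` client's `register_scalable_target` and `put_scaling_policy` methods.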
Quota increases
ML instance counts, training-job concurrency, and notebook quotas are all soft limits; raise via Service Quotas before campaigns.
Async inference for large payloads
Use SageMaker Asynchronous Inference for payloads over 6 MB or processing times over 60 s; requests are queued internally, with payloads staged in and results written to S3.
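Asynchronous inference has two pieces: an AsyncInferenceConfig attached when the endpoint config is created, and InvokeEndpointAsync calls that pass an S3 input location instead of an inline body. A sketch of both parameter shapes; the S3 paths, content type, and concurrency value are placeholders:

```python
def async_inference_config(output_s3_path: str, max_concurrent: int = 4):
    """AsyncInferenceConfig block for CreateEndpointConfig."""
    return {
        "OutputConfig": {"S3OutputPath": output_s3_path},  # results land here
        "ClientConfig": {"MaxConcurrentInvocationsPerInstance": max_concurrent},
    }

def async_invoke_params(endpoint_name: str, input_s3_uri: str):
    """Parameters for the runtime client's invoke_endpoint_async call."""
    return {
        "EndpointName": endpoint_name,
        "InputLocation": input_s3_uri,  # S3 URI of the payload, up to 1 GB
        "ContentType": "application/json",
    }
```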
