Amazon SageMaker · Rate Limits
Amazon SageMaker exposes a control-plane API (CreateTrainingJob, CreateEndpoint, etc.) subject to standard AWS API throttling per account and region, plus a runtime InvokeEndpoint surface whose throughput scales with the underlying instance count and instance type. Endpoint-level limits (concurrent invocations, payload size, timeout) are configurable, and Service Quotas governs the maximum number and type of ML instances per account.
Limits
| Limit | Scope | Value | Notes |
|---|---|---|---|
| SageMaker control-plane API | account/region | see Service Quotas console for SageMaker | Standard AWS API throttling envelope. |
| InvokeEndpoint (real-time) | endpoint | scales with instance count and type | Default soft limit per endpoint; configure auto-scaling on the production variant. Payload up to 6 MB synchronous, 1 GB asynchronous. |
| InvokeEndpoint payload size | endpoint | 6,291,456 bytes (6 MB) | Maximum synchronous payload; use AsynchronousInferenceConfig for larger payloads (up to 1 GB). |
| Synchronous invocation timeout | endpoint | 60 s | Default 60 s; async endpoints can process for up to 1 hour. |
| ML instances per type per region | account/region | see Service Quotas console for SageMaker | Soft limits; raise via Service Quotas before training/deploying at scale. |
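The 6 MB synchronous payload cap above is worth enforcing client-side before a request ever reaches the endpoint. A minimal sketch in plain Python (the limit is, of course, enforced server-side by InvokeEndpoint; the routing logic here is an illustrative assumption):

```python
# 6 MB synchronous payload cap from the limits table above.
MAX_SYNC_PAYLOAD_BYTES = 6_291_456


def choose_invocation_mode(payload: bytes) -> str:
    """Route a request to real-time or asynchronous inference by size."""
    if len(payload) <= MAX_SYNC_PAYLOAD_BYTES:
        return "sync"   # InvokeEndpoint
    return "async"      # InvokeEndpointAsync (payload staged in S3 first)


print(choose_invocation_mode(b"x" * 1024))        # small request
print(choose_invocation_mode(b"x" * 10_000_000))  # oversized request
```

Requests routed to "async" would be uploaded to S3 and submitted via InvokeEndpointAsync rather than sent inline.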
Policies
Backoff with jitter: AWS SDKs default to standard retry mode (truncated exponential backoff with jitter, 20 s maximum delay, 3 attempts).
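If you need the same behavior outside an AWS SDK, truncated exponential backoff with full jitter is easy to hand-roll. A sketch in stdlib Python; the 20 s cap mirrors the SDK's maximum delay, while the base delay and the generic `Exception` catch are illustrative assumptions (in practice you would catch the SDK's ThrottlingException):

```python
import random
import time


def backoff_with_jitter(attempt: int, base: float = 0.5, cap: float = 20.0) -> float:
    """Truncated exponential backoff with full jitter: pick a random delay
    in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def call_with_retries(fn, max_attempts: int = 3):
    """Retry a throttled call, sleeping a jittered delay between attempts."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:  # illustrative; catch ThrottlingException in practice
            if attempt == max_attempts - 1:
                raise
            time.sleep(backoff_with_jitter(attempt))
```

Full jitter (rather than a fixed exponential schedule) spreads retries out so throttled clients do not re-synchronize and hammer the endpoint in waves.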
Auto-scaling: Configure target-tracking scaling on production variants (InvocationsPerInstance) to absorb load.
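Target-tracking on a variant goes through Application Auto Scaling. The helper below builds the request body for its PutScalingPolicy call; the endpoint name, variant name, and target value are placeholders (assumptions), and the variant must first be registered as a scalable target:

```python
def scaling_policy_request(endpoint: str, variant: str, target: float = 70.0) -> dict:
    """Build PutScalingPolicy arguments for target-tracking on a variant.
    `target` is the desired InvocationsPerInstance per minute (illustrative)."""
    return {
        "PolicyName": f"{endpoint}-invocations-tracking",
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint}/variant/{variant}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    }
```

With boto3 this would be passed as `client("application-autoscaling").put_scaling_policy(**scaling_policy_request(...))` after `register_scalable_target`.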
Quota increases: ML instance counts, training-job concurrency, and notebook quotas are all soft limits; raise them via Service Quotas before large-scale campaigns.
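Increases can be requested programmatically through the Service Quotas RequestServiceQuotaIncrease API. A minimal sketch of the request body; the quota code here is a placeholder, not a real code (look up real codes with ListServiceQuotas):

```python
def quota_increase_request(quota_code: str, desired: float) -> dict:
    """Build RequestServiceQuotaIncrease arguments for a SageMaker quota."""
    return {
        "ServiceCode": "sagemaker",
        "QuotaCode": quota_code,  # placeholder; discover via ListServiceQuotas
        "DesiredValue": desired,
    }
```

With boto3: `client("service-quotas").request_service_quota_increase(**quota_increase_request(...))`.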
Async inference for large payloads: Use SageMaker Asynchronous Inference for payloads over 6 MB or processing over 60 s; it queues incoming requests internally and writes responses to an S3 output location.
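Asynchronous inference is enabled by adding an AsyncInferenceConfig section to CreateEndpointConfig. A sketch of that fragment; the S3 path and concurrency value are placeholders (assumptions):

```python
def async_inference_config(output_s3_uri: str, max_concurrent: int = 4) -> dict:
    """Build the AsyncInferenceConfig fragment of CreateEndpointConfig."""
    return {
        "OutputConfig": {"S3OutputPath": output_s3_uri},  # where responses land
        "ClientConfig": {
            # Illustrative cap on in-flight requests per instance.
            "MaxConcurrentInvocationsPerInstance": max_concurrent
        },
    }
```

With boto3 this dict would be passed as the `AsyncInferenceConfig` parameter of `client("sagemaker").create_endpoint_config(...)`, after which clients submit work via InvokeEndpointAsync.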