Deploying models efficiently, reliably, and cost-effectively is a critical challenge for organizations of all sizes. As organizations increasingly move foundation models (FMs) and other machine learning (ML) models to production, they face challenges around resource utilization, cost-efficiency, and maintaining high availability during updates. Amazon SageMaker AI introduced inference component functionality that can help organizations reduce model deployment costs by optimizing resource utilization through intelligent model packing and scaling. Inference components abstract ML models and let you assign dedicated resources and specific scaling policies to each model.
However, updating these models—especially in production environments with strict latency SLAs—has historically risked downtime or resource bottlenecks. Traditional blue/green deployments often struggle with capacity constraints, making updates unpredictable for GPU-heavy models. To address this, we’re excited to announce another powerful enhancement to SageMaker AI: rolling updates for inference component endpoints, a feature designed to streamline updates for models of different sizes while minimizing operational overhead.
In this post, we discuss the challenges faced by organizations when updating models in production. Then we deep dive into the new rolling update feature for inference components and provide practical examples using DeepSeek distilled models to demonstrate this feature. Finally, we explore how to set up rolling updates in different scenarios.
Challenges with blue/green deployment
Traditionally, SageMaker AI inference has supported the blue/green deployment pattern for updating inference components in production. Though effective for many scenarios, this approach comes with specific challenges:
Resource inefficiency – Blue/green deployment requires provisioning resources for both the current (blue) and new (green) environments simultaneously. For inference components running on expensive GPU instances like P4d or G5, this means potentially doubling the resource requirements during deployments. Consider an example where a customer has 10 copies of an inference component spread across 5 ml.p4d.24xlarge instances, all operating at full capacity. With blue/green deployment, SageMaker AI would need to provision five additional ml.p4d.24xlarge instances to host the new version of the inference component before switching traffic and decommissioning the old instances.
Limited computing resources – For customers using powerful GPU instances like the P or G series, the required capacity might not be available in a given Availability Zone or Region. This often results in instance capacity exceptions during deployments, causing update failures and rollbacks.
All-or-nothing transitions – Traditional blue/green deployments shift all traffic at one time or based on a configured schedule. This leaves limited room for gradual validation and increases the area of effect if issues arise with the new deployment.
Although blue/green deployment has been a reliable strategy for zero-downtime updates, its limitations become glaring when deploying large-scale large language models (LLMs) or high-throughput models on premium GPU instances. These challenges demand a more nuanced approach—one that incrementally validates updates while optimizing resource usage. Rolling updates for inference components are designed to eliminate the rigidity of blue/green deployments. By updating models in controlled batches, dynamically scaling infrastructure, and integrating real-time safety checks, this strategy makes sure deployments remain cost-effective, reliable, and adaptable—even for GPU-heavy workloads.
Rolling deployment for inference component updates
As mentioned earlier, inference components are introduced as a SageMaker AI feature to optimize costs; they allow you to define and deploy the specific resources needed for your model inference workload. By right-sizing compute resources to match your model’s requirements, you can save costs during updates compared to traditional deployment approaches.
With rolling updates, SageMaker AI deploys new model versions in configurable batches of inference components while dynamically scaling instances. This is particularly impactful for LLMs:
Batch size flexibility – When updating the inference components in a SageMaker AI endpoint, you can specify the batch size for each rolling step. For each step, SageMaker AI provisions capacity based on the specified batch size on the new endpoint fleet, routes traffic to that fleet, and stops capacity on the old endpoint fleet. Smaller models like DeepSeek Distilled Llama 8B can use larger batches for rapid updates, and larger models like DeepSeek Distilled Llama 70B use smaller batches to limit GPU contention.
Automated safety guards – Integrated Amazon CloudWatch alarms monitor metrics on an inference component. You can configure the alarms to check whether the newly deployed version of the inference component is working properly. If a CloudWatch alarm is triggered, SageMaker AI starts an automated rollback.
The new functionality is implemented through extensions to the SageMaker AI API, primarily as new parameters in the UpdateInferenceComponent API.
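The following is a minimal sketch of such a call using the boto3 update_inference_component API; the component name, container image, model artifact location, and timing values are placeholder assumptions rather than values from a real deployment:

import boto3

sagemaker_client = boto3.client("sagemaker")

# Illustrative values only; substitute your own inference component name and specification
sagemaker_client.update_inference_component(
    InferenceComponentName="my-inference-component",  # placeholder name
    Specification={
        # New container and compute requirements for the updated model version (placeholders)
        "Container": {
            "Image": "<new-container-image-uri>",
            "ArtifactUrl": "<s3-uri-of-new-model-artifacts>",
        },
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 1,
            "MinMemoryRequiredInMb": 1024,
        },
    },
    DeploymentConfig={
        "RollingUpdatePolicy": {
            # Update one copy per rolling step
            "MaximumBatchSize": {"Type": "COPY_COUNT", "Value": 1},
            # Baking period between batches while alarms are monitored
            "WaitIntervalInSeconds": 120,
            # Overall time limit for the rolling deployment
            "MaximumExecutionTimeoutInSeconds": 1800,
            # Roll back one copy at a time if something goes wrong
            "RollbackMaximumBatchSize": {"Type": "COPY_COUNT", "Value": 1},
        },
        "AutoRollbackConfiguration": {
            # Optional CloudWatch alarm that triggers an automatic rollback
            "Alarms": [{"AlarmName": "ic-4xx-errors-alarm"}],
        },
    },
)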
The preceding code uses the following parameters:
MaximumBatchSize – This is a required parameter and defines the batch size for each rolling step in the deployment process. For each step, SageMaker AI provisions capacity based on the specified batch size on the new endpoint fleet, routes traffic to that fleet, and stops capacity on the old endpoint fleet. The value must be between 5% and 50% of the inference component's copy count.
Type – This parameter specifies the endpoint capacity type, either COPY_COUNT or CAPACITY_PERCENT.
Value – This defines the capacity size, either as a number of inference component copies or a capacity percentage.
MaximumExecutionTimeoutInSeconds – This is the maximum time that the rolling deployment would spend on the overall execution. Exceeding this limit causes a timeout.
RollbackMaximumBatchSize – This is the batch size for a rollback to the old endpoint fleet. If this field is absent, the value is set to the default, which is 100% of the total capacity. When the default is used, SageMaker AI provisions the entire capacity of the old fleet at the same time during rollback.
Value – This defines the rollback capacity size, in the units given by Type. For a rollback strategy, if you don't specify the fields in this object, or if you set the value to 100%, SageMaker AI uses a blue/green rollback strategy and rolls all traffic back to the blue fleet.
WaitIntervalInSeconds – This is the length of the baking period between batches, during which SageMaker AI monitors the configured CloudWatch alarms on the new fleet before proceeding to the next batch.
AutoRollbackConfiguration – This is the automatic rollback configuration for handling endpoint deployment failures and recovery.
AlarmName – The name of a CloudWatch alarm that is configured to monitor metrics on an InferenceComponent. You can configure it to check whether the newly deployed version of the InferenceComponent is working properly.
For more information about the SageMaker AI API, refer to the SageMaker AI API Reference.
Customer experience
Let’s explore how rolling updates work in practice with several common scenarios, using different-sized LLMs. You can find the example notebook in the GitHub repo.
Scenario 1: Multiple single GPU cluster
In this scenario, assume you’re running an endpoint with three ml.g5.2xlarge instances, each with a single GPU. The endpoint hosts an inference component that requires one GPU accelerator, which means each instance holds one copy. When you want to update the inference component to use a new inference component version, you can use rolling updates to minimize disruption.
You can configure a rolling update with a batch size of one, meaning SageMaker AI will update one copy at a time. During the update process, SageMaker AI first identifies available capacity in the existing instances. Because none of the existing instances has room for an additional temporary workload, SageMaker AI launches new ml.g5.2xlarge instances one at a time and deploys one copy of the new inference component version to each new instance. After the specified wait interval, and after the new inference component's container passes its health checks, SageMaker AI removes one copy of the old version (because each copy is hosted on its own instance, that instance is torn down accordingly), completing the update for the first batch.
This process repeats for the remaining copies of the inference component, providing a smooth transition with zero downtime. The gradual nature of the update minimizes risk and allows you to maintain consistent availability throughout the deployment process. The following diagram shows this process.
Scenario 2: Update with automatic rollback
In another scenario, you might be updating your inference component from Llama-3.1-8B-Instruct to DeepSeek-R1-Distill-Llama-8B, but the new model version has different API expectations. In this use case, you have configured a CloudWatch alarm to monitor for 4xx errors, which would indicate API compatibility issues.
You can initiate a rolling update with a batch size of one copy. SageMaker AI deploys the first copy of the new version on a new GPU instance. When the new instance is ready to serve traffic, SageMaker AI will forward a proportion of the invocation requests to this new model. However, in this example, the new model version, which is missing the “MESSAGES_API_ENABLED” environment variable configuration, will begin to return 4xx errors when receiving requests in the Messages API format.
The configured CloudWatch alarm detects these errors and transitions to the alarm state. SageMaker AI automatically detects this alarm state and initiates a rollback process according to the rollback configuration. Following the specified rollback batch size, SageMaker AI removes the problematic new model version and maintains the original working version, preventing widespread service disruption. The endpoint returns to its original state with traffic being handled by the properly functioning original model version.
The following code snippet shows how to set up a CloudWatch alarm to monitor 4xx errors:
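The snippet below is a minimal sketch that assumes the per-inference-component Invocation4XXErrors metric in the AWS/SageMaker namespace; the alarm name, inference component name, and threshold are placeholder choices:

import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm on 4xx errors emitted by the inference component (placeholder names and threshold)
cloudwatch.put_metric_alarm(
    AlarmName="ic-4xx-errors-alarm",
    Namespace="AWS/SageMaker",
    MetricName="Invocation4XXErrors",
    Dimensions=[
        {"Name": "InferenceComponentName", "Value": "my-inference-component"},
    ],
    Statistic="Sum",
    Period=60,                 # evaluate over 1-minute windows
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",  # no traffic should not trigger the alarm
)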
Then you can use this CloudWatch alarm in the update request:
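Continuing the sketch above (the component name and new model specification are placeholders; in this scenario the specification would point at the DeepSeek-R1-Distill-Llama-8B container and model artifacts), the alarm is referenced in the AutoRollbackConfiguration of the update request:

import boto3

sagemaker_client = boto3.client("sagemaker")

# Placeholder specification standing in for the new model version
new_ic_specification = {
    "Container": {
        "Image": "<new-container-image-uri>",
        "ArtifactUrl": "<s3-uri-of-new-model-artifacts>",
    },
    "ComputeResourceRequirements": {
        "NumberOfAcceleratorDevicesRequired": 1,
        "MinMemoryRequiredInMb": 1024,
    },
}

sagemaker_client.update_inference_component(
    InferenceComponentName="my-inference-component",  # placeholder name
    Specification=new_ic_specification,
    DeploymentConfig={
        "RollingUpdatePolicy": {
            "MaximumBatchSize": {"Type": "COPY_COUNT", "Value": 1},
            "WaitIntervalInSeconds": 120,
            "RollbackMaximumBatchSize": {"Type": "COPY_COUNT", "Value": 1},
        },
        "AutoRollbackConfiguration": {
            # The 4xx alarm created above; if it fires, SageMaker AI rolls back automatically
            "Alarms": [{"AlarmName": "ic-4xx-errors-alarm"}],
        },
    },
)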
Scenario 3: Update with sufficient capacity in the existing instances
If an existing endpoint has multiple GPU accelerators and not all the accelerators are used, the update can use existing GPU accelerators without launching new instances to the endpoint. Consider an endpoint configured with two ml.g5.12xlarge instances, each with four GPU accelerators. The endpoint hosts two inference components: IC-1 requires one accelerator and IC-2 also requires one accelerator. On one ml.g5.12xlarge instance, four copies of IC-1 have been created; on the other instance, two copies of IC-2 have been created. There are still two GPU accelerators available on the second instance.
When you initiate an update for IC-1 with a batch size of two copies, SageMaker AI determines that there is sufficient capacity in the existing instances to host the new versions while maintaining the old ones. It creates two copies of the new IC-1 version on the second instance. When the containers are up and running, SageMaker AI starts routing traffic to the new IC-1 copies and removes two of the old IC-1 copies from the first instance. You are not charged for the new inference components until they start taking invocations and generating responses.
Now another two GPU accelerators are free. SageMaker AI updates the second batch using the accelerators that just became available. After the process is complete, the endpoint has four copies of IC-1 running the new version and two copies of IC-2 that weren't changed.
Scenario 4: Update requiring additional instance capacity
Consider an endpoint initially configured with one ml.g5.12xlarge instance (4 GPUs total) and managed instance scaling (MIS) with the maximum instance count set to two. The endpoint hosts two inference components: IC-1 (Llama 8B), requiring 1 GPU with two copies, and IC-2 (DeepSeek Distilled Llama 14B), also requiring 1 GPU with two copies, using all 4 available GPUs.
When you initiate an update for IC-1 with a batch size of two copies, SageMaker AI determines that there's insufficient capacity in the existing instances to host the new versions while maintaining the old ones. Instead of failing the update, because you have configured MIS, SageMaker AI automatically provisions a second ml.g5.12xlarge instance to host the new inference components.
During the update process, SageMaker AI deploys two copies of the new IC-1 version onto the newly provisioned instance, as shown in the following diagram. After the new inference components are up and running, SageMaker AI begins removing the old IC-1 copies from the original instance. By the end of the update, the first instance hosts IC-2 using 2 GPUs, and the newly provisioned second instance hosts the updated IC-1 with two copies using 2 GPUs. There is now free GPU capacity on both instances, and you can deploy more inference component copies or new models to the same endpoint using the available GPU resources. If you set up managed instance auto scaling and configure inference component auto scaling to allow zero copies, you can scale the inference component copies down to zero, which results in the corresponding instance being scaled down. When the inference component is scaled up again, SageMaker AI launches the copies on existing instances with available GPU accelerators, as described in scenario 3.
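As a rough sketch of the setup this scenario assumes, managed instance scaling can be enabled when the endpoint config is created; the config name, variant name, and execution role below are placeholders:

import boto3

sagemaker_client = boto3.client("sagemaker")

# Placeholder endpoint config with managed instance scaling (MIS) enabled,
# allowing SageMaker AI to add a second ml.g5.12xlarge instance during updates
sagemaker_client.create_endpoint_config(
    EndpointConfigName="rolling-update-demo-config",  # placeholder name
    ExecutionRoleArn="arn:aws:iam::111122223333:role/SageMakerExecutionRole",  # placeholder role
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "InstanceType": "ml.g5.12xlarge",
            "InitialInstanceCount": 1,
            "ManagedInstanceScaling": {
                "Status": "ENABLED",
                "MinInstanceCount": 1,
                "MaxInstanceCount": 2,  # cap used in this scenario
            },
            "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
        }
    ],
)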
Scenario 5: Update facing insufficient capacity
In scenarios where there isn't enough GPU capacity, SageMaker AI provides clear feedback about capacity constraints. Consider an endpoint running on 30 ml.g6e.16xlarge instances, each already fully utilized with inference components. You want to update an existing inference component using a rolling deployment with a batch size of 4, but after the first four batches are updated, there isn't enough GPU capacity available for the remaining update. In this case, SageMaker AI automatically rolls back to the previous setup and stops the update process.
This rollback can end in one of two final states. In the first case, the rollback succeeds because capacity became available to launch instances for the old model version. In the second case, the capacity issue persists during the rollback, and the endpoint shows as UPDATE_ROLLBACK_FAILED. The existing instances can still serve traffic, but to move the endpoint out of the failed status, you need to contact your AWS support team.
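A quick way to see where the endpoint landed is to inspect its status; the endpoint name below is a placeholder:

import boto3

sagemaker_client = boto3.client("sagemaker")

# Inspect the endpoint status after the update attempt (placeholder endpoint name)
response = sagemaker_client.describe_endpoint(EndpointName="rolling-update-demo-endpoint")
print(response["EndpointStatus"])  # for example InService, Updating, or the update-rollback-failed state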
Additional considerations
As mentioned earlier, when using blue/green deployment to update the inference components on an endpoint, you need to provision resources for both the current (blue) and new (green) environments simultaneously. When you're using rolling updates for inference components on the endpoint, you can use the following equation to calculate the account-level service quota required for the instance type. Assume the GPU instance used by the endpoint has X GPU accelerators, each inference component copy requires Y GPU accelerators, the maximum batch size is set to Z, and the current endpoint has N instances. The account-level service quota for this instance type should then be at least the output of the equation:
ROUNDUP(Z x Y / X) + N
For example, assume the current endpoint has 8 (N) ml.g5.12xlarge instances, each with 4 (X) GPU accelerators. You set the maximum batch size to 2 (Z) copies, and each copy needs 1 (Y) GPU accelerator. The minimum AWS service quota value for ml.g5.12xlarge is ROUNDUP(2 x 1 / 4) + 8 = 9. In another scenario, where each inference component copy requires 4 GPU accelerators, the required account-level service quota for the same instance type is ROUNDUP(2 x 4 / 4) + 8 = 10.
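The arithmetic is easy to script; the helper below is just an illustration of the formula, with the function and variable names chosen for this post:

import math

def min_instance_quota(z_batch_copies: int, y_gpus_per_copy: int, x_gpus_per_instance: int, n_current_instances: int) -> int:
    """Minimum account-level instance quota needed during a rolling update: ROUNDUP(Z x Y / X) + N."""
    return math.ceil(z_batch_copies * y_gpus_per_copy / x_gpus_per_instance) + n_current_instances

print(min_instance_quota(2, 1, 4, 8))  # 9
print(min_instance_quota(2, 4, 4, 8))  # 10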
Conclusion
Rolling updates for inference components represent a significant enhancement to the deployment capabilities of SageMaker AI. This feature directly addresses the challenges of updating model deployments in production, particularly for GPU-heavy workloads, and it eliminates capacity guesswork and reduces rollback risk. By combining batch-based updates with automated safeguards, SageMaker AI makes sure deployments are agile and resilient.
Key benefits include:
Reduced resource overhead during deployments, eliminating the need to provision duplicate fleets
Improved deployment guardrails with gradual updates and automatic rollback capabilities
Continued availability during updates with configurable batch sizes
Straightforward deployment of resource-intensive models that require multiple accelerators
Whether you’re deploying compact models or larger multi-accelerator models, rolling updates provide a more efficient, cost-effective, and safer path to keeping your ML models current in production.
We encourage you to try this new capability with your SageMaker AI endpoints and discover how it can enhance your ML operations. For more information, check out the SageMaker AI documentation or connect with your AWS account team.
About the authors
Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple Generative AI initiatives across APJ, harnessing the power of Large Language Models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.
Andrew Smith is a Cloud Support Engineer in the SageMaker, Vision & Other team at AWS, based in Sydney, Australia. He supports customers using many AI/ML services on AWS with expertise in working with Amazon SageMaker. Outside of work, he enjoys spending time with friends and family as well as learning about different technologies.
Dustin Liu is a solutions architect at AWS, focused on supporting financial services and insurance (FSI) startups and SaaS companies. He has a diverse background spanning data engineering, data science, and machine learning, and he is passionate about leveraging AI/ML to drive innovation and business transformation.
Vivek Gangasani is a Senior GenAI Specialist Solutions Architect at AWS. He helps emerging generative AI companies build innovative solutions using AWS services and accelerated compute. Currently, he is focused on developing strategies for fine-tuning and optimizing the inference performance of large language models. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.
Shikher Mishra is a Software Development Engineer with the SageMaker Inference team with over 9 years of industry experience. He is passionate about building scalable and efficient solutions that empower customers to deploy and manage machine learning applications seamlessly. In his spare time, Shikher enjoys outdoor sports, hiking, and traveling.
June Won is a product manager with Amazon SageMaker JumpStart. He focuses on making foundation models easily discoverable and usable to help customers build generative AI applications. His experience at Amazon also includes mobile shopping applications and last mile delivery.