Amazon SageMaker HyperPod is purpose-built infrastructure for optimizing foundation model training and inference at scale. SageMaker HyperPod removes the undifferentiated heavy lifting involved in building and optimizing machine learning (ML) infrastructure for training foundation models (FMs).
As AI adoption spreads across a multitude of domains and use cases, the need for security and multiple storage options becomes more pertinent. Large enterprises want to make sure that their GPU clusters follow organization-wide policies and security rules. Two new features in SageMaker HyperPod EKS enhance this control and flexibility for production deployment of large-scale machine learning workloads: customer managed key (CMK) support and Amazon EBS CSI driver support.
- Customer managed keys (CMK) support: HyperPod EKS now allows customers to encrypt primary and secondary EBS volumes attached to HyperPod instances or their custom AMI with their own encryption keys. To learn more about creating a custom AMI for your HyperPod cluster, please see our blog post and documentation.
- Amazon EBS CSI support: HyperPod EKS now supports the Amazon Elastic Block Store (Amazon EBS) Container Storage Interface (CSI) driver, which manages the lifecycle of Amazon EBS volumes as storage for the Kubernetes volumes that you create.
Prerequisites
To use these features, verify that you have the following prerequisites:
- The AWS CLI is installed and configured with your account
- You have a SageMaker HyperPod cluster with Amazon EKS orchestration. To create your HyperPod cluster, please see Creating a SageMaker HyperPod cluster with Amazon EKS orchestration
- CMK support can only be used with a HyperPod cluster that has NodeProvisioningMode set to Continuous. EBS CSI driver support can be used with either NodeProvisioningMode setting. For more details on how to create your cluster to use continuous provisioning, please see Continuous provisioning for enhanced cluster operations on Amazon EKS.
Customer managed key support
With CMK support you control the encryption capabilities required for compliance and security governance, ultimately helping to resolve the critical business risk of unmet regulatory and organizational security requirements, such as HIPAA and FIPS compliance. CMK support allows customers to encrypt EBS volumes attached to their HyperPod instances using their own encryption keys. When creating a cluster, updating a cluster, or adding new instance groups, customers can specify a CMK for both root and secondary EBS volumes. Additionally, customers can encrypt their custom AMIs with CMK, providing comprehensive data-at-rest protection with customer-controlled keys throughout the instance lifecycle.
Here are the key points about CMK configuration:
For EBS volumes:
- CMK is optional – if not specified, volumes will be encrypted with AWS managed keys
- You cannot update/change the CMK for existing volumes (CMK is immutable)
- Each instance group can have:
  - One root volume configuration with CMK
  - One secondary volume configuration with CMK
- Root volume configurations cannot specify volume size
- Secondary volume configurations must specify volume size
- You can specify different CMKs for root and secondary volumes
For custom AMIs:
- You can encrypt custom AMIs with CMK independently of volume encryption
- Unlike volume CMK, custom AMI CMK is mutable – customers can patch clusters using AMIs encrypted with different CMKs
Important: When using customer managed keys, we strongly recommend that you use different KMS keys for each instance group in your cluster. Using the same customer managed key across multiple instance groups might lead to unintentional continued permissions even if you try to revoke a grant. For example:
- If you revoke an AWS KMS grant for one instance group’s volumes, that instance group might still allow scaling and patching operations due to grants existing on other instance groups using the same key
- To help prevent this issue, make sure that you assign unique KMS keys to each instance group in your cluster
Configuring CMK on HyperPod
In this section, we will demonstrate how to set up CMK for your HyperPod cluster. As a prerequisite, make sure you have the following:
- Verify that the AWS IAM execution role that you're using for your CMK-enabled instance group has the following permissions for AWS KMS added. The kms:CreateGrant permission allows HyperPod to take the following actions using permissions to your KMS key:
  - Scaling out your instance count (UpdateCluster operations)
  - Adding cluster nodes (BatchAddClusterNodes operations)
  - Patching software (UpdateClusterSoftware operations)
- Include this in your KMS key policy:
You can modify your key policy by following the Change a key policy documentation. Replace the variables <iam-hp-execution-role>, <region>, <account-id>, and <key-id> with your HyperPod execution role (the role linked to your instance group using CMKs), the AWS Region your HyperPod cluster is deployed in, your account ID, and your KMS key ID, respectively.
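For example, the key policy statement might look like the following. The kms:CreateGrant action comes from the requirement above; the remaining actions are typical for Amazon EBS encryption and are shown here as an assumption, so confirm the exact list against the documentation linked above.

```json
{
  "Sid": "AllowHyperPodExecutionRoleUseOfTheKey",
  "Effect": "Allow",
  "Principal": {
    "AWS": "arn:aws:iam::<account-id>:role/<iam-hp-execution-role>"
  },
  "Action": [
    "kms:CreateGrant",
    "kms:DescribeKey",
    "kms:Decrypt",
    "kms:GenerateDataKeyWithoutPlaintext",
    "kms:ReEncrypt*"
  ],
  "Resource": "arn:aws:kms:<region>:<account-id>:key/<key-id>"
}
```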
Now, let’s use the CMK.
You can specify your customer managed keys when creating or updating a cluster using the CreateCluster and UpdateCluster API operations. The InstanceStorageConfigs structure allows up to two EbsVolumeConfig configurations, in which you can configure the root Amazon EBS volume and, optionally, a secondary volume. You can use the same KMS key or a different KMS key for each volume, depending on your needs.
When you are configuring the root volume, the following requirements apply:
- RootVolume must be set to True. The default value is False, which configures the secondary volume instead.
- The VolumeKmsKeyId field is required and you must specify your customer managed key. This is because the root volume must be encrypted with either an AWS owned key or a customer managed key (if you don't specify your own, then an AWS owned key is used).
- You can't specify the VolumeSizeInGB field for root volumes, since HyperPod determines the size of the root volume for you.
When configuring the secondary volume, the following requirements apply:
- RootVolume must be False (the default value of this field is False).
- The VolumeKmsKeyId field is optional. You can use the same customer managed key you specified for the root volume, or you can use a different key.
- The VolumeSizeInGB field is required, since you must specify your desired size for the secondary volume.
Example of creating a cluster with CMK support:
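The following is a minimal sketch of creating a cluster with CMK-encrypted root and secondary volumes using the AWS CLI. The cluster name, instance group name, instance type, subnets, security groups, and lifecycle script location are placeholder values for illustration; the InstanceStorageConfigs section reflects the root and secondary volume requirements described above.

```bash
# Write the request payload. Replace the placeholder ARNs, IDs, and S3 paths with your own values.
cat > create-cluster.json <<'EOF'
{
  "ClusterName": "my-hyperpod-cluster",
  "NodeProvisioningMode": "Continuous",
  "Orchestrator": {
    "Eks": { "ClusterArn": "arn:aws:eks:<region>:<account-id>:cluster/<eks-cluster-name>" }
  },
  "VpcConfig": {
    "SecurityGroupIds": ["sg-0123456789abcdef0"],
    "Subnets": ["subnet-0123456789abcdef0"]
  },
  "InstanceGroups": [
    {
      "InstanceGroupName": "worker-group-1",
      "InstanceType": "ml.g5.8xlarge",
      "InstanceCount": 2,
      "ExecutionRole": "arn:aws:iam::<account-id>:role/<iam-hp-execution-role>",
      "ThreadsPerCore": 1,
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://<bucket>/lifecycle-scripts/",
        "OnCreate": "on_create.sh"
      },
      "InstanceStorageConfigs": [
        {
          "EbsVolumeConfig": {
            "RootVolume": true,
            "VolumeKmsKeyId": "arn:aws:kms:<region>:<account-id>:key/<key-id>"
          }
        },
        {
          "EbsVolumeConfig": {
            "RootVolume": false,
            "VolumeSizeInGB": 500,
            "VolumeKmsKeyId": "arn:aws:kms:<region>:<account-id>:key/<secondary-key-id>"
          }
        }
      ]
    }
  ]
}
EOF

# Create the cluster (CMK support requires NodeProvisioningMode set to Continuous)
aws sagemaker create-cluster --cli-input-json file://create-cluster.json
```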
Example of updating a cluster with CMK support:
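Similarly, the following is a minimal sketch of updating the cluster to add an instance group with CMK-encrypted volumes. The instance group name, counts, and ARNs are placeholders; per the recommendation above, a different key (<key-id-2>) is used for the new group.

```bash
# Write the update payload, then apply it with update-cluster
cat > update-cluster.json <<'EOF'
{
  "ClusterName": "my-hyperpod-cluster",
  "InstanceGroups": [
    {
      "InstanceGroupName": "worker-group-2",
      "InstanceType": "ml.g5.8xlarge",
      "InstanceCount": 2,
      "ExecutionRole": "arn:aws:iam::<account-id>:role/<iam-hp-execution-role>",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://<bucket>/lifecycle-scripts/",
        "OnCreate": "on_create.sh"
      },
      "InstanceStorageConfigs": [
        {
          "EbsVolumeConfig": {
            "RootVolume": true,
            "VolumeKmsKeyId": "arn:aws:kms:<region>:<account-id>:key/<key-id-2>"
          }
        },
        {
          "EbsVolumeConfig": {
            "RootVolume": false,
            "VolumeSizeInGB": 500,
            "VolumeKmsKeyId": "arn:aws:kms:<region>:<account-id>:key/<key-id-2>"
          }
        }
      ]
    }
  ]
}
EOF

aws sagemaker update-cluster --cli-input-json file://update-cluster.json
```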
To use a custom AMI with CMK encryption, you first have to build your custom AMI encrypted with your CMK. You can do this with the following tools, but note that these commands are sample snippets. Follow the linked documentation to generate the AMI.
- Amazon EC2 console:
  - Right-click your customized Amazon EC2 instance and choose Create Image.
  - In the Encryption section, select Encrypt snapshots.
  - Select your KMS key from the dropdown, for example arn:aws:kms:us-east-2:111122223333:key/<your-kms-key-id>, or use the key alias alias/<your-hyperpod-key>.
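- AWS CLI: as a sample sketch (the instance ID, image ID, Region, and key alias are placeholders), you can create an image from your customized instance and then copy it with CMK encryption:

```bash
# Create an (unencrypted) image from the customized instance
aws ec2 create-image \
  --instance-id i-0123456789abcdef0 \
  --name "hyperpod-custom-ami-unencrypted"

# Copy the image, encrypting its snapshots with your customer managed key
aws ec2 copy-image \
  --source-image-id ami-0123456789abcdef0 \
  --source-region us-east-2 \
  --region us-east-2 \
  --name "hyperpod-custom-ami-cmk" \
  --encrypted \
  --kms-key-id alias/<your-hyperpod-key>
```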
To use this encrypted custom AMI, please follow our blog or documentation on using your custom AMI on HyperPod.
Amazon EBS CSI driver support
With Amazon Elastic Block Store (Amazon EBS) Container Storage Interface (CSI) support in HyperPod, you can manage the lifecycle of Amazon EBS volumes as storage for the Kubernetes volumes created for your EKS clusters. Supporting both ephemeral and persistent volumes, this enhancement addresses the need for dynamic storage management in large-scale AI workloads, efficiently handling the massive datasets and model artifacts for foundation model training and inference.
HyperPod now offers two flexible approaches for provisioning and mounting additional Amazon EBS volumes on nodes. The first method, which isn’t new, uses InstanceStorageConfigs for cluster-level volume provisioning when creating or updating instance groups, requiring users to set the local path to /opt/sagemaker in their Pod configuration file. Alternatively, users can implement the Amazon EBS CSI driver for dynamic Pod-level volume management, providing greater control over storage allocation.
The Amazon EBS CSI driver was previously supported exclusively on Amazon EKS clusters; it now unlocks the same storage capabilities for SageMaker HyperPod as well. To read more about its capabilities, see the official documentation page.
Demo of the Amazon EBS CSI driver on SageMaker HyperPod
In this section, we demo one of the capabilities of the Amazon EBS CSI driver: volume resizing.
Setting up the EBS CSI driver
In the following sections, we will ask you to substitute some parameters with values unique to your demo. When we refer to <eks-cluster-name>, that's the name of the underlying Amazon EKS cluster, not the SageMaker HyperPod cluster. Configure your Kubernetes config to add a new context so that tools such as kubectl interact with your EKS cluster. Run the following:
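For example, assuming your AWS CLI is configured for the Region where the cluster runs:

```bash
# Add a kubeconfig context for the underlying EKS cluster (not the HyperPod cluster name)
aws eks update-kubeconfig --region <region> --name <eks-cluster-name>
```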
Next, we need to create an IAM service account with an appropriate policy to work with Amazon EBS CSI. The IAM service account is the IAM entity that Amazon EKS uses to interact with other AWS services. We chose eksctl to create the service account and attach the required policy in a single command; however, there are other ways to do the same.
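The following is a sketch of that eksctl command, assuming an IAM OIDC provider is already associated with your EKS cluster; the role name DemoRole and the managed policy ARN match the outcomes listed below.

```bash
# Create the service account, the DemoRole IAM role, and attach the EBS CSI managed policy
eksctl create iamserviceaccount \
  --name ebs-csi-controller-sa \
  --namespace kube-system \
  --cluster <eks-cluster-name> \
  --region <region> \
  --role-name DemoRole \
  --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
  --approve
```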
After the successful execution of the command, we should expect three outcomes:
- IAM Service account with the name ebs-csi-controller-sa is created
- IAM role named DemoRole is created with policy arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy attached
- The ebs-csi-controller-sa service account consumes the DemoRole
During this demo, you should see progress output from the previous command as eksctl creates these resources.
The final step of the IAM Service Account configuration is to attach extra policies required for the interaction between Amazon EKS and SageMaker HyperPod, mentioned in the feature’s documentation. We will do this with an inline policy, created from the terminal.
The following code snippet creates a temporary file containing the inline policy to be attached to the newly created role, where you need to put in three values related to your demo:
- <region>
- <account-id>
- <eks-cluster-name>
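Here is a sketch of the mechanics only; the eks:DescribeCluster statement below is purely an illustrative placeholder, so copy the actual actions and resources from the feature's documentation.

```bash
# Write the inline policy to a temporary file. Replace the Statement contents with the
# policy from the HyperPod EBS CSI documentation, substituting <region>, <account-id>,
# and <eks-cluster-name> with your values.
cat > /tmp/hyperpod-ebs-csi-inline-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["eks:DescribeCluster"],
      "Resource": "arn:aws:eks:<region>:<account-id>:cluster/<eks-cluster-name>"
    }
  ]
}
EOF
```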
Once the file is configured with your parameters, apply the policy to the previously created DemoRole using eksctl:
To observe the results of the creation, we can use kubectl to inspect the service account's state and the IAM role consumed by it:
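For example, the following inspects the service account and the role ARN annotation that eksctl added:

```bash
kubectl describe serviceaccount ebs-csi-controller-sa -n kube-system
```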
To observe the role, we can check both attached managed policies and inline policies. For the attached managed policies:
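For example, using the AWS CLI:

```bash
aws iam list-attached-role-policies --role-name DemoRole
```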
For the inline policies:
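For example, to list the inline policies and, optionally, inspect the policy document itself (replace <inline-policy-name> with the name you used):

```bash
aws iam list-role-policies --role-name DemoRole
aws iam get-role-policy --role-name DemoRole --policy-name <inline-policy-name>
```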
Now, we are ready to create and install the Amazon EBS CSI add-on on the EKS cluster. For this example, use the following command:
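A sketch of that command, using the DemoRole created earlier (substitute your EKS cluster name and account ID):

```bash
# Install the EBS CSI driver as an EKS add-on, using DemoRole for the controller service account
aws eks create-addon \
  --cluster-name <eks-cluster-name> \
  --addon-name aws-ebs-csi-driver \
  --service-account-role-arn arn:aws:iam::<account-id>:role/DemoRole
```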
You will see a response indicating that the add-on creation has started, with a status of CREATING.
To track the status of add-on creation, you can use the watch utility from the terminal.
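For example, polling the add-on status until it becomes ACTIVE:

```bash
watch -n 10 "aws eks describe-addon \
  --cluster-name <eks-cluster-name> \
  --addon-name aws-ebs-csi-driver \
  --query addon.status --output text"
```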
Note: If the status is stuck on CREATING for more than 5 minutes, you should debug the state of your cluster to see whether the pods are running. If the status isn't changing, you might not have a sufficient number of instances, or the instance type might be too small. If you observe that many pods in the cluster are in the Pending state, that might be an indicator of one of these issues.
Running the volume resize demo
Now we're ready for the demo: all the components are installed and ready to interact with each other. On your local machine, download the AWS EBS CSI driver repository, then navigate to the folder of the resizing example.
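For example, assuming the resizing example lives under examples/kubernetes/resizing as in current versions of the repository:

```bash
# Clone the driver repository and change into the volume resizing example
git clone https://github.com/kubernetes-sigs/aws-ebs-csi-driver.git
cd aws-ebs-csi-driver/examples/kubernetes/resizing
```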
Within this folder, we use the provided example, which you can study further by reading the readme file.
Following the readme file, we are going to complete the following steps (a consolidated command sketch follows the list):
- Deploy the provided pod on your cluster along with the StorageClass and PersistentVolumeClaim.
- Wait for the PersistentVolumeClaim to bind and the pod to reach the Running state.
- Expand the volume size by increasing the storage capacity in the PersistentVolumeClaim specification using an editor (we use vim, but you can use another editor). Be attentive: the manifest contains the storage capacity in two places; one is the specification, while the other is only a status. Changing the status will result in no changes.
- Wait a few minutes and verify that both the persistent volume and the persistent volume claim have been appropriately resized. To do so, first check the claim ebs-claim, then use the VOLUME value from the output to check the volume itself. In both outputs you should now see the capacity changed to 8Gi from the initial 4Gi.
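Here is a consolidated sketch of the commands for these steps. The claim name ebs-claim comes from the example; the pod name app and the manifests/ path are assumptions based on the repository layout, so adjust them if your copy of the example differs.

```bash
# Deploy the StorageClass, PersistentVolumeClaim, and pod from the example
kubectl apply -f manifests/

# Wait for the claim to bind and the pod to reach Running
kubectl get pvc ebs-claim
kubectl get pod app

# Edit the claim and increase spec.resources.requests.storage (for example, 4Gi -> 8Gi)
kubectl edit pvc ebs-claim

# Verify the new capacity on both the claim and the underlying volume
kubectl get pvc ebs-claim
kubectl get pv $(kubectl get pvc ebs-claim -o jsonpath='{.spec.volumeName}')
```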
Clean up the example:
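For example, removing everything the example created:

```bash
kubectl delete -f manifests/
```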
We are done with the demo of the feature on the resize example, congratulations! Explore other examples in the same repository, like dynamic provisioning or block volume.
Clean up
To clean up your resources to avoid incurring more charges, complete the following steps:
- Delete your SageMaker HyperPod cluster.
- If you created the networking stack from the SageMaker HyperPod workshop, delete the stack as well to clean up the virtual private cloud (VPC) resources and the FSx for Lustre volume.
Conclusion
The new features in Amazon SageMaker HyperPod, Customer Managed Key (CMK) support and Amazon EBS CSI driver support, enhance system security and storage capabilities. The Amazon EBS CSI driver support within SageMaker HyperPod EKS clusters enables the use of Amazon EBS volumes for flexible and dynamic storage management for large-scale AI workloads. In addition to other storage services already available with SageMaker HyperPod clusters, such as Amazon FSx or Amazon S3, you can build efficient and high-performing AI solutions. By combining Amazon EBS volumes with Customer Managed Key support, you can maintain compliance and security governance by controlling your own encryption keys. Together, these features make SageMaker HyperPod a more robust and enterprise-ready environment for training and deploying foundation models at scale, allowing organizations to meet both their security requirements and storage needs efficiently.
For more information, please see Customer managed AWS KMS key encryption for SageMaker HyperPod and Using the Amazon EBS CSI driver on SageMaker HyperPod EKS clusters.
About the authors
Mark Vinciguerra is an Associate Specialist Solutions Architect at Amazon Web Services (AWS) based in New York. He focuses on Generative AI training and inference, with the goal of helping customers architect, optimize, and scale their workloads across various AWS services. Prior to AWS, he went to Boston University and graduated with a degree in Computer Engineering. You can connect with him on LinkedIn.
Rostislav (Ross) Povelikin is a Senior Specialist Solutions Architect at AWS focusing on systems performance for distributed training and inference. Prior to this, he focused on datacenter network and software performance optimisations at NVIDIA.
Kunal Jha is a Principal Product Manager at AWS, where he focuses on building Amazon SageMaker HyperPod to enable scalable distributed training and fine-tuning of foundation models. In his spare time, Kunal enjoys skiing and exploring the Pacific Northwest. You can connect with him on LinkedIn.
Takuma Yoshitani is a Senior Software Development Engineer at AWS, where he focuses on improving the experience of the SageMaker HyperPod service. Prior to SageMaker, he has contributed to Amazon Go / Just Walk-Out tech.
Vivek Koppuru is an engineering leader on the Amazon SageMaker HyperPod team helping provide infrastructure solutions for ML training and inference. He has years of experience in AWS and compute as an engineer, working on core services like EC2 and EKS. He is passionate about building customer-focused solutions and navigating through complex technical challenges in distributed systems with the team.
Ajay Mahendru is an engineering leader at AWS, working in the SageMaker HyperPod team. Bringing nearly 15 years of software development experience, Ajay has contributed to multiple Amazon SageMaker services, including SageMaker Inference, Training, Processing, and HyperPod. With expertise in building distributed systems, he focuses on building reliable, customer-focused, and scalable solutions across teams.
Siddharth Senger is a Senior Software Development Engineer at Amazon Web Services (AWS), specifically within the SageMaker HyperPod team. Bringing nearly a decade of software development experience, Siddharth has contributed to several teams across Amazon, including Retail, Amazon Rekognition, Amazon Textract, and Amazon SageMaker. He is passionate about building reliable, scalable, and efficient distributed systems that empower customers to accelerate large-scale machine learning and AI innovation.


