Large-scale AI model training faces significant challenges with failure recovery and monitoring. Traditional training requires complete job restarts when even a single training process fails, resulting in additional downtime and increased costs. As training clusters expand, identifying and resolving critical issues like stalled GPUs and numerical instabilities typically requires complex custom monitoring code.
With Amazon SageMaker HyperPod you can accelerate AI model development across hundreds or thousands of GPUs with built-in resiliency, decreasing model training time by up to 40%. The Amazon SageMaker HyperPod training operator further enhances training resilience for Kubernetes workloads through pinpoint recovery and customizable monitoring capabilities.
In this blog post, we show you how to deploy and manage machine learning training workloads using the Amazon SageMaker HyperPod training operator, including setup instructions and a complete training example.
Amazon SageMaker HyperPod training operator
The Amazon SageMaker HyperPod training operator helps you accelerate generative AI model development by efficiently managing distributed training across large GPU clusters. The Amazon SageMaker HyperPod training operator uses built-in fault resiliency components, comes packaged as an Amazon Elastic Kubernetes Service (Amazon EKS) add-on, and deploys the necessary custom resource definitions (CRDs) to the HyperPod cluster.
Solution overview
The following diagram depicts the architecture of the Amazon SageMaker HyperPod training operator.
The HyperPod training operator follows the Kubernetes operator pattern and has the following major components:
- Custom resource definitions (CRDs): HyperPodPyTorchJob defines the job specification (for example, node count and image) and serves as the interface for customers to submit jobs. The resource is identified by apiVersion: sagemaker.amazonaws.com/v1 and kind: HyperPodPyTorchJob.
- RBAC policies: Define the actions the controller is allowed to perform, such as creating pods and managing HyperPodPyTorchJob resources.
- Job controller: Listens for job creation and fulfills requests by creating job pods and pod managers.
- Pod manager: Monitors training process health on each pod. The number of pod managers is determined by the number of pods required by the job; one pod manager currently controls several hundred pods.
- HyperPod elastic agent: Customers install the elastic agent into their training container. It orchestrates lifecycles of training workers on each container and communicates with the Amazon SageMaker HyperPod training operator. The HyperPod elastic agent is an extension of PyTorch’s ElasticAgent.
The job controller uses fault detection components such as the SageMaker HyperPod health-monitoring agent and node health check mechanisms like AWS retirement notices to update job state and repair faults. It also relies on the HyperPod elastic agent to check the status of training processes for crash and hung-job detection.
When a HyperPodPyTorchJob is submitted, the Amazon SageMaker HyperPod training operator spins up job pods along with pod manager pods that help manage the training job lifecycle. The pod managers interact with the HyperPod elastic agent so that all job pods maintain a healthy state.
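For illustration, a minimal HyperPodPyTorchJob manifest could look like the following sketch. Only the apiVersion and kind come from the CRD described above; the spec fields and launch command shown here are assumptions based on the developer guide examples and should be verified against the CRD installed on your cluster.

```yaml
# Illustrative sketch only - verify field names against the installed CRD
apiVersion: sagemaker.amazonaws.com/v1
kind: HyperPodPyTorchJob
metadata:
  name: demo-job
spec:
  nprocPerNode: "8"            # training processes per pod (field name assumed)
  replicaSpecs:
    - name: pods
      replicas: 2              # number of job pods
      template:
        spec:
          containers:
            - name: trainer
              image: <your-training-image-with-hyperpod-elastic-agent>
              # Launch command assumed; use the launcher provided by the
              # HyperPod elastic agent installed in your container
              command: ["hyperpodrun", "train.py"]
```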
Benefits of using the operator
The Amazon SageMaker HyperPod training operator can be installed as an EKS add-on on your cluster. The key benefits include:
- Centralized training process monitoring and restart – The HyperPod training operator maintains a control plane with a global view of health across all ranks. When one rank encounters an issue, it broadcasts a stop signal to all ranks to prevent other ranks from failing individually at different times due to collective communication timeout. This supports more efficient fault detection and recovery.
- Centralized efficient rank assignment – A separate HyperPod rendezvous backend allows the HyperPod training operator to assign ranks directly. This reduces initialization overhead by eliminating the need for worker-to-worker discovery.
- Unhealthy training node detection and job restart – The HyperPod training operator is fully integrated with the HyperPod EKS cluster resiliency features, helping restart jobs or training processes due to bad nodes and hardware issues in ML workloads. This reduces the need to self-manage job recovery solutions.
- Granular process recovery – Rather than restarting entire jobs when failures occur, the operator precisely targets and restarts only the affected training processes, reducing recovery times from tens of minutes to seconds. This helps keep HyperPod training operator job recovery time low even as cluster size grows.
- Hanging job detection and performance degradation detection – Based on training script log monitoring, the HyperPod training operator helps overcome problematic training scenarios including stalled training batches, non-numeric loss values, and performance degradation through simple YAML configurations. For more information, see Using the training operator to run jobs in the Amazon SageMaker AI Developer Guide.
Training operator setup
This section walks through installing the Amazon SageMaker HyperPod training operator as an Amazon EKS add-on.
Estimated Setup Time: 30-45 minutes
Prerequisites
Before getting started, verify that you have the following resources and permissions.
Required AWS resources:
- Active AWS account
- Amazon EKS cluster (version 1.28 or later)
- Amazon SageMaker HyperPod EKS cluster
- Amazon ECR repository for container images
Required IAM permissions:
- AmazonSageMakerHyperPodTrainingOperatorAccess managed policy
- EKS cluster access permissions
- ECR push/pull permissions
- eks-pod-identity-agent add-on installed on the EKS cluster
Required software:
- kubectl (version 1.28 or later), for more information see the kubectl installation documentation
- docker (version 20.10 or later), for more information see the docker installation documentation
- AWS Command Line Interface (AWS CLI) (version 2.0 or later), for more information see the AWS CLI installation documentation
- envsubst utility
- Hugging Face account with access token
Installation instructions
Before running the installation steps below, you first need to create a HyperPod cluster. If you haven't done so already, follow the instructions to create an EKS-orchestrated SageMaker HyperPod cluster to get started. Make sure to install the eks-pod-identity-agent add-on on the EKS cluster by following the Set up the Amazon EKS Pod Identity Agent instructions.
Install cert-manager
First, install the cert-manager add-on which is required for the HyperPod training operator:
- Open the Amazon EKS console
- Navigate to your EKS cluster and go to the Add-ons page
- On the Add-ons page, locate Get more add-ons and navigate to the Community add-ons section
- Find the Cert Manager add-on, select it, and choose Next
- On the add-on configuration page, proceed with default settings and choose Next
- Preview all selections for the Cert Manager add-on and choose Create
- Wait for the add-on status to change to Active before proceeding
Install the HyperPod training operator add-on
Once cert-manager is active, install the Amazon SageMaker HyperPod training operator:
- Open the Amazon SageMaker console
- Navigate to your cluster’s details page
- On the Dashboard tab, locate Amazon SageMaker HyperPod training operator and choose Install
During installation, SageMaker creates an IAM execution role with permissions similar to the AmazonSageMakerHyperPodTrainingOperatorAccess
managed policy and creates a pod identity association between your Amazon EKS cluster and the new execution role.
Verify installation
You have now successfully set up the Amazon SageMaker HyperPod training operator. You can confirm that the operator pods are running by using the following command:
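```bash
# List pods across all namespaces and filter for the training operator
# (the exact namespace can differ depending on the add-on version)
kubectl get pods -A | grep -i training-operator
```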
Your output should contain the training operator controller pod in a Running state.
Set up training job
Let’s run a PyTorch-based training example on a Llama model. We begin by checking out the following code base:
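The commands below assume the FSDP example from the aws-samples/awsome-distributed-training repository, which matches the manifest and script names referenced later in this post; substitute your own repository and path if you are following a different example.

```bash
# Clone the samples repository and change into the FSDP example directory
# (repository URL and path are assumptions; adjust to the code base you use)
git clone https://github.com/aws-samples/awsome-distributed-training.git
cd awsome-distributed-training/3.test_cases/pytorch/FSDP
```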
These scripts provide an easy way to get started with multinode FSDP training on EKS. The example is designed to be as simple as possible, requires no data preparation, and uses a container image.
Next, build the docker container image.
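A minimal sketch, assuming the Dockerfile sits in the example directory you changed into and using fsdp:latest as a placeholder tag:

```bash
# Build the training image from the example's Dockerfile
docker build -t fsdp:latest .
```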
The above command works in Linux-based environments. If you are on a Mac, use buildx to target the linux/amd64 architecture:
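```bash
# Build for linux/amd64 when working on Apple silicon (same placeholder tag)
docker buildx build --platform linux/amd64 -t fsdp:latest .
```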
Push the image to Amazon ECR:
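The following is a hedged example that creates the repository if needed, authenticates Docker to Amazon ECR, and pushes the image; replace the Region and repository name with your own values.

```bash
# Set your Region and account ID, and derive the repository URI
export AWS_REGION=<your-region>
export ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
export ECR_REPO=${ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/fsdp

# Create the repository if it does not already exist
aws ecr create-repository --repository-name fsdp --region ${AWS_REGION} || true

# Authenticate Docker to ECR, then tag and push the image
aws ecr get-login-password --region ${AWS_REGION} | \
  docker login --username AWS --password-stdin ${ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com
docker tag fsdp:latest ${ECR_REPO}:latest
docker push ${ECR_REPO}:latest
```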
Note: Pushing the image may take some time depending on your network bandwidth.
Data
For this example, we'll be using the allenai/c4 dataset. Instead of downloading the entire dataset, the create_streaming_dataloaders function will stream it from Hugging Face, so there's no data preparation required to run this training.
If you'd like to use your own dataset instead, you can do so by formatting it as a Hugging Face dataset and passing its location to the --dataset_path argument.
For the dataset, you will need a Hugging Face access token. First, create a Hugging Face account. Then generate your access token with read permissions.
We will reference this token in the next step by setting it as an environment variable.
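For example (the variable name is an assumption; use whatever variable the example's manifest template expects):

```bash
# Export your Hugging Face access token for use in the next step
export HF_TOKEN=<your-hugging-face-access-token>
```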
This example uses envsubst to generate a Kubernetes manifest file from a template file and parameters. If you don't have envsubst in your development environment, install it by following the installation instructions.
Launch Llama 3.1 8B training job
Next, we generate the Kubernetes manifest and apply it to the cluster. Let’s navigate to the FSDP source repo:
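```bash
# The Kubernetes manifests for the FSDP example are assumed to live in a
# kubernetes/ subdirectory; adjust the path to match the repository you cloned
cd kubernetes
```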
Here, we start by creating environment variables that are used in our training job. Fill out the placeholders as per your cluster size.
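The values below are illustrative only; edit the env_vars file provided by the example so that it exports values appropriate for your cluster and image.

```bash
# Illustrative placeholders - the actual variable names are defined by the
# example's env_vars file and manifest template
export IMAGE_URI=${ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/fsdp:latest
export NUM_NODES=4        # number of HyperPod nodes to train on
export GPU_PER_NODE=8     # GPUs per node for your instance type
export HF_TOKEN=<your-hugging-face-access-token>
```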
Once you have filled in env_vars, source the variables:
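```bash
# Load the variables defined in env_vars into the current shell
source env_vars
```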
You can then apply the generated YAML manifest to submit the training job:
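```bash
# Generate the manifest from its template and submit the job
# (the template file name is assumed from the generated manifest name;
# adjust it to match the files present in the repository)
envsubst < llama3_1_8b-fsdp-hpto.yaml-template > llama3_1_8b-fsdp-hpto.yaml
kubectl apply -f llama3_1_8b-fsdp-hpto.yaml
```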
You can also adjust the training parameters in the TRAINING_ARGS section of llama3_1_8b-fsdp-hpto.yaml. Additional parameters can be found under model/arguments.py. Note that we use the same directory for both --checkpoint_dir and --resume_from_checkpoint. If there are multiple checkpoints, --resume_from_checkpoint automatically selects the most recent one, so if training is interrupted for any reason, it automatically picks up from the most recent checkpoint.
Additionally, you can prepare and submit jobs compatible with the Amazon SageMaker HyperPod training operator through the recently announced HyperPod CLI and SDK capabilities. For more information on how to use them, see the development guide.
Monitor training job
To see the status of your job, use the following command:
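```bash
# Describe the job to see its status and recent events
# (replace <job-name> with the metadata.name from your manifest and add
# -n <namespace> if the job was not submitted to the default namespace)
kubectl describe hyperpodpytorchjob <job-name>
```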
Use the following command to list the jobs run using the HyperPod training operator:
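```bash
# List jobs managed by the HyperPod training operator
# (the lowercase plural resource name follows the HyperPodPyTorchJob CRD)
kubectl get hyperpodpytorchjobs
```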
Use the following command to list all the pods for the training jobs:
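```bash
# List the pods created for the training job
kubectl get pods
```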
To continuously stream the pod logs to stdout, use the following command:
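```bash
# Continuously stream logs from one of the training pods
kubectl logs -f <pod-name>
```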
Configure log monitoring
With the Amazon SageMaker HyperPod training operator, users can configure log patterns that the operator continuously monitors. The operator watches for the configured regex patterns and stops the training job if it finds a violation. The llama3_1_8b-fsdp-hpto.yaml file that we used previously contains log monitoring configurations for detecting job start hangs, hangs during training, and checkpoint creation failures, as shown below:
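The following sketch illustrates what such a configuration can look like. The field names and thresholds shown here are assumptions drawn from the developer guide; check the example manifest in the repository for the exact configuration.

```yaml
# Illustrative log monitoring configuration (verify field names against the
# HyperPodPyTorchJob CRD and the example manifest before using)
logMonitoringConfiguration:
  - name: JobStart
    logPattern: ".*Batch.*Loss:.*"           # pattern the training script is expected to emit
    expectedStartCutOffInSeconds: 120        # fault if no match shortly after job start
  - name: JobHangingDetection
    logPattern: ".*Batch.*Loss:.*"
    expectedRecurringFrequencyInSeconds: 300 # fault if no new match within 5 minutes
```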
The corresponding code in /src/train.py contains the necessary log statements.
Any time the monitored log output deviates from the expected patterns, the operator detects it as a fault and triggers a recovery process to re-execute the job, up to a user-specified maximum number of retries.
Additionally, the HyperPod training operator supports integration with Amazon SageMaker HyperPod task governance.
Integration with HyperPod Observability
SageMaker HyperPod offers a managed observability experience through the newly launched HyperPod Monitoring and Observability EKS add-on. The observability add-on automatically populates Kubeflow Training metrics in Grafana dashboards out of the box, but for HyperPod PyTorch job metrics, you need to turn on the advanced training metrics, which use the HyperPod training operator to show information about job downtime, job recovery, and faults.
To get these advanced metrics, refer to Setting up the SageMaker HyperPod observability add-on. The add-on streamlines setup by removing the need to manually configure a scraper and build dashboards.
Clean up
To avoid incurring unnecessary charges, clean up the resources created in this walkthrough.
Delete training jobs
Remove all HyperPod training jobs:
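```bash
# Delete the submitted HyperPodPyTorchJob resources
kubectl delete hyperpodpytorchjobs --all
```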
Verify jobs are deleted:
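```bash
# Confirm that no HyperPodPyTorchJob resources remain
kubectl get hyperpodpytorchjobs
```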
Remove container images
Delete the ECR repository and images:
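The following assumes the repository name used in the earlier push step; adjust it if you pushed to a different repository.

```bash
# Delete the ECR repository along with the images it contains
aws ecr delete-repository --repository-name fsdp --region ${AWS_REGION} --force
```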
Remove add-ons
Remove the Amazon SageMaker HyperPod training operator add-on:
- Open the Amazon SageMaker console
- Navigate to your cluster’s details page
- On the Add-ons tab, select the Amazon SageMaker HyperPod training operator
- Choose Remove
Remove the cert manager add-on:
- Open the Amazon EKS console
- Navigate to your EKS cluster’s Add-ons page
- Select Cert Manager and choose Remove
Additional clean up
Consider removing these resources if no longer needed:
- Any persistent volumes created during training
- CloudWatch log groups (if you want to retain logs, leave these)
- Custom IAM roles created specifically for this example
- The HyperPod cluster itself (if no longer needed).
Conclusion
As organizations continue to push the boundaries of AI model development, tools like the Amazon SageMaker HyperPod training operator can be used to maintain efficiency and reliability at scale. Amazon SageMaker HyperPod training operator offers a robust solution to common challenges in large model training. Key takeaways include:
- One-click installation through the Amazon SageMaker HyperPod cluster console user interface.
- A custom rendezvous backend eliminates initialization and worker synchronization overhead, resulting in faster job starts and recovery.
- Process level restarts maximize recovery efficiency when runtime faults occur.
- Customizable hang job detection during training.
- Comprehensive monitoring for early detection of training issues.
- Out-of-box integration with existing HyperPod resiliency features.
To get started with the Amazon SageMaker HyperPod training operator, follow the setup instructions provided in this post and explore the example training job to understand how it can benefit your specific use case. For more information and best practices, visit the Amazon SageMaker documentation.
About the authors
Arun Kumar Lokanatha is a Senior ML Solutions Architect with Amazon SageMaker AI. He holds a Master's degree from UIUC with a specialization in data science. He specializes in generative AI workloads, helping customers build and deploy LLMs using SageMaker HyperPod, SageMaker training jobs, and SageMaker distributed training. Outside of work, he enjoys running, hiking, and cooking.
Haard Mehta is a Software Engineer with Amazon’s SageMaker AI team and holds a Master’s degree in Computer Science with a specialization in big data systems from Arizona State University. He has extensive experience building managed machine learning services at scale, with a focus on hardware resiliency and enabling customers to succeed in their AI use cases without complex infrastructure management. Haard enjoys exploring new places, photography, cooking, and road trips.
Anirudh Viswanathan is a Sr Product Manager, Technical – External Services with the SageMaker AI Training team. He holds a Master's in Robotics from Carnegie Mellon University, an MBA from the Wharton School of Business, and is a named inventor on over 40 patents. He enjoys long-distance running, visiting art galleries, and Broadway shows.