This post is cowritten with Abdullahi Olaoye, Akshit Arora and Eliuth Triana Isaza at NVIDIA.
As enterprises continue to push the boundaries of generative AI, scalable and efficient model training frameworks are essential. The NVIDIA NeMo Framework provides a robust, end-to-end solution for developing, customizing, and deploying large-scale AI models, while Amazon SageMaker HyperPod delivers the distributed infrastructure needed to handle multi-GPU, multi-node workloads seamlessly.
In this blog post, we explore how to integrate NeMo 2.0 with SageMaker HyperPod to enable efficient training of large language models (LLMs). We cover the setup process and provide a step-by-step guide to running a NeMo job on a SageMaker HyperPod cluster.
NVIDIA NeMo Framework Overview
The NVIDIA NeMo Framework is an end-to-end solution for developing cutting-edge generative AI models such as LLMs, vision language models (VLMs), video and speech models, and others.
At its core, NeMo Framework provides model builders with:
Comprehensive development tools: A complete ecosystem of tools, scripts, and proven recipes that guide users through every phase of the LLM lifecycle, from initial data preparation to final deployment.
Advanced customization: Flexible customization options that teams can use to tailor models to their specific use cases while maintaining peak performance.
Optimized infrastructure: Sophisticated multi-GPU and multi-node configurations that maximize computational efficiency for both language and image applications.
Enterprise-grade features with built-in capabilities including:
Advanced parallelism techniques
Memory optimization strategies
Distributed checkpointing
Streamlined deployment pipelines
By consolidating these powerful features into a unified framework, NeMo significantly reduces the complexity and cost associated with generative AI development. NeMo Framework 2.0 is an IDE-independent, Python-based framework that integrates flexibly into each developer’s workflow and provides capabilities such as code completion, type checking, programmatic extensions, and configuration customization. The NeMo Framework includes NeMo-Run, a library designed to streamline the configuration, execution, and management of machine learning experiments across various computing environments.
The end-to-end NeMo Framework includes the following key features that streamline and accelerate AI development:
Data curation: NeMo Curator is a Python library that includes a suite of modules for data-mining and synthetic data generation. They are scalable and optimized for GPUs, making them ideal for curating natural language data to train or fine-tune LLMs. With NeMo Curator, you can efficiently extract high-quality text from extensive raw web data sources.
Training and customization: NeMo Framework provides tools for efficient training and customization of LLMs and multimodal models. It includes default configurations for compute cluster setup, data downloading, and automatic model hyperparameter tuning, which can be adjusted to train on new datasets and models. In addition to pre-training, NeMo supports both supervised fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT) techniques such as LoRA, P-tuning, and more.
Alignment: NeMo Aligner is a scalable toolkit for efficient model alignment. The toolkit supports state-of-the-art model alignment algorithms such as SteerLM, DPO, reinforcement learning from human feedback (RLHF), and more. By using these algorithms, you can align language models to be safer, less harmful, and more helpful.
Solution overview
In this post, we show you how to efficiently train large-scale generative AI models with NVIDIA NeMo Framework 2.0 using SageMaker HyperPod, a managed distributed training service designed for high-performance workloads. This solution integrates NeMo Framework 2.0 with the scalable infrastructure of SageMaker HyperPod, creating seamless orchestration of multi-node, multi-GPU clusters.
The key steps to deploying this solution include:
Setting up SageMaker HyperPod prerequisites: Configuring networking, storage, and permissions management (AWS Identity and Access Management (IAM) roles).
Launching the SageMaker HyperPod cluster: Using lifecycle scripts and a predefined cluster configuration to deploy compute resources.
Configuring the environment: Setting up NeMo Framework and installing the required dependencies.
Building a custom container: Creating a Docker image that packages NeMo Framework and installs the required AWS networking dependencies.
Running NeMo model training: Using NeMo-Run with a Slurm-based execution setup to train an example LLaMA (180M) model efficiently.
Architecture diagram
The preceding diagram shows the architecture of the solution, centered on an Amazon SageMaker HyperPod cluster.
Prerequisites
Before running the training job, you deploy a SageMaker HyperPod cluster. To deploy the cluster, you first need to create some prerequisite resources.
Note that there is a cost associated with running a SageMaker HyperPod cluster; see Amazon SageMaker AI Pricing (HyperPod pricing under On-demand pricing) for more information.
The following prerequisite steps are adapted from the Amazon SageMaker HyperPod workshop, which you can visit for additional information.
Use the following steps to deploy the prerequisite resources.
Sign in to the AWS Management Console using the AWS account you want to deploy the SageMaker HyperPod cluster in. You will create a VPC, subnets, an Amazon FSx for Lustre volume, an Amazon Simple Storage Service (Amazon S3) bucket, and an IAM role as prerequisites, so make sure that your IAM role or user for console access has permissions to create these resources.
Use the CloudFormation template link to open the AWS CloudFormation console and launch the solution template.
Template parameters:
Change the Availability Zone to match the AWS Region where you’re deploying the template. See Availability Zone IDs for the AZ ID for your Region.
All other parameters can be left as default or changed as needed for your use case.
Select the acknowledgement box in the Capabilities section and create the stack.
It takes about 10 minutes for the CloudFormation stack creation to complete. The following figure shows the deployment timeline of the CloudFormation stack deployment for the prerequisite infrastructure components.
Launch the training job
With the prerequisite infrastructure deployed in your AWS account, you next deploy the SageMaker HyperPod cluster that you’ll use for the model training example. For the model training job, you will use the NeMo Framework to launch training jobs efficiently.
Step 1: Set up a SageMaker HyperPod cluster
After the prerequisite resources are successfully deployed, create a SageMaker HyperPod cluster.
The deployment steps are adapted from the SageMaker HyperPod workshop, which you can review for additional information.
Install and configure the AWS Command Line Interface (AWS CLI). If you already have it installed, verify that the version is at least 2.17.1 by running the following command:
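```bash
# Print the installed AWS CLI version (should be 2.17.1 or later)
aws --version
```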
Configure the environment variables using the outputs from the CloudFormation stack deployed earlier.
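A minimal sketch of this step follows; the stack name and the output key name are assumptions, so substitute the values from your own CloudFormation stack.

```bash
export AWS_REGION=us-west-2            # Region where the prerequisite stack was deployed
export STACK_NAME=hyperpod-prereqs     # assumption: the name you gave the CloudFormation stack

# Pull the S3 bucket name created by the stack into an environment variable
# (the output key name is an assumption; check the Outputs tab of your stack)
export BUCKET_NAME=$(aws cloudformation describe-stacks \
  --stack-name "$STACK_NAME" \
  --query 'Stacks[0].Outputs[?OutputKey==`AmazonS3BucketName`].OutputValue' \
  --output text)
```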
Download the lifecycle scripts and upload them to the S3 bucket created in the prerequisites. SageMaker HyperPod uses lifecycle scripts to bootstrap a cluster. Examples of actions the lifecycle scripts manage include setting up Slurm and mounting the FSx for Lustre file system.
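A hedged example of this step, assuming you use the base lifecycle scripts from the aws-samples/awsome-distributed-training repository and the BUCKET_NAME variable set earlier:

```bash
# Fetch the base lifecycle scripts (the repository path is an assumption; use the
# copy referenced by the workshop if yours differs)
git clone --depth=1 https://github.com/aws-samples/awsome-distributed-training.git

# Stage the scripts in the prerequisite S3 bucket so HyperPod can run them at bootstrap
aws s3 cp --recursive \
  awsome-distributed-training/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/ \
  s3://${BUCKET_NAME}/LifecycleScripts/base-config/
```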
Create a cluster config file for setting up the cluster. The following is an example of creating a cluster config from a template. The example cluster config is for g5.48xlarge compute nodes accelerated by 8 x NVIDIA A10G GPUs. See Create Cluster for cluster config examples for additional Amazon Elastic Compute Cloud (Amazon EC2) instance types. A cluster config file contains the following information:
Cluster name
Three instance groups:
Login-group: Acts as the entry point for users and administrators. Typically used for managing jobs, monitoring, and debugging.
Controller-machine: The head node for the HyperPod Slurm cluster. It manages the overall orchestration of the distributed training process and handles job scheduling and communication within the cluster.
Worker-group: The group of nodes that executes the actual model training workload.
VPC configuration
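The following is a minimal sketch of such a cluster-config.json. The instance types for the login and controller groups, the lifecycle script name, and the placeholder values are assumptions to adapt to your environment.

```json
{
  "ClusterName": "ml-cluster",
  "InstanceGroups": [
    {
      "InstanceGroupName": "login-group",
      "InstanceType": "ml.m5.4xlarge",
      "InstanceCount": 1,
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://<bucket-name>/LifecycleScripts/base-config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "<execution-role-arn>",
      "ThreadsPerCore": 2
    },
    {
      "InstanceGroupName": "controller-machine",
      "InstanceType": "ml.m5.12xlarge",
      "InstanceCount": 1,
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://<bucket-name>/LifecycleScripts/base-config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "<execution-role-arn>",
      "ThreadsPerCore": 2
    },
    {
      "InstanceGroupName": "worker-group-1",
      "InstanceType": "ml.g5.48xlarge",
      "InstanceCount": 2,
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://<bucket-name>/LifecycleScripts/base-config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "<execution-role-arn>",
      "ThreadsPerCore": 1
    }
  ],
  "VpcConfig": {
    "SecurityGroupIds": ["<security-group-id>"],
    "Subnets": ["<subnet-id>"]
  }
}
```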
Create a config file based on the following example with the cluster provisioning parameters and upload it to the S3 bucket.
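A hedged sketch of the provisioning parameters file and the upload step; the group and partition names are assumptions that must match the cluster config above, and the FSx values come from the prerequisite stack.

```json
{
  "version": "1.0.0",
  "workload_manager": "slurm",
  "controller_group": "controller-machine",
  "login_group": "login-group",
  "worker_groups": [
    {
      "instance_group_name": "worker-group-1",
      "partition_name": "dev"
    }
  ],
  "fsx_dns_name": "<fsx-dns-name>",
  "fsx_mountname": "<fsx-mount-name>"
}
```

Upload the file next to the lifecycle scripts:

```bash
aws s3 cp provisioning_parameters.json \
  s3://${BUCKET_NAME}/LifecycleScripts/base-config/provisioning_parameters.json
```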
Create the SageMaker HyperPod cluster
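Assuming the cluster config was saved as cluster-config.json, the cluster can be created with:

```bash
aws sagemaker create-cluster \
  --cli-input-json file://cluster-config.json \
  --region "$AWS_REGION"
```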
Use the following code or the console to check the status of the cluster. The status should be Creating. Wait for the cluster status to be InService before proceeding.
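For example (the cluster name is an assumption that must match your cluster config):

```bash
aws sagemaker list-clusters --output table

# Or query a single cluster's status directly
aws sagemaker describe-cluster --cluster-name ml-cluster --query ClusterStatus
```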
The following screenshot shows the results of the --output table command showing the cluster status as Creating.
The following screenshot shows the Cluster Management page and status of the cluster in the Amazon SageMaker AI console.
The following screenshot shows the results of the --output table command showing the cluster status as InService.
Step 2: SSH into the cluster
After the cluster is ready (that is, has a status of InService), you can connect to it using AWS Systems Manager Session Manager and an SSH helper script. See SSH into Cluster for more information.
Install the AWS SSM Session Manager Plugin.
Create a local key pair that can be added to the cluster by the helper script for easier SSH access and run the following SSH helper script.
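A minimal sketch of this step, assuming the easy-ssh.sh helper script from the SageMaker HyperPod workshop materials (the script name, instance group, and cluster name are assumptions):

```bash
# Create a local key pair that the helper script can install on the cluster
ssh-keygen -t ed25519 -f ~/.ssh/hyperpod -N ""

# Open an SSM-backed SSH session to the controller (head) node
./easy-ssh.sh -c controller-machine ml-cluster
```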
Step 3: Interact with the cluster and clone the repository
After connecting to the cluster, you can validate that the cluster is properly configured by running several commands. See Get to know your Cluster for more information.
View the existing partitions and the nodes in each partition.
List the jobs that are in the queue or running.
SSH to the compute nodes. Example commands for these steps are shown after this list.
Clone the code sample GitHub repository onto the cluster controller node (head node).
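The commands referenced in the preceding steps are standard Slurm and shell commands, for example:

```bash
sinfo                     # view partitions and the nodes in each partition
squeue                    # list jobs that are queued or running
ssh <worker-node-name>    # connect to a specific compute node
```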
Now, you’re ready to run your NeMo Framework jobs on the SageMaker HyperPod cluster.
Step 4: Build the job container
The next step is to build the job container. By using a container, you can create a consistent, portable, and reproducible environment, helping to ensure that all dependencies, configurations, and optimizations remain intact. This is particularly important for high-performance computing (HPC) and AI workloads, where variations in the software stack can impact performance and compatibility.
To have a fully functioning and optimized environment, you need to add AWS-specific networking dependencies (EFA, OFI plugin, update NCCL, and NCCL tests) to the NeMo Framework container from NVIDIA GPU Cloud (NGC) Catalog. After building the Docker image, you will use Enroot to create a squash file from it. A squash file is a compressed, read-only file system that encapsulates the container image in a lightweight format. It helps reduce storage space, speeds up loading times, and improves efficiency when deploying the container across multiple nodes in a cluster. By converting the Docker image into a squash file, you can achieve a more optimized and performant execution environment, especially in distributed training scenarios.
Make sure that you have a registered account with NVIDIA and can access NGC. Retrieve the NGC API key following the instructions from NVIDIA. Use the following command to configure NGC. When prompted, use $oauthtoken for the login username and the API key from NGC for the password.
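One common way to configure this is to authenticate Docker against the NGC registry; substitute your own API key for the placeholder.

```bash
# The username is literally $oauthtoken; the password is your NGC API key
echo "<NGC_API_KEY>" | docker login nvcr.io --username '$oauthtoken' --password-stdin
```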
You can use the following command to build the Docker file and create a SquashFS image.
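A hedged sketch of that step; the image tag, Dockerfile name, and output path on FSx for Lustre are assumptions.

```bash
# Build the AWS-optimized NeMo image from the Dockerfile in the repository
docker build -t nemo-aws:latest -f Dockerfile .

# Convert the Docker image into a squash file with Enroot for use by Slurm jobs
enroot import -o /fsx/nemo-aws.sqsh dockerd://nemo-aws:latest
```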
Step 5: Set up NeMo-Run and other dependencies on the head node
Before continuing:
NeMo-Run requires Python 3.10. Verify that it is installed on the head node before proceeding.
You can use the following steps to set up the NeMo-Run dependencies in a virtual environment. The steps create and activate a virtual environment and then execute the venv.sh script to install the dependencies, which include the NeMo toolkit, NeMo-Run, PyTorch, Megatron-LM, and others.
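A minimal sketch of these steps; the virtual environment name is an assumption.

```bash
# Create and activate a Python 3.10 virtual environment on the head node
python3.10 -m venv nemo-venv
source nemo-venv/bin/activate

# Install NeMo, NeMo-Run, PyTorch, Megatron-LM, and the other dependencies
bash venv.sh
```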
To prepare for the pre-training of the LLaMA model in an offline mode and to help ensure consistent tokenization, use the widely adopted GPT-2 vocabulary and merges files. This approach helps avoid potential issues related to downloading tokenizer files during training:
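A hedged example, fetching the files from the Hugging Face gpt2 repository; the destination directory on the shared file system is an assumption.

```bash
mkdir -p /fsx/gpt2-tokenizer
wget -O /fsx/gpt2-tokenizer/vocab.json https://huggingface.co/gpt2/resolve/main/vocab.json
wget -O /fsx/gpt2-tokenizer/merges.txt https://huggingface.co/gpt2/resolve/main/merges.txt
```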
Step 6: Launch the pretraining job using NeMo-Run
Run the training script to start the LLaMA pretraining job. The training script run.py defines the configuration for a LLaMA 180M parameter model, a Slurm executor, and the experiment, and then launches the experiment.
The following function defines the model configuration.
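A hedged sketch of such a model configuration function, using the NeMo 2.0 llm collection and shrinking a Llama 3 config down to roughly 180M parameters; the exact layer and hidden sizes are assumptions.

```python
import nemo_run as run
from nemo.collections import llm


def small_llama_config() -> run.Config[llm.Llama3Config8B]:
    """Return a Llama 3 style config scaled down to roughly 180M parameters."""
    return run.Config(
        llm.Llama3Config8B,
        num_layers=12,            # assumption: sizes chosen to land near 180M parameters
        hidden_size=768,
        ffn_hidden_size=2688,
        num_attention_heads=12,
        seq_length=2048,
    )
```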
The following function defines the Slurm executor.
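A hedged sketch of a Slurm executor for the HyperPod cluster; the partition name, container image path, mounts, and environment variables are assumptions that must match your cluster config and squash file.

```python
import nemo_run as run


def slurm_executor(nodes: int = 2, devices: int = 8) -> run.SlurmExecutor:
    """Return a SlurmExecutor that runs each task inside the NeMo squash-file container."""
    executor = run.SlurmExecutor(
        account="",                            # accounting is typically unused on HyperPod Slurm
        partition="dev",                       # assumption: partition from provisioning_parameters.json
        nodes=nodes,
        ntasks_per_node=devices,
        gpus_per_node=devices,
        time="01:00:00",
        container_image="/fsx/nemo-aws.sqsh",  # assumption: squash file location on FSx for Lustre
        container_mounts=["/fsx:/fsx"],
        tunnel=run.LocalTunnel(),              # the job is submitted from the head node itself
    )
    # EFA-related environment variables for multi-node communication
    executor.env_vars = {
        "FI_PROVIDER": "efa",
        "NCCL_DEBUG": "INFO",
    }
    return executor
```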
The following function runs the experiment.
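And a hedged sketch of the experiment launch, wiring the model config and executor sketched above into a NeMo-Run experiment; the recipe helper, experiment name, and results directory are assumptions.

```python
import nemo_run as run
from nemo.collections import llm


def main():
    # Start from the standard Llama 3 8B pretraining recipe and swap in the small model config
    recipe = llm.llama3_8b.pretrain_recipe(
        name="llama_180m_pretraining",
        dir="/fsx/experiments",        # assumption: results directory on the shared FSx volume
        num_nodes=2,
        num_gpus_per_node=8,
    )
    recipe.model.config = small_llama_config()

    executor = slurm_executor(nodes=2, devices=8)

    # Define the experiment and launch it; detach so the job keeps running after submission
    with run.Experiment("llama-180m-pretraining") as exp:
        exp.add(recipe, executor=executor, name="pretrain")
        exp.run(sequential=True, detach=True)


if __name__ == "__main__":
    main()
```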