How to run Qwen 2.5 on AWS AI chips using Hugging Face libraries

The Qwen 2.5 multilingual large language models (LLMs) are a collection of pre-trained and instruction-tuned generative models in 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B sizes (text in/text out and code out). The Qwen 2.5 fine-tuned text-only models are optimized for multilingual dialogue use cases and outperform both previous generations of Qwen models and many publicly available chat models on common industry benchmarks.

At its core, Qwen 2.5 is an auto-regressive language model that uses an optimized transformer architecture. The Qwen2.5 collection supports more than 29 languages and offers enhanced role-playing abilities and condition-setting for chatbots.

In this post, we outline how to get started with deploying the Qwen 2.5 family of models on AWS Inferentia instances using Amazon Elastic Compute Cloud (Amazon EC2) and Amazon SageMaker, with the Hugging Face Text Generation Inference (TGI) container and the Hugging Face Optimum Neuron library. The Qwen2.5 Coder and Math variants are also supported.

Preparation

Hugging Face provides two tools that are frequently used with AWS Inferentia and AWS Trainium: Text Generation Inference (TGI) containers, which provide support for deploying and serving LLMs, and the Optimum Neuron library, which serves as an interface between the Transformers library and the Inferentia and Trainium accelerators.

The first time a model is run on Inferentia or Trainium, it must be compiled into a version that performs optimally on those chips. The Optimum Neuron library from Hugging Face, along with the Optimum Neuron cache, will transparently supply a compiled model when one is available. If you're using a different model with the Qwen2.5 architecture, you might need to compile the model yourself before deploying. For more information, see Compiling a model for Inferentia or Trainium.
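
If you do need to compile a model yourself, the Optimum Neuron CLI can export it ahead of time. The following is a minimal sketch; the batch size, sequence length, core count, and output directory are example values, not requirements:

# Example values only; adjust for your instance type and workload.
optimum-cli export neuron \
  --model Qwen/Qwen2.5-7B-Instruct \
  --batch_size 4 \
  --sequence_length 4096 \
  --num_cores 2 \
  --auto_cast_type bf16 \
  qwen2.5-7b-neuron/

You can then point MODEL_ID at the exported directory, as shown in the commented-out MODEL_ID line of the .env file below.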

You can deploy TGI as a Docker container on an Inferentia or Trainium EC2 instance or on Amazon SageMaker.

Option 1: Deploy TGI on Amazon EC2 Inf2

In this example, you will deploy Qwen2.5-7B-Instruct on an inf2.xlarge instance. (See this article for detailed instructions on how to deploy an instance using the Hugging Face DLAMI.)

For this option, you SSH into the instance and create a .env file (where you’ll define your constants and specify where your model is cached) and a file named docker-compose.yaml (where you’ll define all of the environment parameters that you’ll need to deploy your model for inference). You can copy the following files for this use case.

Create a .env file with the following content:

MODEL_ID='Qwen/Qwen2.5-7B-Instruct'
#MODEL_ID='/data/exportedmodel'
HF_AUTO_CAST_TYPE='bf16' # indicates the auto cast type that was used to compile the model
MAX_BATCH_SIZE=4
MAX_INPUT_TOKENS=4000
MAX_TOTAL_TOKENS=4096

Create a file named docker-compose.yaml with the following content:

version: '3.7'

services:
  tgi-1:
    image: ghcr.io/huggingface/neuronx-tgi:latest
    ports:
      - "8081:8081"
    environment:
      - PORT=8081
      - MODEL_ID=${MODEL_ID}
      - HF_AUTO_CAST_TYPE=${HF_AUTO_CAST_TYPE}
      - HF_NUM_CORES=2
      - MAX_BATCH_SIZE=${MAX_BATCH_SIZE}
      - MAX_INPUT_TOKENS=${MAX_INPUT_TOKENS}
      - MAX_TOTAL_TOKENS=${MAX_TOTAL_TOKENS}
      - MAX_CONCURRENT_REQUESTS=512
      #- HF_TOKEN=${HF_TOKEN} # only needed for gated models
    volumes:
      - $PWD:/data # can be removed if you aren't loading locally
    devices:
      - "/dev/neuron0"

Use docker compose to deploy the model:

docker compose -f docker-compose.yaml --env-file .env up
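
Compilation and model loading can take several minutes, so the server might not respond right away. Assuming the standard TGI health route, you can check readiness by polling until the server returns a 200 status:

# Returns HTTP 200 once the model is loaded and ready to serve requests.
curl -i 127.0.0.1:8081/health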

To confirm that the model deployed correctly, send a test prompt to the model:

curl 127.0.0.1:8081/generate \
    -X POST \
    -d '{
        "inputs":"Tell me about AWS.",
        "parameters":{
            "max_new_tokens":60
        }
    }' \
    -H 'Content-Type: application/json'

To confirm that the model can respond in multiple languages, try sending a prompt in Chinese:

# "Tell me how to open an AWS account."
curl 127.0.0.1:8081/generate \
    -X POST \
    -d '{
        "inputs":"告诉我如何开设 AWS 账户。",
        "parameters":{
            "max_new_tokens":60
        }
    }' \
    -H 'Content-Type: application/json'
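
You can also call the endpoint from application code. The following Python sketch uses the requests library against the same local server; the URL and parameters mirror the curl examples above, and it assumes the standard TGI /generate response, which contains a generated_text field:

import requests

# Query the locally deployed TGI server; parameters mirror the curl examples above.
response = requests.post(
    "http://127.0.0.1:8081/generate",
    headers={"Content-Type": "application/json"},
    json={
        "inputs": "Tell me about AWS.",
        "parameters": {"max_new_tokens": 60},
    },
)
response.raise_for_status()
print(response.json()["generated_text"])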

Option 2: Deploy TGI on SageMaker

You can also use the Hugging Face Optimum Neuron library to quickly deploy models directly from Amazon SageMaker, following the instructions on the Hugging Face Model Hub.

From the Qwen 2.5 model card on the Hugging Face Model Hub, choose Deploy, then SageMaker, and finally AWS Inferentia & Trainium.

Copy the example code into a SageMaker notebook, then choose Run. The copied code will look like the following:

import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

# Hub Model configuration. https://huggingface.co/models
hub = {
    "HF_MODEL_ID": "Qwen/Qwen2.5-7B-Instruct",
    "HF_NUM_CORES": "2",
    "HF_AUTO_CAST_TYPE": "bf16",
    "MAX_BATCH_SIZE": "8",
    "MAX_INPUT_TOKENS": "3686",
    "MAX_TOTAL_TOKENS": "4096",
}

region = boto3.Session().region_name
image_uri = f"763104351884.dkr.ecr.{region}.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.2-optimum0.0.27-neuronx-py310-ubuntu22.04"

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=image_uri,
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.xlarge",
    container_startup_health_check_timeout=1800,
    volume_size=512,
)

# send request
predictor.predict(
    {
        "inputs": "What is the capital of France?",
        "parameters": {
            "do_sample": True,
            "max_new_tokens": 128,
            "temperature": 0.7,
            "top_k": 50,
            "top_p": 0.95,
        }
    }
)
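
The predict call returns the generated text. As a minimal sketch for reading the response, assuming the container returns the usual TGI format (a list of dictionaries with a generated_text field), you can print the reply as shown below; inspect the raw response if your container version returns a different shape:

# Assumption: the response is a list of dicts with a "generated_text" key.
result = predictor.predict({"inputs": "What is the capital of France?"})
print(result[0]["generated_text"])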

Clean Up

Make sure that you terminate your EC2 instances and delete your SageMaker endpoints to avoid ongoing costs.

Terminate EC2 instances through the AWS Management Console.
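
If you prefer the AWS CLI, you can terminate the instance directly; the instance ID below is a placeholder:

# Replace the placeholder with your actual instance ID.
aws ec2 terminate-instances --instance-ids i-1234567890abcdef0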

Delete a SageMaker endpoint through the console or with the following commands:

predictor.delete_model()
predictor.delete_endpoint(delete_endpoint_config=True)

Conclusion

AWS Trainium and AWS Inferentia deliver high performance and low cost for deploying Qwen2.5 models. We’re excited to see how you will use these powerful models and our purpose-built AI infrastructure to build differentiated AI applications. To learn more about how to get started with AWS AI chips, see the AWS Neuron documentation.

About the Authors

Jim Burtoft is a Senior Startup Solutions Architect at AWS and works directly with startups as well as the team at Hugging Face. Jim is a CISSP, part of the AWS AI/ML Technical Field Community, part of the Neuron Data Science community, and works with the open source community to enable the use of Inferentia and Trainium. Jim holds a bachelor’s degree in mathematics from Carnegie Mellon University and a master’s degree in economics from the University of Virginia.

Miriam Lebowitz is a Solutions Architect focused on empowering early-stage startups at AWS. She leverages her experience with AI/ML to guide companies in selecting and implementing the right technologies for their business objectives, setting them up for scalable growth and innovation in the competitive startup world.

Rhia Soni is a Startup Solutions Architect at AWS. Rhia specializes in working with early-stage startups and helps customers adopt Inferentia and Trainium. Rhia is also part of the AWS Analytics Technical Field Community and is a subject matter expert in Generative BI. Rhia holds a bachelor’s degree in Information Science from the University of Maryland.

Paul Aiuto is a Senior Solution Architect Manager focusing on Startups at AWS. Paul created a team of AWS Startup Solutions Architects that focuses on the adoption of Inferentia and Trainium. Paul holds a bachelor’s degree in Computer Science from Siena College and has multiple Cyber Security certifications.
