Llama 3.2 models from Meta are now available in Amazon SageMaker JumpStart

Llama 3.2 models from Meta are now available in Amazon SageMaker JumpStart

Today, we are excited to announce the availability of Llama 3.2 models in Amazon SageMaker JumpStart. Llama 3.2 offers multi-modal vision and lightweight models representing Meta’s latest advancement in large language models (LLMs), providing enhanced capabilities and broader applicability across various use cases. With a focus on responsible innovation and system-level safety, these new models demonstrate state-of-the-art performance on a wide range of industry benchmarks and introduce features that help you build a new generation of AI experiences. SageMaker JumpStart is a machine learning (ML) hub that provides access to algorithms, models, and ML solutions so you can quickly get started with ML.

In this post, we show how you can discover and deploy the Llama 3.2 11B Vision model using SageMaker JumpStart. We also share the supported instance types and context for all the Llama 3.2 models available in SageMaker JumpStart. Although not highlighted in this blog, you can also use the lightweight models along with fine-tuning using SageMaker JumpStart.

Llama 3.2 models are available in SageMaker JumpStart initially in the US East (Ohio) AWS Region. Please note that Meta has restrictions on your usage of the multi-modal models if you are located in the European Union. See Meta’s community license agreement for more details.

Llama 3.2 overview

Llama 3.2 represents Meta’s latest advancement in LLMs. Llama 3.2 models are offered in various sizes, from small and medium-sized multi-modal models. The larger Llama 3.2 models come in two parameter sizes—11B and 90B—with 128,000 context length, and are capable of sophisticated reasoning tasks including multi-modal support for high resolution images. The lightweight text-only models come in two parameter sizes—1B and 3B—with 128,000 context length, and are suitable for edge devices. Additionally, there is a new safeguard Llama Guard 3 11B Vision parameter model, which is designed to support responsible innovation and system-level safety.

Llama 3.2 is the first Llama model to support vision tasks, with a new model architecture that integrates image encoder representations into the language model. With a focus on responsible innovation and system-level safety, Llama 3.2 models help you build and deploy cutting-edge generative AI models to ignite new innovations like image reasoning and are also more accessible for on-edge applications. The new models are also designed to be more efficient for AI workloads, with reduced latency and improved performance, making them suitable for a wide range of applications.

SageMaker JumpStart overview

SageMaker JumpStart offers access to a broad selection of publicly available foundation models (FMs). These pre-trained models serve as powerful starting points that can be deeply customized to address specific use cases. You can now use state-of-the-art model architectures, such as language models, computer vision models, and more, without having to build them from scratch.

With SageMaker JumpStart, you can deploy models in a secure environment. The models can be provisioned on dedicated SageMaker Inference instances, including AWS Trainium and AWS Inferentia powered instances, and are isolated within your virtual private cloud (VPC). This enforces data security and compliance, because the models operate under your own VPC controls, rather than in a shared public environment. After deploying an FM, you can further customize and fine-tune it using the extensive capabilities of Amazon SageMaker, including SageMaker Inference for deploying models and container logs for improved observability. With SageMaker, you can streamline the entire model deployment process.

Prerequisites

To try out the Llama 3.2 models in SageMaker JumpStart, you need the following prerequisites:

An AWS account that will contain all your AWS resources.
An AWS Identity and Access Management (IAM) role to access SageMaker. To learn more about how IAM works with SageMaker, refer to Identity and Access Management for Amazon SageMaker.
Access to Amazon SageMaker Studio or a SageMaker notebook instance or an interactive development environment (IDE) such as PyCharm or Visual Studio Code. We recommend using SageMaker Studio for straightforward deployment and inference.
Access to accelerated instances (GPUs) for hosting the LLMs.

Discover Llama 3.2 models in SageMaker JumpStart

SageMaker JumpStart provides FMs through two primary interfaces: SageMaker Studio and the SageMaker Python SDK. This provides multiple options to discover and use hundreds of models for your specific use case.

SageMaker Studio is a comprehensive IDE that offers a unified, web-based interface for performing all aspects of the ML development lifecycle. From preparing data to building, training, and deploying models, SageMaker Studio provides purpose-built tools to streamline the entire process. In SageMaker Studio, you can access SageMaker JumpStart to discover and explore the extensive catalog of FMs available for deployment to inference capabilities on SageMaker Inference.

In SageMaker Studio, you can access SageMaker JumpStart by choosing JumpStart in the navigation pane or by choosing JumpStart from the Home page.

Alternatively, you can use the SageMaker Python SDK to programmatically access and use SageMaker JumpStart models. This approach allows for greater flexibility and integration with existing AI/ML workflows and pipelines. By providing multiple access points, SageMaker JumpStart helps you seamlessly incorporate pre-trained models into your AI/ML development efforts, regardless of your preferred interface or workflow.

Deploy Llama 3.2 multi-modality models for inference using SageMaker JumpStart

On the SageMaker JumpStart landing page, you can discover all public pre-trained models offered by SageMaker. You can choose the Meta model provider tab to discover all the Meta models available in SageMaker.

If you’re using SageMaker Classic Studio and don’t see the Llama 3.2 models, update your SageMaker Studio version by shutting down and restarting. For more information about version updates, refer to Shut down and Update Studio Classic Apps.

You can choose the model card to view details about the model such as license, data used to train, and how to use. You can also find two buttons, Deploy and Open Notebook, which help you use the model.

When you choose either button, a pop-up window will show the End-User License Agreement (EULA) and acceptable use policy for you to accept.

Upon acceptance, you can proceed to the next step to use the model.

Deploy Llama 3.2 11B Vision model for inference using the Python SDK

When you choose Deploy and accept the terms, model deployment will start. Alternatively, you can deploy through the example notebook by choosing Open Notebook. The notebook provides end-to-end guidance on how to deploy the model for inference and clean up resources.

To deploy using a notebook, you start by selecting an appropriate model, specified by the model_id. You can deploy any of the selected models on SageMaker.

You can deploy a Llama 3.2 11B Vision model using SageMaker JumpStart with the following SageMaker Python SDK code:

from sagemaker.jumpstart.model import JumpStartModel
model = JumpStartModel(model_id = “meta-vlm-llama-3-2-11b-vision”)
predictor = model.deploy(accept_eula=accept_eula)

This deploys the model on SageMaker with default configurations, including default instance type and default VPC configurations. You can change these configurations by specifying non-default values in JumpStartModel. To successfully deploy the model, you must manually set accept_eula=True as a deploy method argument. After it’s deployed, you can run inference against the deployed endpoint through the SageMaker predictor:

payload = {
    “messages”: [
        {“role”: “system”, “content”: “You are a helpful assistant”},
        {“role”: “user”, “content”: “How are you doing today”},
        {“role”: “assistant”, “content”: “Good, what can i help you with today?”},
        {“role”: “user”, “content”: “Give me 5 steps to become better at tennis?”}
    ],
    “temperature”: 0.6,
    “top_p”: 0.9,
    “max_tokens”: 512,
    “logprobs”: False
}
response = predictor.predict(payload)
response_message = response[‘choices’][0][‘message’][‘content’]

Recommended instances and benchmark

The following table lists all the Llama 3.2 models available in SageMaker JumpStart along with the model_id, default instance types, and the maximum number of total tokens (sum of number of input tokens and number of generated tokens) supported for each of these models. For increased context length, you can modify the default instance type in the SageMaker JumpStart UI.

Model Name
Model ID
Default instance type
Supported instance types

Llama-3.2-1B
meta-textgeneration-llama-3-2-1b,
meta-textgenerationneuron-llama-3-2-1b
ml.g6.xlarge (125K context length),
ml.trn1.2xlarge (125K context length)
All g6/g5/p4/p5 instances;
ml.inf2.xlarge, ml.inf2.8xlarge, ml.inf2.24xlarge, ml.inf2.48xlarge, ml.trn1.2xlarge, ml.trn1.32xlarge, ml.trn1n.32xlarge

Llama-3.2-1B-Instruct
meta-textgeneration-llama-3-2-1b-instruct,
meta-textgenerationneuron-llama-3-2-1b-instruct
ml.g6.xlarge (125K context length),
ml.trn1.2xlarge (125K context length)
All g6/g5/p4/p5 instances;
ml.inf2.xlarge, ml.inf2.8xlarge, ml.inf2.24xlarge, ml.inf2.48xlarge, ml.trn1.2xlarge, ml.trn1.32xlarge, ml.trn1n.32xlarge

Llama-3.2-3B
meta-textgeneration-llama-3-2-3b,
meta-textgenerationneuron-llama-3-2-3b
ml.g6.xlarge (125K context length),
ml.trn1.2xlarge (125K context length)
All g6/g5/p4/p5 instances;
ml.inf2.xlarge, ml.inf2.8xlarge, ml.inf2.24xlarge, ml.inf2.48xlarge, ml.trn1.2xlarge, ml.trn1.32xlarge, ml.trn1n.32xlarge

Llama-3.2-3B-Instruct
meta-textgeneration-llama-3-2-3b-instruct,
meta-textgenerationneuron-llama-3-2-3b-instruct
ml.g6.xlarge (125K context length),
ml.trn1.2xlarge (125K context length)
All g6/g5/p4/p5 instances;
ml.inf2.xlarge, ml.inf2.8xlarge, ml.inf2.24xlarge, ml.inf2.48xlarge, ml.trn1.2xlarge, ml.trn1.32xlarge, ml.trn1n.32xlarge

Llama-3.2-11B-Vision
meta-vlm-llama-3-2-11b-vision
ml.p4d.24xlarge (125K context length)
p4d.24xlarge,
p4de.24xlarge,
p5.48xlarge

Llama-3.2-11B-Vision-Instruct
meta-vlm-llama-3-2-11b-vision-instruct
ml.p4d.24xlarge (125K context length)
p4d.24xlarge,
p4de.24xlarge,
p5.48xlarge

Llama-3.2-90B-Vision
meta-vlm-llama-3-2-90b-vision
ml.p5.24xlarge (125K context length)
p4d.24xlarge,
p4de.24xlarge,
p5.48xlarge

Llama-3.2-90B-Vision-Instruct
meta-vlm-llama-3-2-90b-vision-instruct
ml.p5.24xlarge (125K context length)
p4d.24xlarge,
p4de.24xlarge,
p5.48xlarge

Llama-Guard-3-11B-Vision
meta-vlm-llama-guard-3-11b-vision
ml.p4d.24xlarge
p4d.24xlarge,
p4de.24xlarge,
p5.48xlarge

Llama 3.2 models have been evaluated on over 150 benchmark datasets, demonstrating competitive performance with leading FMs.

Inference and example prompts for Llama-3.2 11B Vision

You can use Llama 3.2 11B and 90B models for text and image or vision reasoning use cases. You can perform a variety of tasks, such as image captioning, image text retrieval, visual question answering and reasoning, document visual question answering, and more. Input payload to the endpoint looks like the following code examples.

Text-only input

The following is an example of text-only input:

{
‘messages’: [
{‘role’: ‘system’, ‘content’: ‘You are a helpful assistant’},
{‘role’: ‘user’, ‘content’: ‘How are you doing today’},
{‘role’: ‘assistant’, ‘content’: ‘Good, what can i help you with today?’},
{‘role’: ‘user’, “content”: ‘Give me 5 steps to become better at tennis?’}
],
‘temperature’: 0.6,
‘top_p’: 0.9,
‘max_tokens’: 512,
‘logprobs’: False
}

This produces the following response:

Here are 5 steps to help you improve your tennis game:

**Step 1: Develop a Consistent Serve**

A strong serve is the foundation of a good tennis game. Practice your serve regularly to develop a consistent and powerful shot. Focus on proper footwork, grip, and body positioning. Experiment with different serves, such as topspin and slice, to keep your opponent guessing.

**Step 2: Improve Your Footwork and Agility**

Good footwork is essential for getting into position to hit shots and covering the court. Practice shuffling, sliding, and turning to improve your agility and reaction time. Incorporate exercises like ladder drills and cone drills into your training to enhance your foot speed and agility.

**Step 3: Focus on Your Volley and Overhead Shots**

The volley and overhead shots are critical for winning points at the net. Practice hitting volleys and overheads with both forehand and backhand grips. Focus on keeping your racquet close to the ball and using your wrist and arm to generate power.

**Step 4: Work on Your Return of Serve**

The return of serve is a crucial shot in tennis, as it sets up the point for your opponent. Practice returning serves with different types of spin, such as topspin and slice. Focus on getting your racquet head to the ball early and using your legs to generate power.

**Step 5: Analyze Your Game and Practice with a Purpose**

To improve your game, you need to identify areas for improvement and practice with a purpose. Record your matches and analyze your game to identify weaknesses and areas for improvement. Create a practice plan that targets specific areas, such as your forehand or backhand, and focus on making progress in those areas.

Remember, improvement takes time and practice. Stay committed, and with consistent effort, you’ll see improvement in your tennis game!

Single-image input

You can set up vision-based reasoning tasks with Llama 3.2 models with SageMaker JumpStart as follows:

import requests
import base64

def url_to_base64(image_url):
# Download the image
response = requests.get(image_url)
if response.status_code != 200:
return None

# Encode the image content to base64
image_base64 = base64.b64encode(response.content).decode(‘utf-8’)
return image_base64

Let’s load an image from the open source MATH-Vision dataset:

url = “https://raw.githubusercontent.com/mathvision-cuhk/MATH-V/refs/heads/main/images/13.jpg”
image_data = url_to_base64(url)

We can structure the message object with our base64 image data:

{
“messages”: [
{
“role”: “user”,
“content”: [
{
“type”: “text”,
“text”: “Which of these figures differs from the rest four?”
},
{
“type”: “image_url”,
“image_url”: {
“url”: f”””data:image/jpeg;base64,{image_data}”””
}
}
]
}
],
“temperature”: 0.4,
“top_p”: 0.95,
“max_tokens”: 512,
“logprobs”: False
}

This produces the following response:

The figure labeled E is the one that differs from the rest.

Multi-image input

The following code is an example of multi-image input:

image_data1 = url_to_base64(“https://d2908q01vomqb2.cloudfront.net/887309d048beef83ad3eabf2a79a64a389ab1c9f/2023/04/27/DBBLOG-2448-SingleAZInstance.png”)
image_data2 = url_to_base64(“https://d2908q01vomqb2.cloudfront.net/887309d048beef83ad3eabf2a79a64a389ab1c9f/2023/04/27/DBBLOG-2448-MultiAZInstance.png”)
payload = {
“messages”: [
{
“role”: “user”,
“content”: [
{
“type”: “text”,
“text”: “There are two images. Describe each one and why they are related”
},
{
“type”: “image_url”,
“image_url”: {
“url”: f”””data:image/jpeg;base64,{image_data1}”””
}
},
{
“type”: “image_url”,
“image_url”: {
“url”: f”””data:image/jpeg;base64,{image_data2}”””
}
}
]
}
],
“temperature”: 0.6,
“top_p”: 0.9,
“max_tokens”: 512,
“logprobs”: False
}

This produces the following response:

The two images depict different architectures for a distributed database system. The main difference between them is the location of the database instances and how they are connected.

**Image 1: Centralized Architecture**

In this image, there is a single database instance (DB1) that serves as the central repository for all data. This instance is connected to multiple application servers (App1, App2, App3) through a network. The application servers interact with the database instance to read and write data.

**Image 2: Distributed Architecture**

In this image, there are multiple database instances (DB1, DB2, DB3) that are distributed across different locations. Each database instance is connected to a local application server (App1, App2, App3) through a network. The application servers interact with their local database instances to read and write data.

The key differences between the two images are:

* **Scalability**: The distributed architecture (Image 2) allows for easier scalability, as new database instances can be added or removed as needed, without affecting the overall system.
* **Availability**: The distributed architecture provides higher availability, as if one database instance goes down, the other instances can still serve requests.
* **Performance**: The centralized architecture (Image 1) may experience performance issues if the central database instance becomes a bottleneck, while the distributed architecture can distribute the load across multiple instances.
* **Data consistency**: The centralized architecture may have easier data consistency management, as all data is stored in a single location, while the distributed architecture requires more complex mechanisms to ensure data consistency across multiple instances.

In summary, the centralized architecture is suitable for small to medium-sized applications with low traffic, while the distributed architecture is more suitable for large-scale applications with high traffic and scalability requirements.

Clean up

To avoid incurring unnecessary costs, when you’re done, delete the SageMaker endpoints using the following code snippets:

predictor.delete_model()
predictor.delete_endpoint()

Alternatively, to use the SageMaker console, complete the following steps:

On the SageMaker console, under Inference in the navigation pane, choose Endpoints.
Search for the embedding and text generation endpoints.
On the endpoint details page, choose Delete.
Choose Delete again to confirm.

Conclusion

In this post, we explored how SageMaker JumpStart empowers data scientists and ML engineers to discover, access, and deploy a wide range of pre-trained FMs for inference, including Meta’s most advanced and capable models to date. Get started with SageMaker JumpStart and Llama 3.2 models today. For more information about SageMaker JumpStart, see Train, deploy, and evaluate pretrained models with SageMaker JumpStart and Getting started with Amazon SageMaker JumpStart.

About the Authors

Supriya Puragundla is a Senior Solutions Architect at AWS
Armando Diaz is a Solutions Architect at AWS
Sharon Yu is a Software Development Engineer at AWS
Siddharth Venkatesan is a Software Development Engineer at AWS
Tony Lian is a Software Engineer at AWS
Evan Kravitz is a Software Development Engineer at AWS
Jonathan Guinegagne is a Senior Software Engineer at AWS
Tyler Osterberg is a Software Engineer at AWS
Sindhu Vahini Somasundaram is a Software Development Engineer at AWS
Hemant Singh is an Applied Scientist at AWS
Xin Huang is a Senior Applied Scientist at AWS
Adriana Simmons is a Senior Product Marketing Manager at AWS
June Won is a Senior Product Manager at AWS
Karl Albertsen is a Head of ML Algorithm and JumpStart at AWS

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top