DeepSeek-R1 is a large language model (LLM) developed by DeepSeek AI that uses reinforcement learning (RL) to enhance reasoning capabilities through a multi-stage training process on top of the DeepSeek-V3-Base foundation model. A key distinguishing feature is its RL step, which refines the model’s responses beyond standard pre-training and fine-tuning, allowing it to adapt more effectively to user feedback and objectives and ultimately improving both relevance and clarity. In addition, DeepSeek-R1 employs a chain-of-thought (CoT) approach, breaking down complex queries and reasoning through them step by step. This guided reasoning process helps the model produce more accurate, transparent, and detailed answers. By combining RL-based fine-tuning with CoT capabilities, the model generates structured responses with a focus on interpretability and user interaction. With its wide-ranging capabilities, DeepSeek-R1 has captured the industry’s attention as a versatile text-generation model that can be integrated into various workflows, such as agents, logical reasoning, and data interpretation tasks.
DeepSeek-R1 uses a Mixture of Experts (MoE) architecture with 671 billion parameters in total. The MoE architecture activates only 37 billion parameters per token, enabling efficient inference by routing queries to the most relevant expert clusters. This approach allows the model to specialize in different problem domains while maintaining overall efficiency.
DeepSeek-R1 distilled models bring the reasoning capabilities of the main R1 model to more efficient architectures based on popular open models like Meta’s Llama (8B and 70B) and Alibaba’s Qwen (1.5B, 7B, 14B, and 32B). Distillation refers to a process of training smaller, more efficient models to mimic the behavior and reasoning patterns of the larger DeepSeek-R1 model, which acts as a teacher. For example, DeepSeek-R1-Distill-Llama-8B offers an excellent balance of performance and efficiency. By integrating this model with Amazon SageMaker AI, you can benefit from the AWS scalable infrastructure while maintaining high-quality language model capabilities.
In this post, we show how to use the distilled models in SageMaker AI, which offers several options for deploying them.
Solution overview
You can use DeepSeek’s distilled models within the AWS managed machine learning (ML) infrastructure. We demonstrate how to deploy these models on SageMaker AI inference endpoints.
SageMaker AI offers a choice of which serving container to use for deployments:
LMI container – A Large Model Inference (LMI) container with different backends (vLLM, TensorRT-LLM, and Neuron). See the following GitHub repo for more details.
TGI container – A Hugging Face Text Generation Inference (TGI) container. You can find more details in the following GitHub repo.
In the following code snippets, we use the LMI container example. See the following GitHub repo for more deployment examples using TGI, TensorRT-LLM, and Neuron.
LMI containers
LMI containers are a set of high-performance Docker containers purpose-built for LLM inference. With these containers, you can use high-performance open source inference libraries like vLLM, TensorRT-LLM, and Transformers NeuronX to deploy LLMs on SageMaker endpoints. These containers bundle together a model server with open source inference libraries to deliver an all-in-one LLM serving solution.
LMI containers provide many features, including:
Optimized inference performance for popular model architectures like Meta Llama, Mistral, Falcon, and more
Integration with open source inference libraries like vLLM, TensorRT-LLM, and Transformers NeuronX
Continuous batching for maximizing throughput at high concurrency
Token streaming
Quantization through AWQ, GPTQ, FP8, and more
Multi-GPU inference using tensor parallelism
Serving LoRA fine-tuned models
Text embedding to convert text data into numeric vectors
Speculative decoding support to decrease latency
LMI containers provide these features through integrations with popular inference libraries. A unified configuration format enables you to use the latest optimizations and technologies across libraries. To learn more about the LMI components, see Components of LMI.
Prerequisites
To run the example notebooks, you need an AWS account with an AWS Identity and Access Management (IAM) role that has permissions to manage the resources that are created. For details, refer to Create an AWS account.
If this is your first time working with Amazon SageMaker Studio, you first need to create a SageMaker domain. Additionally, you might need to request a service quota increase for the corresponding SageMaker hosting instances. In this example, you host DeepSeek-R1-Distill-Llama-8B on an ml.g5.2xlarge SageMaker hosting instance.
Deploy DeepSeek-R1 for inference
The following is a step-by-step example that demonstrates how to programmatically deploy DeepSeek-R1-Distill-Llama-8B for inference. The code for deploying the model is provided in the GitHub repo. You can clone the repo and run the notebook from SageMaker AI Studio.
Configure the SageMaker execution role and import the necessary libraries:
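The exact code is in the notebook; a minimal sketch of this step, assuming the SageMaker Python SDK and boto3, might look like the following:

```python
import boto3
import sagemaker

# Create a SageMaker session and resolve the execution role and Region
sess = sagemaker.Session()
role = sagemaker.get_execution_role()  # works inside SageMaker Studio; otherwise pass an IAM role ARN
region = sess.boto_region_name

# Runtime client used later to invoke the endpoint
smr_client = boto3.client("sagemaker-runtime", region_name=region)
```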
There are two ways to deploy an LLM like DeepSeek-R1 or its distilled variants on SageMaker:
Deploy uncompressed model weights from an Amazon S3 bucket – In this scenario, you need to set the HF_MODEL_ID variable to the Amazon Simple Storage Service (Amazon S3) prefix that contains the model artifacts. This method is generally much faster, with the model typically downloading in just a couple of minutes from Amazon S3.
Deploy directly from Hugging Face Hub (requires internet access) – To do this, set HF_MODEL_ID to the Hugging Face repository or model ID (for example, “deepseek-ai/DeepSeek-R1-Distill-Llama-8B”). However, this method tends to be slower and can take significantly longer to download the model compared to using Amazon S3. This approach will not work if enable_network_isolation is enabled, because it requires internet access to retrieve model artifacts from the Hugging Face Hub.
In this example, we deploy the model directly from the Hugging Face Hub:
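The following is a sketch of one way to express this configuration as container environment variables; apart from the model ID, the values shown are illustrative defaults you should tune for your workload:

```python
# Deploy directly from the Hugging Face Hub (requires internet access from the endpoint)
model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"

env = {
    "HF_MODEL_ID": model_id,                 # or an s3:// prefix with uncompressed model artifacts
    "OPTION_MAX_ROLLING_BATCH_SIZE": "16",   # caps concurrent requests to bound GPU memory use
    "OPTION_MAX_MODEL_LEN": "10000",         # maximum context length (illustrative)
    "OPTION_TENSOR_PARALLEL_DEGREE": "max",  # shard across all available GPUs
}
```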
The OPTION_MAX_ROLLING_BATCH_SIZE parameter limits the number of concurrent requests that the endpoint can process. We set it to 16 to limit GPU memory requirements; you should adjust it based on your latency and throughput requirements.
Create and deploy the model:
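A minimal sketch of this step follows; the LMI container version string is illustrative, so use the latest available release in your Region:

```python
from sagemaker import image_uris
from sagemaker.model import Model
from sagemaker.utils import name_from_base

# Retrieve the LMI container image URI for this Region
# (the version shown is illustrative; pick the latest available LMI release)
inference_image_uri = image_uris.retrieve(
    framework="djl-lmi", region=region, version="0.31.0"
)

endpoint_name = name_from_base("deepseek-r1-distill-llama-8b")

lmi_model = Model(
    image_uri=inference_image_uri,
    env=env,
    role=role,
    name=endpoint_name,
    sagemaker_session=sess,
)

# Downloading the weights from the Hugging Face Hub can take a while,
# so allow a generous container startup health check timeout
lmi_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    endpoint_name=endpoint_name,
    container_startup_health_check_timeout=900,
)
```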
Make inference requests:
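The following sketch uses the boto3 runtime client and the LMI container’s default inputs/parameters schema; the prompt and generation parameters are only examples:

```python
import json

payload = {
    "inputs": "Which is larger, 9.11 or 9.9? Think step by step.",
    "parameters": {"max_new_tokens": 512, "temperature": 0.6, "top_p": 0.9},
}

response = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(payload),
)

result = json.loads(response["Body"].read())
print(result)  # the LMI handler typically returns the completion in a "generated_text" field
```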
Performance and cost considerations
The ml.g5.2xlarge instance provides a good balance of performance and cost. For large-scale, real-time inference, use larger batch sizes to optimize cost and performance, and consider batch transform for offline, large-volume inference to reduce costs. Monitor endpoint utilization to keep costs under control.
Clean up
Clean up your resources when they’re no longer needed:
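A sketch of the cleanup step, deleting the endpoint, endpoint configuration, and model created above:

```python
sm_client = boto3.client("sagemaker", region_name=region)

# Look up the endpoint configuration backing the endpoint before deleting it
endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
config_name = endpoint["EndpointConfigName"]

sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=config_name)
sm_client.delete_model(ModelName=endpoint_name)
```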
Security
You can configure advanced security and infrastructure settings for the DeepSeek-R1 model, including virtual private cloud (VPC) networking, service role permissions, encryption settings, and EnableNetworkIsolation to restrict internet access. For production deployments, it’s essential to review these settings to maintain alignment with your organization’s security and compliance requirements.
By default, the model runs in a shared AWS managed VPC with internet access. To enhance security and control access, you should explicitly configure a private VPC with appropriate security groups and IAM policies based on your requirements.
SageMaker AI provides enterprise-grade security features to help keep your data and applications secure and private. We do not share your data with model providers unless you direct us to, giving you full control over your data. This applies to all models on SageMaker, both proprietary and publicly available, including DeepSeek-R1.
For more details, see Configure security in Amazon SageMaker AI.
Logging and monitoring
You can monitor SageMaker AI using Amazon CloudWatch, which collects and processes raw data into readable, near real-time metrics. These metrics are retained for 15 months, allowing you to analyze historical trends and gain deeper insights into your application’s performance and health.
Additionally, you can configure alarms to monitor specific thresholds and trigger notifications or automated actions when those thresholds are met, helping you proactively manage your deployment.
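As an illustration, the following sketch creates a CloudWatch alarm on the endpoint’s ModelLatency metric; the alarm name, threshold, and evaluation window are assumptions you should adapt to your own service-level objectives:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")
endpoint_name = "deepseek-r1-distill-llama-8b"  # name of the endpoint created earlier

# Alarm when average model latency exceeds 10 seconds
# (ModelLatency is reported in microseconds in the AWS/SageMaker namespace)
cloudwatch.put_metric_alarm(
    AlarmName="deepseek-r1-endpoint-high-latency",
    Namespace="AWS/SageMaker",
    MetricName="ModelLatency",
    Dimensions=[
        {"Name": "EndpointName", "Value": endpoint_name},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=10_000_000,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)
```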
For more details, see Metrics for monitoring Amazon SageMaker AI with Amazon CloudWatch.
Best practices
It’s always recommended to deploy your LLM endpoints inside your VPC, behind a private subnet without internet gateways, and preferably with no egress. Ingress from the internet should also be blocked to minimize security risks.
Always apply guardrails to make sure incoming requests and outgoing model responses are validated for safety, bias, and toxicity. You can guard your SageMaker endpoint model responses with Amazon Bedrock Guardrails. See DeepSeek-R1 model now available in Amazon Bedrock Marketplace and Amazon SageMaker JumpStart for more details.
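One possible pattern, sketched below, is to call the Amazon Bedrock ApplyGuardrail API on both the prompt and the generated text around the SageMaker invocation; the guardrail identifier and version are placeholders you would replace with your own:

```python
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def passes_guardrail(text: str, source: str) -> bool:
    """Return True if the guardrail does not intervene. Guardrail ID and version are placeholders."""
    response = bedrock_runtime.apply_guardrail(
        guardrailIdentifier="your-guardrail-id",  # placeholder: your guardrail's ID
        guardrailVersion="1",                     # placeholder: your guardrail's version
        source=source,                            # "INPUT" for prompts, "OUTPUT" for model responses
        content=[{"text": {"text": text}}],
    )
    return response["action"] != "GUARDRAIL_INTERVENED"

prompt = "Which is larger, 9.11 or 9.9?"
if passes_guardrail(prompt, "INPUT"):
    # Invoke the SageMaker endpoint here, then validate the generated text as well,
    # for example: passes_guardrail(generated_text, "OUTPUT")
    pass
```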
Inference performance evaluation
In this section, we focus on inference performance of DeepSeek-R1 distilled variants on SageMaker AI. Evaluating the performance of LLMs in terms of end-to-end latency, throughput, and resource efficiency is crucial for ensuring responsiveness, scalability, and cost-effectiveness in real-world applications. Optimizing these metrics directly impacts user experience, system reliability, and deployment feasibility at scale. For this post, we test all DeepSeek-R1 distilled variants—1.5B, 7B, 8B, 14B, 32B, and 70B—across four performance metrics:
End-to-end latency (time between sending a request and receiving the response)
Throughput (tokens per second)
Time to first token
Inter-token latency
The main purpose of this performance evaluation is to give you an indication of the relative performance of the distilled R1 models on different hardware for generic traffic patterns. We didn’t try to optimize performance for each model/hardware/use case combination. These results should not be treated as the best possible performance of a particular model on a particular instance type. You should always perform your own testing using your own datasets, traffic patterns, and input/output sequence lengths.
Scenarios
We tested the following scenarios:
Container/model configuration – We used LMI container v14 with default parameters, except MAX_MODEL_LEN, which was set to 10000 (no chunked prefill and no prefix caching). On instances with multiple accelerators, we sharded the model across all available GPUs.
Tokens – We evaluated the SageMaker endpoint hosted DeepSeek-R1 distilled variants on performance benchmarks using two sample input token lengths (see the measurement sketch after this list). We ran each test 50 times and measured the average across the different metrics, then repeated the tests with a concurrency of 10.
Short-length test – 512 input tokens and 256 output tokens.
Medium-length test – 3072 input tokens and 256 output tokens.
Hardware – We tested the distilled variants on a variety of instance types with 1, 4, or 8 GPUs per instance. In the following table, a green cell indicates that a model was tested on that particular instance type, and a red cell indicates that the model wasn’t tested on that instance type, either because the instance was excessive for the model size or too small to fit the model in memory.
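For reference, the following is a minimal sketch of how end-to-end latency and time to first token can be measured against a streaming SageMaker endpoint. It is not the harness we used, and it assumes the LMI default inputs/parameters payload schema:

```python
import json
import time
import boto3

smr = boto3.client("sagemaker-runtime")

def measure_once(endpoint_name: str, prompt: str, max_new_tokens: int = 256):
    """Measure time to first token and end-to-end latency for one streaming request."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    start = time.perf_counter()
    response = smr.invoke_endpoint_with_response_stream(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    ttft = None
    chunk_times = []
    for event in response["Body"]:
        if "PayloadPart" in event:
            now = time.perf_counter()
            if ttft is None:
                ttft = now - start  # first streamed chunk approximates time to first token
            chunk_times.append(now)
    e2e = time.perf_counter() - start
    # Gaps between chunks approximate inter-token latency when chunks carry single tokens
    gaps = [b - a for a, b in zip(chunk_times, chunk_times[1:])]
    return ttft, e2e, gaps
```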
Box plots
In the following sections, we use box plots to visualize model performance. A box plot is a concise visual summary that displays a dataset’s median, interquartile range (IQR), and potential outliers using a box for the middle 50% of the data, with whiskers extending to the smallest and largest non-outlier values. By examining the median’s placement within the box, the box’s size, and the whiskers’ lengths, you can quickly assess the data’s central tendency, variability, and skewness, as illustrated in the following figure.
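As a quick illustration using synthetic data (not our benchmark results), the following snippet shows how such a box plot can be produced:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic latency samples (seconds) for two hypothetical instance types
rng = np.random.default_rng(0)
samples = [rng.normal(2.0, 0.3, 50), rng.normal(2.6, 0.5, 50)]

fig, ax = plt.subplots()
ax.boxplot(samples)  # box = middle 50% (IQR), line = median, whiskers = non-outlier range
ax.set_xticklabels(["instance-a", "instance-b"])
ax.set_ylabel("End-to-end latency (s)")
plt.show()
```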
DeepSeek-R1-Distill-Qwen-1.5B
This model can be deployed on a single GPU instance. The results indicate that the ml.g5.xlarge instance outperforms the ml.g6.xlarge instance across all measured performance criteria and concurrency settings.
The following figure illustrates testing with concurrency = 1.
The following figure illustrates testing with concurrency = 10.
DeepSeek-R1-Distill-Qwen-7B
DeepSeek-R1-Distill-Qwen-7B was tested on ml.g5.2xlarge and ml.g6e.2xlarge. Of the two, ml.g6e.2xlarge demonstrated the highest performance.
The following figure illustrates testing with concurrency = 1.
The following figure illustrates testing with concurrency = 10.
DeepSeek-R1-Distill-Llama-8B
DeepSeek-R1-Distill-Llama-8B was benchmarked across ml.g5.2xlarge, ml.g5.12xlarge, ml.g6e.2xlarge, and ml.g6e.12xlarge, with ml.g6e.12xlarge demonstrating the highest performance among all instances.
The following figure illustrates testing with concurrency = 1.
The following figure illustrates testing with concurrency = 10.
DeepSeek-R1-Distill-Qwen-14B
We tested this model on ml.g6.12xlarge, ml.g5.12xlarge, ml.g6e.48xlarge, and ml.g6e.12xlarge. The instance with 8 GPUs (ml.g6e.48xlarge) showed the best results.
The following figure illustrates testing with concurrency = 1.
The following figure illustrates testing with concurrency = 10.
DeepSeek-R1-Distill-Qwen-32B
This is a fairly large model, and we only deployed it on multi-GPU instances: ml.g6.12xlarge, ml.g5.12xlarge, and ml.g6e.12xlarge. The latest generation (ml.g6e.12xlarge) showed the best performance across all concurrency settings.
The following figure illustrates testing with concurrency = 1.
The following figure illustrates testing with concurrency = 10.
DeepSeek-R1-Distill-Llama-70B
We tested this model on two different 8-GPU instances: ml.g6e.48xlarge and ml.p4d.24xlarge. The latter showed the best performance.
The following figure illustrates testing with concurrency = 1.
The following figure illustrates testing with concurrency = 10.
Conclusion
Deploying DeepSeek models on SageMaker AI provides a robust solution for organizations seeking to use state-of-the-art language models in their applications. The combination of DeepSeek’s powerful models and SageMaker AI managed infrastructure offers a scalable and efficient approach to natural language processing tasks.
The performance evaluation section presents a comprehensive comparison of all DeepSeek-R1 distilled models across four key inference metrics, using 13 different NVIDIA accelerator instance types. This analysis offers valuable insights to help you select the optimal instance type for your DeepSeek-R1 deployment.
Check out the complete code in the following GitHub repos:
Deploy a DeepSeek model Quantized LLaMA 3.1 70B Instruct Model Using SageMaker Endpoints and SageMaker Large Model Inference (LMI) Container
Deploy DeepSeek R1 Large Language Model from HuggingFace Hub on Amazon SageMaker
Deploy deepseek-ai/DeepSeek-R1-Distill-* models on Amazon SageMaker using LMI container
Deploy DeepSeek R1 Llama on AWS Inferentia using SageMaker Large Model Inference Container
Interactive fine-tuning of Foundation Models with Amazon SageMaker Training using @remote decorator
For additional resources, refer to:
Amazon SageMaker Documentation
Deepseek Model Hub
Hugging Face on Amazon SageMaker
About the Authors
Dmitry Soldatkin is a Senior AI/ML Solutions Architect at Amazon Web Services (AWS), helping customers design and build AI/ML solutions. Dmitry’s work covers a wide range of ML use cases, with a primary interest in Generative AI, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including insurance, financial services, utilities, and telecommunications. You can connect with Dmitry on LinkedIn.
Vivek Gangasani is a Lead Specialist Solutions Architect for Inference at AWS. He helps emerging generative AI companies build innovative solutions using AWS services and accelerated compute. Currently, he is focused on developing strategies for fine-tuning and optimizing the inference performance of large language models. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.
Prasanna Sridharan is a Principal Gen AI/ML Architect at AWS, specializing in designing and implementing AI/ML and Generative AI solutions for enterprise customers. With a passion for helping AWS customers build innovative Gen AI applications, he focuses on creating scalable, cutting-edge AI solutions that drive business transformation. You can connect with Prasanna on LinkedIn.
Pranav Murthy is an AI/ML Specialist Solutions Architect at AWS. He focuses on helping customers build, train, deploy and migrate machine learning (ML) workloads to SageMaker. He previously worked in the semiconductor industry developing large computer vision (CV) and natural language processing (NLP) models to improve semiconductor processes using state of the art ML techniques. In his free time, he enjoys playing chess and traveling. You can find Pranav on LinkedIn.