Reduce conversational AI response time through inference at the edge with AWS Local Zones

Recent advances in generative AI have led to the proliferation of a new generation of conversational AI assistants powered by foundation models (FMs). These latency-sensitive applications enable real-time text and voice interactions, responding naturally to human conversations. Their applications span a variety of sectors, including customer service, healthcare, education, personal and business productivity, and many others.

Conversational AI assistants are typically deployed directly on users’ devices, such as smartphones, tablets, or desktop computers, enabling quick, local processing of voice or text input. However, the FM that powers the assistant’s natural language understanding and response generation is usually cloud-hosted, running on powerful GPUs. When a user interacts with the AI assistant, their device first processes the input locally, including speech-to-text (STT) conversion for voice agents, and compiles a prompt. This prompt is then securely transmitted to the cloud-based FM over the network. The FM analyzes the prompt and begins generating an appropriate response, streaming it back to the user’s device. The device further processes this response, including text-to-speech (TTS) conversion for voice agents, before presenting it to the user. This efficient workflow strikes a balance between the powerful capabilities of cloud-based FMs and the convenience and responsiveness of local device interaction, as illustrated in the following figure.

A critical challenge in developing such applications is reducing response latency to enable real-time, natural interactions. Response latency refers to the time between the user finishing their speech and beginning to hear the AI assistant’s response. This delay typically comprises two primary components:

On-device processing latency – This encompasses the time required for local processing, including TTS and STT operations.
Time to first token (TTFT) – This measures the interval between the device sending a prompt to the cloud and receiving the first token of the response. TTFT consists of two components. First is the network latency, which is the round-trip time for data transmission between the device and the cloud. Second is the first token generation time, which is the period between the FM receiving a complete prompt and generating the first output token. TTFT is crucial for user experience in conversational AI interfaces that use response streaming with FMs. With response streaming, users start receiving the response while it’s still being generated, significantly improving perceived latency.

The ideal response latency for humanlike conversation flow is generally considered to be in the 200–500 milliseconds (ms) range, closely mimicking natural pauses in human conversation. Given the additional on-device processing latency, achieving this target requires a TTFT well below 200 ms.
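The arithmetic behind this target can be sketched in a few lines of Python. The breakdown follows the two components defined above. The specific numbers (300 ms of on-device processing, a 60 ms round trip, and 90 ms to generate the first token) are illustrative assumptions, not measurements.

```python
from dataclasses import dataclass


@dataclass
class LatencyBreakdown:
    """Response latency components, all in milliseconds."""
    on_device_ms: float        # local STT/TTS and prompt assembly
    network_rtt_ms: float      # round trip between device and FM endpoint
    first_token_gen_ms: float  # FM time from complete prompt to first token

    @property
    def ttft_ms(self) -> float:
        # TTFT = network latency + first token generation time
        return self.network_rtt_ms + self.first_token_gen_ms

    @property
    def response_latency_ms(self) -> float:
        # Response latency = on-device processing + TTFT
        return self.on_device_ms + self.ttft_ms


def ttft_budget_ms(target_ms: float, on_device_ms: float) -> float:
    """TTFT that still fits the target once on-device time is spent."""
    return max(0.0, target_ms - on_device_ms)


# Illustrative values only (assumptions, not measured data).
example = LatencyBreakdown(on_device_ms=300.0,
                           network_rtt_ms=60.0,
                           first_token_gen_ms=90.0)
budget = ttft_budget_ms(target_ms=500.0, on_device_ms=example.on_device_ms)

print(f"TTFT: {example.ttft_ms:.0f} ms (budget: {budget:.0f} ms)")
print(f"Response latency: {example.response_latency_ms:.0f} ms")
```

With these assumed values, the on-device share leaves only 200 ms for TTFT at the upper end of the target range, so every millisecond of network round trip comes directly out of that budget.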

Although many customers focus on optimizing the technology stack behind the FM inference endpoint through techniques such as model optimization, hardware acceleration, and semantic caching to reduce the TTFT, they often overlook the significant impact of network latency. This latency can vary considerably due to geographic distance between users and cloud services, as well as the diverse quality of internet connectivity.

Hybrid architecture with AWS Local Zones

To minimize the impact of network latency on TTFT for users regardless of their locations, a hybrid architecture can be implemented by extending AWS services from commercial Regions to edge locations closer to end users. This approach involves deploying additional inference endpoints on AWS edge services and using Amazon Route 53 to implement dynamic routing policies, such as geolocation routing, geoproximity routing, or latency-based routing. These strategies dynamically distribute traffic between edge locations and commercial Regions, providing fast response times based on real-time network conditions and user locations.
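To make the routing idea concrete, the following is a toy, client-side illustration of proximity-based endpoint selection. It is not how Route 53 is implemented or configured: Route 53 makes this decision at the DNS layer using its own data. The endpoint labels and coordinates below are rough, hypothetical placeholders chosen for the example.

```python
import math

# Approximate (latitude, longitude) pairs. These are assumptions for
# illustration only, not official AWS facility locations.
ENDPOINTS = {
    "US West (Oregon) Region": (45.84, -119.70),
    "Los Angeles Local Zone": (34.05, -118.24),
    "Honolulu Local Zone": (21.31, -157.86),
}

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def nearest_endpoint(user_location, endpoints=ENDPOINTS):
    """Return the label of the endpoint geographically closest to the user."""
    return min(endpoints,
               key=lambda name: haversine_km(user_location, endpoints[name]))


# A user near downtown Los Angeles and a user in Honolulu.
print(nearest_endpoint((34.0, -118.3)))
print(nearest_endpoint((21.3, -157.8)))
```

Geographic distance is only a proxy for network latency. Latency-based routing, which relies on measured latency rather than distance, can pick a different endpoint when network paths are indirect.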

AWS Local Zones are a type of edge infrastructure deployment that places select AWS services close to large population and industry centers. They enable applications requiring very low latency or local data processing using familiar APIs and tool sets. Each Local Zone is a logical extension of a corresponding parent AWS Region, which means customers can extend their Amazon Virtual Private Cloud (Amazon VPC) connections by creating a new subnet with a Local Zone assignment.

This guide demonstrates how to deploy an open source FM from Hugging Face on Amazon Elastic Compute Cloud (Amazon EC2) instances across three locations: a commercial AWS Region and two AWS Local Zones. Through comparative benchmarking tests, we illustrate how deploying FMs in Local Zones closer to end users can significantly reduce latency—a critical factor for real-time applications such as conversational AI assistants.

Prerequisites

To run this demo, complete the following prerequisites:

Create an AWS account, if you don’t already have one.
Enable the Local Zones in Los Angeles and Honolulu in the parent Region US West (Oregon). For a full list of available Local Zones, refer to the Local Zones locations page. Next, create a subnet inside each Local Zone. Detailed instructions for enabling Local Zones and creating subnets within them can be found at Getting started with AWS Local Zones.
Submit an Amazon EC2 service quota increase for access to Amazon EC2 G4dn instances. Select the Running On-Demand G and VT instances as the quota type and at least 24 vCPUs for the quota size.
Create a Hugging Face read token from huggingface.co/settings/tokens.
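If you prefer to script the Local Zone prerequisite rather than use the console, the following sketch shows one possible approach with the AWS SDK for Python (Boto3). It only builds the list of EC2 API calls and prints them; nothing is sent to AWS unless you run the commented lines at the end. The zone names match the ones used later in this post, but the zone group names, the VPC ID, and the CIDR blocks are assumptions that you should verify or replace (for example, by checking the output of describe_availability_zones).

```python
PARENT_REGION = "us-west-2"

# Local Zone name -> zone group name. The group names are an assumption
# (zone name without the trailing letter); verify them in your account.
LOCAL_ZONES = {
    "us-west-2-lax-1a": "us-west-2-lax-1",
    "us-west-2-hnl-1a": "us-west-2-hnl-1",
}


def local_zone_setup_calls(vpc_id, cidr_by_zone):
    """Build the EC2 API calls that opt in to each zone group and create
    one subnet per Local Zone. Returns a list of (operation, kwargs)."""
    calls = []
    for zone, cidr in cidr_by_zone.items():
        calls.append(("modify_availability_zone_group",
                      {"GroupName": LOCAL_ZONES[zone],
                       "OptInStatus": "opted-in"}))
        calls.append(("create_subnet",
                      {"VpcId": vpc_id,
                       "CidrBlock": cidr,
                       "AvailabilityZone": zone}))
    return calls


def apply_calls(ec2_client, calls):
    """Execute the planned calls against an EC2 client, in order."""
    return [getattr(ec2_client, op)(**kwargs) for op, kwargs in calls]


# Hypothetical VPC ID and CIDR blocks; replace with your own values.
calls = local_zone_setup_calls(
    "vpc-0123456789abcdef0",
    {"us-west-2-lax-1a": "10.0.10.0/24", "us-west-2-hnl-1a": "10.0.11.0/24"},
)
for op, kwargs in calls:
    print(op, kwargs)

# To run the plan for real (requires boto3 and AWS credentials):
#   import boto3
#   apply_calls(boto3.client("ec2", region_name=PARENT_REGION), calls)
```

Separating the plan from its execution makes it possible to review exactly what will be created before any call is made. The opt-in might not take effect immediately, so subnet creation may need to be retried after a short wait.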

Solution walkthrough

This section walks you through the steps to launch an Amazon EC2 G4dn instance and deploy an FM for inference in the Los Angeles Local Zone. The instructions are also applicable for deployments in the parent Region, US West (Oregon), and the Honolulu Local Zone.

We use Meta’s open source Llama 3.2-3B as the FM for this demonstration. This is a lightweight FM from the Llama 3.2 family, classified as a small language model (SLM) due to its small number of parameters. Compared to large language models (LLMs), SLMs are more efficient and cost-effective to train and deploy, excel when fine-tuned for specific tasks, offer faster inference times, and have lower resource requirements. These characteristics make SLMs particularly well-suited for deployment on edge services such as AWS Local Zones.

To launch an EC2 instance in the Los Angeles Local Zone subnet, follow these steps:

On the Amazon EC2 console dashboard, in the Launch instance box, choose Launch instance.
Under Name and tags, enter a descriptive name for the instance (for example, la-local-zone-instance).
Under Application and OS Images (Amazon Machine Image), select an AWS Deep Learning AMI that comes preconfigured with NVIDIA OSS driver and PyTorch. For our deployment, we used Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.3.1 (Amazon Linux 2).
Under Instance type, from the Instance type list, select the hardware configuration for your instance that’s supported in a Local Zone. We selected g4dn.2xlarge for this solution. This instance is equipped with one NVIDIA T4 Tensor Core GPU and 16 GB of GPU memory, which makes it well suited for high-performance, cost-effective inference of SLMs on the edge. Available instance types for each Local Zone can be found at AWS Local Zones features. Review the hardware requirements for your FM to select the appropriate instance.
Under Key pair (login), choose an existing key pair or create a new one.
Next to Network settings, choose Edit, and then:

Select your VPC.
Select your Local Zone subnet.
Create a security group or select an existing one. Configure the security group’s inbound rules to allow traffic only from your client’s IP address on port 8080 (the TGI endpoint) and on port 22 (SSH, which is used later to connect to the instance).

You can keep the default selections for the other configuration settings for your instance. To determine the storage types that are supported, refer to the Compute and storage section in AWS Local Zones features.
Review the summary of your instance configuration in the Summary panel and, when you’re ready, choose Launch instance.
A confirmation page lets you know that your instance is launching. Choose View all instances to close the confirmation page and return to the console.

Next, complete the following steps to deploy Llama 3.2-3B using the Hugging Face Text Generation Inference (TGI) as the model server:

Connect to the instance by using Secure Shell (SSH).
Start the Docker service using the following command. Docker comes preinstalled with the AMI we selected.

sudo service docker start

Run the following command to download and run the Docker image for the TGI server along with the Llama 3.2-3B model. In our deployment, we used Docker image version 2.4.0, but results might vary based on your selected version. The full list of models supported by TGI can be found at Hugging Face Supported Models. For more details about the deployment and optimization of TGI, refer to the text-generation-inference GitHub page.

model=meta-llama/Llama-3.2-3B
volume=$PWD/data
token=<ENTER YOUR HUGGING FACE TOKEN>

sudo docker run -d --gpus all \
    --shm-size 1g \
    -e HF_TOKEN=$token \
    -p 8080:80 \
    -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.4.0 \
    --model-id $model

After the TGI container is running, you can test your endpoint by running the following command from your local environment:

curl <REPLACE WITH YOUR EC2 PUBLIC IP>:8080/generate -X POST \
    -d '{"inputs":"What is deep learning?","parameters":{"max_new_tokens":200, "temperature":0.2, "top_p":0.9}}' \
    -H 'Content-Type: application/json'
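The same request can also be issued from Python, which makes it straightforward to time the first streamed chunk as a rough TTFT check. The following is a minimal sketch that uses only the standard library. It assumes TGI's streaming route is /generate_stream and that it emits server-sent events on lines beginning with "data:". The IP address is a hypothetical placeholder, and the helper names are ours rather than part of TGI.

```python
import json
import time
import urllib.request

# Same sampling parameters as the curl example above.
DEFAULT_PARAMETERS = {"max_new_tokens": 200, "temperature": 0.2, "top_p": 0.9}


def build_request(base_url, prompt, stream=False):
    """Build a POST request for the TGI endpoint (streaming or not)."""
    path = "/generate_stream" if stream else "/generate"
    payload = {"inputs": prompt, "parameters": DEFAULT_PARAMETERS}
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def time_to_first_chunk_ms(base_url, prompt, timeout=30.0):
    """Send a streaming request and return the milliseconds elapsed until
    the first server-sent event line arrives (an approximation of TTFT)."""
    request = build_request(base_url, prompt, stream=True)
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=timeout) as response:
        for line in response:
            if line.startswith(b"data:"):
                return (time.perf_counter() - start) * 1000.0
    raise RuntimeError("stream ended before any token event arrived")


# Hypothetical placeholder address; replace with your EC2 public IP.
BASE_URL = "http://203.0.113.10:8080"
request = build_request(BASE_URL, "What is deep learning?")
print(request.get_method(), request.full_url)

# To measure against a running endpoint:
#   print(time_to_first_chunk_ms(BASE_URL, "What is deep learning?"))
```

A single measurement like this is noisy. The load test described in the next section repeats the request many times and reports percentiles.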

Performance evaluation

To demonstrate TTFT improvements with FM inference on Local Zones, we followed the steps in the previous section to deploy Llama 3.2-3B in three locations: in the us-west-2c Availability Zone in the parent Region, US West (Oregon); in the us-west-2-lax-1a Local Zone in Los Angeles; and in the us-west-2-hnl-1a Local Zone in Honolulu. This is illustrated in the following figure. Note that the architecture provided in this post is meant to be used for performance evaluation in a development environment. Before migrating any of the provided architecture to production, we recommend following the AWS Well-Architected Framework.

We conducted two separate test scenarios to evaluate TTFT as explained in the following:

Los Angeles test scenario:

Test user’s location – Los Angeles metropolitan area
Test A – 150 requests sent to FM deployed in Los Angeles Local Zone
Test B – 150 requests sent to FM deployed in US West (Oregon)

Honolulu test scenario:

Test user’s location – Honolulu metropolitan area
Test C – 150 requests sent to FM deployed in Honolulu Local Zone
Test D – 150 requests sent to FM deployed in US West (Oregon)

Evaluation setup

To conduct TTFT measurements, we use the load testing capabilities of the open source project LLMPerf. This tool launches multiple requests from the test user’s client to the FM endpoint and measures various performance metrics, including TTFT. Each request contains a random prompt with a mean token count of 250 tokens. Although a single prompt for short-form conversations typically consists of 50 tokens, we set the mean input token size to 250 tokens to account for multi-turn conversation history, system prompts, and contextual information that better represents real-world usage patterns.

Detailed instructions for installing LLMPerf and executing the load testing are available in the project’s documentation. Additionally, because we are using the Hugging Face TGI as the inference server, we follow the corresponding instructions from LLMPerf to perform the load testing. The following is the example command to initiate the load testing from the command line:

export HUGGINGFACE_API_BASE="http://<REPLACE WITH YOUR EC2 PUBLIC IP>:8080"
export HUGGINGFACE_API_KEY=""

python token_benchmark_ray.py \
  --model "huggingface/meta-llama/Llama-3.2-3B" \
  --mean-input-tokens 250 \
  --stddev-input-tokens 50 \
  --mean-output-tokens 100 \
  --stddev-output-tokens 20 \
  --max-num-completed-requests 150 \
  --timeout 600 \
  --num-concurrent-requests 1 \
  --results-dir "result_outputs" \
  --llm-api "litellm" \
  --additional-sampling-params '{}'

Each test scenario compares TTFT between the Local Zone and parent Region endpoints to assess the impact of geographical distance. Latency results might vary based on several factors, including:

Test parameters and configuration
Time of day and network traffic
Internet service provider
Specific client location within the test area
Current server load

Results

The following tables present TTFT measurements in milliseconds (ms) for two distinct test scenarios. The results demonstrate significant TTFT reductions when using a Local Zone compared to the parent Region for both the Los Angeles and the Honolulu test scenarios. The observed differences in TTFT are solely attributed to network latency because identical FM inference configurations were employed in both the Local Zone and the parent Region.

User location: Los Angeles Metropolitan Area

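| LLM inference endpoint | Mean (ms) | Min (ms) | P25 (ms) | P50 (ms) | P75 (ms) | P95 (ms) | P99 (ms) | Max (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Parent Region: US West (Oregon) | 135 | 118 | 125 | 130 | 139 | 165 | 197 | 288 |
| Local Zone: Los Angeles | 80 | 50 | 72 | 75 | 86 | 116 | 141 | 232 |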

The user in Los Angeles achieved a mean TTFT of 80 ms when calling the FM endpoint in the Los Angeles Local Zone, compared to 135 ms for the endpoint in the US West (Oregon) Region. This represents a 55 ms (about 41%) reduction in latency.

User location: Honolulu Metropolitan Area

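| LLM inference endpoint | Mean (ms) | Min (ms) | P25 (ms) | P50 (ms) | P75 (ms) | P95 (ms) | P99 (ms) | Max (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Parent Region: US West (Oregon) | 197 | 172 | 180 | 183 | 187 | 243 | 472 | 683 |
| Local Zone: Honolulu | 114 | 58 | 70 | 85 | 164 | 209 | 273 | 369 |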

The user in Honolulu achieved a mean TTFT of 114 ms when calling the FM endpoint in the Honolulu Local Zone, compared to 197 ms for the endpoint in the US West (Oregon) Region. This represents an 83 ms (about 42%) reduction in latency.

Moreover, the TTFT reduction achieved by Local Zone deployments holds for every reported metric in both test scenarios, from the minimum through all percentiles (P25–P99) to the maximum, indicating that the improvement applies across the full range of requests rather than only on average.
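The reductions quoted above can be reproduced directly from the two tables. The following short sketch recomputes the absolute and relative TTFT reduction for every metric. The numbers are copied from the tables; the variable and function names are ours.

```python
METRICS = ["Mean", "Min", "P25", "P50", "P75", "P95", "P99", "Max"]

# TTFT in milliseconds, copied from the results tables.
RESULTS_MS = {
    "Los Angeles": {
        "parent": [135, 118, 125, 130, 139, 165, 197, 288],
        "local_zone": [80, 50, 72, 75, 86, 116, 141, 232],
    },
    "Honolulu": {
        "parent": [197, 172, 180, 183, 187, 243, 472, 683],
        "local_zone": [114, 58, 70, 85, 164, 209, 273, 369],
    },
}


def reductions(parent, local_zone):
    """Return (absolute ms, percent) reduction for each metric,
    relative to the parent Region value."""
    return [(p - lz, 100.0 * (p - lz) / p) for p, lz in zip(parent, local_zone)]


for location, data in RESULTS_MS.items():
    print(f"User location: {location}")
    rows = reductions(data["parent"], data["local_zone"])
    for metric, (saved_ms, saved_pct) in zip(METRICS, rows):
        print(f"  {metric:>4}: {saved_ms:>3} ms lower ({saved_pct:.0f}%)")
```

Every row comes out positive for both locations, which is what the claim of an improvement across all metrics amounts to.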

Finally, remember that TTFT is just one component of overall response latency, alongside on-device processing latency. By reducing TTFT using Local Zones, you create additional margin for on-device processing latency, making it easier to achieve the target response latency range needed for humanlike conversation.

Cleanup

In this post, we enabled Local Zones and created subnets, security groups, and EC2 instances. To avoid incurring additional charges, clean up these resources when they’re no longer needed by following these steps:

Terminate the EC2 instances and delete their associated Amazon Elastic Block Store (Amazon EBS) volumes.
Delete the security groups and subnets.
Disable the Local Zones.
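The first two cleanup steps can also be scripted. The following is a minimal sketch using Boto3 EC2 client operations, with the order chosen so that security groups and subnets are deleted only after the instances that use them have terminated. The function name and the resource IDs in the usage comment are hypothetical. The sketch assumes that the instances' EBS volumes are set to be deleted on termination; any volume that remains would need to be deleted separately. Disabling the Local Zones is left to the console.

```python
def cleanup_demo_resources(ec2_client, instance_ids, security_group_ids,
                           subnet_ids):
    """Terminate the demo instances, wait for termination to complete,
    then delete the security groups and subnets they were using."""
    ec2_client.terminate_instances(InstanceIds=instance_ids)

    # Security groups and subnets cannot be deleted while instances
    # still reference them, so block until termination has finished.
    waiter = ec2_client.get_waiter("instance_terminated")
    waiter.wait(InstanceIds=instance_ids)

    for group_id in security_group_ids:
        ec2_client.delete_security_group(GroupId=group_id)
    for subnet_id in subnet_ids:
        ec2_client.delete_subnet(SubnetId=subnet_id)


# Example usage (requires boto3 and AWS credentials; IDs are placeholders):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-west-2")
#   cleanup_demo_resources(
#       ec2,
#       instance_ids=["i-0123456789abcdef0"],
#       security_group_ids=["sg-0123456789abcdef0"],
#       subnet_ids=["subnet-0123456789abcdef0"],
#   )
```

Passing the client in as an argument keeps the function usable across accounts and Regions and makes it possible to exercise the call order with a stub before running it against real resources.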

Conclusion

In conclusion, this post highlights how edge computing services, such as AWS Local Zones, play a crucial role in reducing FM inference latency for conversational AI applications. Our test deployments of Meta’s Llama 3.2-3B demonstrated that placing FM inference endpoints closer to end users through Local Zones reduced mean TTFT by about 41% in the Los Angeles scenario and about 42% in the Honolulu scenario compared to deployment in the parent Region. This TTFT reduction plays a critical role in optimizing the overall response latency, helping achieve the target response times essential for natural, humanlike interactions regardless of user location.

To use these benefits for your own applications, we encourage you to explore the AWS Local Zones documentation. There, you’ll find information on available locations and supported AWS services so you can bring the power of edge computing to your conversational AI solutions.

About the Authors

Nima Seifi is a Solutions Architect at AWS, based in Southern California, where he specializes in SaaS and LLMOps. He serves as a technical advisor to startups building on AWS. Prior to AWS, he worked as a DevOps architect in the e-commerce industry for over 5 years, following a decade of R&D work in mobile internet technologies. Nima has authored 20+ technical publications and holds 7 U.S. patents. Outside of work, he enjoys reading, watching documentaries, and taking beach walks.

Nelson Ong is a Solutions Architect at Amazon Web Services. He works with early stage startups across industries to accelerate their cloud adoption.