Recent advances in generative AI have led to the proliferation of a new generation of conversational AI assistants powered by foundation models (FMs). These latency-sensitive applications enable real-time text and voice interactions, responding naturally to human conversations. Their applications span a variety of sectors, including customer service, healthcare, education, and personal and business productivity.
Conversational AI assistants are typically deployed directly on users’ devices, such as smartphones, tablets, or desktop computers, enabling quick, local processing of voice or text input. However, the FM that powers the assistant’s natural language understanding and response generation is usually cloud-hosted, running on powerful GPUs. When a user interacts with the AI assistant, their device first processes the input locally, including speech-to-text (STT) conversion for voice agents, and compiles a prompt. This prompt is then securely transmitted to the cloud-based FM over the network. The FM analyzes the prompt and begins generating an appropriate response, streaming it back to the user’s device. The device further processes this response, including text-to-speech (TTS) conversion for voice agents, before presenting it to the user. This efficient workflow strikes a balance between the powerful capabilities of cloud-based FMs and the convenience and responsiveness of local device interaction, as illustrated in the following figure.
A critical challenge in developing such applications is reducing response latency to enable real-time, natural interactions. Response latency refers to the time between the user finishing their speech and beginning to hear the AI assistant’s response. This delay typically comprises two primary components:
On-device processing latency – This encompasses the time required for local processing, including TTS and STT operations.
Time to first token (TTFT) – This measures the interval between the device sending a prompt to the cloud and receiving the first token of the response. TTFT consists of two components. First is the network latency, which is the round-trip time for data transmission between the device and the cloud. Second is the first token generation time, which is the period between the FM receiving a complete prompt and generating the first output token. TTFT is crucial for user experience in conversational AI interfaces that use response streaming with FMs. With response streaming, users start receiving the response while it’s still being generated, significantly improving perceived latency.
The ideal response latency for humanlike conversation flow is generally considered to be in the 200–500 milliseconds (ms) range, closely mimicking natural pauses in human conversation. Given the additional on-device processing latency, achieving this target requires a TTFT well below 200 ms.
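The latency decomposition above can be sketched as simple arithmetic. This is a minimal illustration, not a measurement: the function names are hypothetical, and every millisecond value in it is an assumed example figure rather than a result from this guide.

```python
# Latency budget sketch. All millisecond figures are illustrative assumptions.

def ttft_ms(network_rtt_ms, first_token_generation_ms):
    """TTFT = network round trip + time for the FM to produce its first token."""
    return network_rtt_ms + first_token_generation_ms


def response_latency_ms(on_device_ms, ttft):
    """Response latency = on-device processing (STT, TTS, and so on) + TTFT."""
    return on_device_ms + ttft


def max_network_rtt_ms(target_ms, on_device_ms, first_token_generation_ms):
    """Network budget that remains once the other components are subtracted."""
    return target_ms - on_device_ms - first_token_generation_ms


# Assumed example: 300 ms on device, 120 ms first-token generation, 40 ms RTT.
example_ttft = ttft_ms(network_rtt_ms=40, first_token_generation_ms=120)
print("TTFT:", example_ttft, "ms")
print("Response latency:", response_latency_ms(300, example_ttft), "ms")
print("RTT budget for a 500 ms target:", max_network_rtt_ms(500, 300, 120), "ms")
```

Under these assumed figures, the network round trip is the smallest line item in the budget, but it is also the one that grows with geographic distance.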
Although many customers focus on optimizing the technology stack behind the FM inference endpoint through techniques such as model optimization, hardware acceleration, and semantic caching to reduce the TTFT, they often overlook the significant impact of network latency. This latency can vary considerably due to geographic distance between users and cloud services, as well as the diverse quality of internet connectivity.
Hybrid architecture with AWS Local Zones
To minimize the impact of network latency on TTFT for users regardless of their locations, a hybrid architecture can be implemented by extending AWS services from commercial Regions to edge locations closer to end users. This approach involves deploying additional inference endpoints on AWS edge services and using Amazon Route 53 to implement dynamic routing policies, such as geolocation routing, geoproximity routing, or latency-based routing. These strategies dynamically distribute traffic between edge locations and commercial Regions, providing fast response times based on real-time network conditions and user locations.
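As a rough illustration of the decision that latency-based routing makes, the following toy sketch picks the endpoint with the lowest median round-trip time. It is not the Route 53 API: Route 53 makes this choice on the DNS side using its own latency data. The endpoint names and sample values here are hypothetical.

```python
from statistics import median


def pick_endpoint(rtt_samples_ms):
    """Return the name of the endpoint with the lowest median RTT.

    rtt_samples_ms maps an endpoint name to a list of RTT samples in ms.
    """
    if not rtt_samples_ms:
        raise ValueError("at least one endpoint is required")
    return min(rtt_samples_ms, key=lambda name: median(rtt_samples_ms[name]))


# Hypothetical samples as seen by a client near Los Angeles.
samples = {
    "us-west-2-region": [38.0, 41.5, 40.2],
    "los-angeles-local-zone": [6.1, 7.4, 5.9],
    "honolulu-local-zone": [62.3, 60.8, 65.0],
}
print(pick_endpoint(samples))
```

Using the median rather than the mean keeps a single slow sample from changing the choice.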
AWS Local Zones are a type of edge infrastructure deployment that places select AWS services close to large population and industry centers. They enable applications requiring very low latency or local data processing using familiar APIs and tool sets. Each Local Zone is a logical extension of a corresponding parent AWS Region, which means customers can extend their Amazon Virtual Private Cloud (Amazon VPC) connections by creating a new subnet with a Local Zone assignment.
This guide demonstrates how to deploy an open source FM from Hugging Face on Amazon Elastic Compute Cloud (Amazon EC2) instances across three locations: a commercial AWS Region and two AWS Local Zones. Through comparative benchmarking tests, we illustrate how deploying FMs in Local Zones closer to end users can significantly reduce latency—a critical factor for real-time applications such as conversational AI assistants.
Prerequisites
To run this demo, complete the following prerequisites:
Create an AWS account, if you don’t already have one.
Enable the Local Zones in Los Angeles and Honolulu in the parent Region US West (Oregon). For a full list of available Local Zones, refer to the Local Zones locations page. Next, create a subnet inside each Local Zone. Detailed instructions for enabling Local Zones and creating subnets within them can be found at Getting started with AWS Local Zones.
Submit an Amazon EC2 service quota increase for access to Amazon EC2 G4dn instances. Select Running On-Demand G and VT instances as the quota type and request at least 24 vCPUs for the quota size, which covers three g4dn.2xlarge instances (8 vCPUs each), one per location.
Create a Hugging Face read token from huggingface.co/settings/tokens.
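The Local Zone prerequisite can also be scripted. The following is a minimal sketch, assuming the boto3 EC2 client methods `modify_availability_zone_group` and `create_subnet`. The zone group and zone names for Los Angeles, the VPC ID, and the CIDR block are assumptions to verify against your own account. A small recording stand-in replaces the real client so the sketch runs without AWS credentials.

```python
class RecordingEC2:
    """Stand-in for a boto3 EC2 client; it only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def modify_availability_zone_group(self, **kwargs):
        self.calls.append(("modify_availability_zone_group", kwargs))
        return {"Return": True}

    def create_subnet(self, **kwargs):
        self.calls.append(("create_subnet", kwargs))
        return {"Subnet": {"SubnetId": "subnet-dry-run"}}


def enable_zone_and_create_subnet(ec2, zone_group, zone_name, vpc_id, cidr_block):
    """Opt in to a Local Zone group, then create a subnet in one of its zones."""
    ec2.modify_availability_zone_group(GroupName=zone_group, OptInStatus="opted-in")
    response = ec2.create_subnet(
        VpcId=vpc_id,
        CidrBlock=cidr_block,
        AvailabilityZone=zone_name,  # the Local Zone name, not the parent Region
    )
    return response["Subnet"]["SubnetId"]


# Dry run with assumed identifiers; all values below are placeholders.
ec2 = RecordingEC2()
subnet_id = enable_zone_and_create_subnet(
    ec2,
    zone_group="us-west-2-lax-1",   # assumed Los Angeles zone group name
    zone_name="us-west-2-lax-1a",   # assumed Los Angeles zone name
    vpc_id="vpc-0123456789abcdef0",
    cidr_block="10.0.128.0/24",
)
print(subnet_id, [name for name, _ in ec2.calls])
```

To run this against AWS, pass `boto3.client("ec2", region_name="us-west-2")` instead of the stand-in. The opt-in can take some time to complete, so a real script may need to wait before it creates the subnet.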
Solution walkthrough
This section walks you through the steps to launch an Amazon EC2 G4dn instance and deploy an FM for inference in the Los Angeles Local Zone. The instructions are also applicable for deployments in the parent Region, US West (Oregon), and the Honolulu Local Zone.
We use Meta’s open source Llama 3.2-3B as the FM for this demonstration. This is a lightweight FM from the Llama 3.2 family, classified as a small language model (SLM) due to its small number of parameters. Compared to large language models (LLMs), SLMs are more efficient and cost-effective to train and deploy, excel when fine-tuned for specific tasks, offer faster inference times, and have lower resource requirements. These characteristics make SLMs particularly well-suited for deployment on edge services such as AWS Local Zones.
To launch an EC2 instance in the Los Angeles Local Zone subnet, follow these steps:
On the Amazon EC2 console dashboard, in the Launch instance box, choose Launch instance.
Under Name and tags, enter a descriptive name for the instance (for example, la-local-zone-instance).
Under Application and OS Images (Amazon Machine Image), select an AWS Deep Learning AMI that comes preconfigured with the NVIDIA OSS driver and PyTorch. For our deployment, we used Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.3.1 (Amazon Linux 2).
Under Instance type, from the Instance type list, select a hardware configuration for your instance that’s supported in the Local Zone. We selected g4dn.2xlarge for this solution. This instance is equipped with one NVIDIA T4 Tensor Core GPU with 16 GB of GPU memory, which makes it well suited to high-performance, cost-effective inference of SLMs at the edge. Available instance types for each Local Zone can be found at AWS Local Zones features. Review the hardware requirements for your FM to select the appropriate instance.
Under Key pair (login), choose an existing key pair or create a new one.
Next to Network settings, choose Edit, and then:
Select your VPC.
Select your Local Zone subnet.
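Create a security group or select an existing one. Configure the security group’s inbound rules to allow traffic only from your client’s IP address on port 8080. If you plan to connect to the instance over SSH, as in the deployment steps later in this walkthrough, also allow port 22 from the same address.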
You can keep the default selections for the other configuration settings for your instance. To determine the storage types that are supported, refer to the Compute and storage section in AWS Local Zones features.
Review the summary of your instance configuration in the Summary panel and, when you’re ready, choose Launch instance.
A confirmation page lets you know that your instance is launching. Choose View all instances to close the confirmation page and return to the console.
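The inbound rule from the security group step can be expressed as data. The following is a minimal sketch that builds one entry in the `IpPermissions` shape accepted by the boto3 EC2 call `authorize_security_group_ingress`. The helper name and the example address (from the 203.0.113.0/24 documentation range) are hypothetical.

```python
import ipaddress


def single_client_ingress_rule(client_ip, port=8080):
    """Build an inbound TCP rule that admits exactly one IPv4 client on one port."""
    addr = ipaddress.ip_address(client_ip)  # raises ValueError on malformed input
    if addr.version != 4:
        raise ValueError("this sketch only handles IPv4 addresses")
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        # A /32 suffix restricts the rule to this single address.
        "IpRanges": [{"CidrIp": f"{addr}/32", "Description": "inference client"}],
    }


rule = single_client_ingress_rule("203.0.113.10")
print(rule)
```

With a real client, the rule would be applied with `ec2.authorize_security_group_ingress(GroupId=..., IpPermissions=[rule])`, where the group ID is your own.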
Next, complete the following steps to deploy Llama 3.2-3B using Hugging Face Text Generation Inference (TGI) as the model server:
Connect to the instance by using Secure Shell (SSH).
Start the Docker service using the following command. Docker comes preinstalled with the AMI we selected.
Run the following command to download and run the Docker image for the TGI server as well as the Llama 3.2-3B model. In our deployment, we used Docker image version 2.4.0, but results might vary based on your selected version. The full list of models supported by TGI can be found at Hugging Face Supported Models. For more details about the deployment and optimization of TGI, refer to the text-generation-inference GitHub page.
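For orientation, the following sketch assembles a `docker run` invocation in the form shown in the TGI README. It is not necessarily the exact command used for this deployment. The model ID, the host cache directory, and the choice of host port 8080 (to match the security group rule) are assumptions, and the Hugging Face token is expected in an `HF_TOKEN` environment variable on the instance.

```python
import shlex


def tgi_docker_command(model_id, image_tag="2.4.0", host_port=8080,
                       cache_dir="/home/ec2-user/tgi-data"):
    """Return the argv list for launching a TGI container with GPU access."""
    return [
        "docker", "run",
        "--gpus", "all",              # expose the instance GPU to the container
        "--shm-size", "1g",           # shared memory size used in the TGI README
        "-e", "HF_TOKEN",             # pass the host's HF_TOKEN variable through
        "-p", f"{host_port}:80",      # TGI listens on port 80 inside the container
        "-v", f"{cache_dir}:/data",   # keep downloaded weights across restarts
        f"ghcr.io/huggingface/text-generation-inference:{image_tag}",
        "--model-id", model_id,
    ]


argv = tgi_docker_command("meta-llama/Llama-3.2-3B")  # assumed Hugging Face model ID
print(shlex.join(argv))
```

Building the command as a list keeps quoting explicit. The printed line can be pasted into the SSH session, or the list can be passed to `subprocess.run`.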
After the TGI container is running, you can test your endpoint by running the following command from your local environment:
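One way to test the endpoint and observe TTFT from a client is sketched below. It assumes TGI's `/generate_stream` route, which streams server-sent events whose `data:` lines carry a JSON object with a `token` field. The helper names and the base URL are hypothetical. The measured time also includes connection setup, so it is an upper bound on TTFT as defined earlier.

```python
import json
import time
import urllib.request


def build_request(base_url, prompt, max_new_tokens=32):
    """Build a POST request for TGI's streaming generation route."""
    body = json.dumps(
        {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/generate_stream",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_token_text(lines):
    """Return the text of the first streamed token, or None if none arrives."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if line.startswith("data:"):
            event = json.loads(line[len("data:"):])
            return event["token"]["text"]
    return None


def measure_ttft(base_url, prompt, timeout_s=30):
    """Return (seconds until the first token arrived, first token text)."""
    request = build_request(base_url, prompt)
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        text = first_token_text(response)  # the response yields lines as bytes
    return time.perf_counter() - start, text


# Example call (requires a running endpoint; the address is a placeholder):
# print(measure_ttft("http://<instance-public-ip>:8080", "What is a Local Zone?"))
```

Repeating the measurement several times against each location and comparing medians gives a steadier picture than a single request.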