P-EAGLE: Faster LLM inference with Parallel Speculative Decoding in vLLM

EAGLE is the state-of-the-art method for speculative decoding in large language model (LLM) inference, but its autoregressive drafting creates a hidden bottleneck: the more tokens you speculate, the more sequential forward passes the drafter needs. Eventually that overhead eats into your gains. P-EAGLE removes this ceiling by generating all K draft tokens in a …
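The bottleneck the excerpt describes comes from the basic draft-then-verify loop of speculative decoding: an autoregressive drafter pays one forward pass per speculated token, and the target model then accepts only the prefix it agrees with. A toy sketch of that loop (all model functions here are hypothetical stand-ins, not the P-EAGLE or vLLM implementation) makes the sequential cost visible:

```python
# Toy sketch of speculative decoding's draft/verify loop.
# Illustrative only: the "models" below are hypothetical stand-ins,
# not EAGLE/P-EAGLE or anything from vLLM.

def draft_sequential(drafter, prefix, k):
    """Autoregressive drafting: k draft tokens cost k sequential drafter passes."""
    tokens = []
    for _ in range(k):
        tokens.append(drafter(prefix + tokens))  # one forward pass per token
    return tokens

def verify(target, prefix, draft):
    """The target model accepts the longest draft prefix it agrees with."""
    accepted = []
    for t in draft:
        if target(prefix + accepted) == t:
            accepted.append(t)
        else:
            break  # first disagreement ends acceptance
    return accepted

# Toy "models": next token is just a function of sequence length.
drafter = lambda seq: len(seq) % 5
target = lambda seq: len(seq) % 5 if len(seq) < 3 else (len(seq) + 1) % 5

draft = draft_sequential(drafter, [0], k=4)   # 4 sequential drafter passes
print(verify(target, [0], draft))             # only the agreed prefix survives
```

The point P-EAGLE targets is the `for` loop in `draft_sequential`: as k grows, drafting cost grows linearly in sequential passes, while a parallel drafter would produce all k candidates in one step.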

Improve operational visibility for inference workloads on Amazon Bedrock with new CloudWatch metrics for TTFT and Estimated Quota Consumption

As organizations scale their generative AI workloads on Amazon Bedrock, operational visibility into inference performance and resource consumption becomes critical. Teams running latency-sensitive applications must understand how quickly models begin generating responses. Teams managing high-throughput workloads must understand how their requests consume quota so they can avoid unexpected throttling. Until now, gaining this visibility required …

Secure AI agents with Policy in Amazon Bedrock AgentCore

Deploying AI agents safely in regulated industries is challenging. Without proper boundaries, agents that access sensitive data or execute transactions can pose significant security risks. Unlike traditional software, an AI agent chooses actions to achieve a goal by invoking tools, accessing data, and adapting its reasoning using data from its environment and users. This autonomy …

Multimodal embeddings at scale: AI data lake for media and entertainment workloads

This post shows you how to build a scalable multimodal video search system that enables natural language search across large video datasets using Amazon Nova models and Amazon OpenSearch Service. You will learn how to move beyond manual tagging and keyword-based searches to enable semantic search that captures the full richness of video content. We …
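The semantic search the excerpt describes rests on a simple mechanism: embed both the query and the indexed media as vectors, then rank by similarity rather than keyword overlap. A minimal sketch of that ranking step (the identifiers and vectors below are invented for illustration; the post itself uses Amazon Nova embeddings and Amazon OpenSearch Service, neither of which is reproduced here):

```python
# Minimal sketch of embedding-based semantic search over video clips.
# All embeddings and clip names below are hypothetical; a real system would
# compute vectors with an embedding model and query a vector store.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search(query_vec, index, top_k=2):
    """Rank stored (id, vector) pairs by similarity to the query vector."""
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [clip_id for clip_id, _ in scored[:top_k]]

# Hypothetical pre-computed embeddings for three video clips.
index = [
    ("sunset_clip", [0.9, 0.1, 0.0]),
    ("interview_clip", [0.1, 0.9, 0.2]),
    ("beach_clip", [0.8, 0.2, 0.1]),
]

print(search([1.0, 0.0, 0.0], index))  # clips nearest the query embedding
```

Because ranking depends only on vector geometry, the same query can match clips that share no keywords with it, which is what lets the system move beyond manual tags.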

Fine-tuning NVIDIA Nemotron Speech ASR on Amazon EC2 for domain adaptation

This post is a collaboration between AWS, NVIDIA, and Heidi. Automatic speech recognition (ASR), often called speech-to-text (STT), is becoming increasingly critical across industries like healthcare, customer service, and media production. While pre-trained models offer strong capabilities for general speech, fine-tuning for specific domains and use cases can enhance accuracy and performance. In this post, …
