Blog

Unlock cost savings with the new scale down to zero feature in SageMaker Inference

Unlock cost savings with the new scale down to zero feature in SageMaker Inference

Today at AWS re:Invent 2024, we are excited to announce a new feature for Amazon SageMaker inference endpoints: the ability to scale SageMaker inference endpoints to zero instances. This long-awaited capability is a game changer for our customers using the power of AI and machine learning (ML) inference in the cloud. Previously, SageMaker inference endpoints …

Unlock cost savings with the new scale down to zero feature in SageMaker Inference Read More »

Supercharge your auto scaling for generative AI inference – Introducing Container Caching in SageMaker Inference

Supercharge your auto scaling for generative AI inference – Introducing Container Caching in SageMaker Inference

Today at AWS re:Invent 2024, we are excited to announce the new Container Caching capability in Amazon SageMaker, which significantly reduces the time required to scale generative AI  models for inference. This innovation allows you to scale your models faster, observing up to 56% reduction in latency when scaling a new model copy and up …

Supercharge your auto scaling for generative AI inference – Introducing Container Caching in SageMaker Inference Read More »

Introducing Fast Model Loader in SageMaker Inference: Accelerate autoscaling for your Large Language Models (LLMs) – part 1

Introducing Fast Model Loader in SageMaker Inference: Accelerate autoscaling for your Large Language Models (LLMs) – part 1

The generative AI landscape has been rapidly evolving, with large language models (LLMs) at the forefront of this transformation. These models have grown exponentially in size and complexity, with some now containing hundreds of billions of parameters and requiring hundreds of gigabytes of memory. As LLMs continue to expand, AI engineers face increasing challenges in …

Introducing Fast Model Loader in SageMaker Inference: Accelerate autoscaling for your Large Language Models (LLMs) – part 1 Read More »

Introducing Fast Model Loader in SageMaker Inference: Accelerate autoscaling for your Large Language Models (LLMs) – Part 2

Introducing Fast Model Loader in SageMaker Inference: Accelerate autoscaling for your Large Language Models (LLMs) – Part 2

In Part 1 of this series, we introduced Amazon SageMaker Fast Model Loader, a new capability in Amazon SageMaker that significantly reduces the time required to deploy and scale large language models (LLMs) for inference. We discussed how this innovation addresses one of the major bottlenecks in LLM deployment: the time required to load massive models …

Introducing Fast Model Loader in SageMaker Inference: Accelerate autoscaling for your Large Language Models (LLMs) – Part 2 Read More »

Fast and accurate zero-shot forecasting with Chronos-Bolt and AutoGluon

Fast and accurate zero-shot forecasting with Chronos-Bolt and AutoGluon

Chronos-Bolt is the newest addition to AutoGluon-TimeSeries, delivering accurate zero-shot forecasting up to 250 times faster than the original Chronos models [1]. Time series forecasting plays a vital role in guiding key business decisions across industries such as retail, energy, finance, and healthcare. Traditionally, forecasting has relied on statistical models [2] like ETS and ARIMA, …

Fast and accurate zero-shot forecasting with Chronos-Bolt and AutoGluon Read More »

How Amazon Finance Automation built a generative AI Q&A chat assistant using Amazon Bedrock

How Amazon Finance Automation built a generative AI Q&A chat assistant using Amazon Bedrock

Today, the Accounts Payable (AP) and Accounts Receivable (AR) analysts in Amazon Finance operations receive queries from customers through email, cases, internal tools, or phone. When a query arises, analysts must engage in a time-consuming process of reaching out to subject matter experts (SMEs) and go through multiple policy documents containing standard operating procedures (SOPs) …

How Amazon Finance Automation built a generative AI Q&A chat assistant using Amazon Bedrock Read More »

Cohere Rerank 3.5 is now available in Amazon Bedrock through Rerank API

Cohere Rerank 3.5 is now available in Amazon Bedrock through Rerank API

We are excited to announce the availability of Cohere’s advanced reranking model Rerank 3.5 through our new Rerank API in Amazon Bedrock. This powerful reranking model enables AWS customers to significantly improve their search relevance and content ranking capabilities. This model is also available for Amazon Bedrock Knowledge Base users. By incorporating Cohere’s Rerank 3.5 …

Cohere Rerank 3.5 is now available in Amazon Bedrock through Rerank API Read More »

Easily deploy and manage hundreds of LoRA adapters with SageMaker efficient multi-adapter inference

Easily deploy and manage hundreds of LoRA adapters with SageMaker efficient multi-adapter inference

The new efficient multi-adapter inference feature of Amazon SageMaker unlocks exciting possibilities for customers using fine-tuned models. This capability integrates with SageMaker inference components to allow you to deploy and manage hundreds of fine-tuned Low-Rank Adaptation (LoRA) adapters through SageMaker APIs. Multi-adapter inference handles the registration of fine-tuned adapters with a base model and dynamically …

Easily deploy and manage hundreds of LoRA adapters with SageMaker efficient multi-adapter inference Read More »

Scroll to Top