Blog

Governance by design: The essential guide for successful AI scaling

Picture this: Your enterprise has just deployed its first generative AI application. The initial results are promising, but as you plan to scale across departments, critical questions emerge. How will you enforce consistent security, prevent model bias, and maintain control as AI applications multiply? It turns out you’re not alone. A McKinsey survey spanning 750+ …

How Tata Power CoE built a scalable AI-powered solar panel inspection solution with Amazon SageMaker AI and Amazon Bedrock

This post is co-written with Vikram Bansal from Tata Power, and Gaurav Kankaria and Omkar Dhavalikar from Oneture. The global adoption of solar energy is rapidly increasing as organizations and individuals transition to renewable energy sources. India is on the brink of a solar energy revolution, with a national goal to empower 10 million households with …

Unlocking video understanding with TwelveLabs Marengo on Amazon Bedrock

Media and entertainment, advertising, education, and enterprise training content combines visual, audio, and motion elements to tell stories and convey information, making it far more complex than text, where individual words have clear meanings. This creates unique challenges for AI systems that need to understand video. Video content is multidimensional, combining visual elements (scenes, …

Checkpointless training on Amazon SageMaker HyperPod: Production-scale training with faster fault recovery

Foundation model training has reached an inflection point where traditional checkpoint-based recovery methods are becoming a bottleneck to efficiency and cost-effectiveness. As models grow to trillions of parameters and training clusters expand to thousands of AI accelerators, even minor disruptions can result in significant costs and delays. In this post, we introduce checkpointless training on …

Adaptive infrastructure for foundation model training with elastic training on SageMaker HyperPod

Modern AI infrastructure serves multiple concurrent workloads on the same cluster, from foundation model (FM) pre-training and fine-tuning to production inference and evaluation. In this shared environment, the demand for AI accelerators fluctuates continuously as inference workloads scale with traffic patterns, and experiments complete and release resources. Despite this dynamic availability of AI accelerators, traditional …

Customize agent workflows with advanced orchestration techniques using Strands Agents

Large Language Model (LLM) agents have revolutionized how we approach complex, multi-step tasks by combining the reasoning capabilities of foundation models with specialized tools and domain expertise. While single-agent systems using frameworks like ReAct work well for straightforward tasks, real-world challenges often require multiple specialized agents working in coordination. Think about planning a business trip: …
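The coordination pattern this post describes can be pictured as an orchestrator that delegates sub-tasks (flights, hotels, and so on) to specialized agents and then merges their results. The following is a minimal, framework-agnostic sketch of that pattern in plain Python; the names (SpecialistAgent, Orchestrator, call_llm) are illustrative placeholders and do not reflect the Strands Agents API, which the full post covers.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a foundation model call via your LLM client of choice."""
    return f"[model response to: {prompt}]"

class SpecialistAgent:
    """A single-purpose agent: one role, one narrow responsibility."""
    def __init__(self, role: str):
        self.role = role

    def run(self, task: str) -> str:
        return call_llm(f"You are a {self.role}. Handle this task: {task}")

class Orchestrator:
    """Routes a request to specialist agents and assembles a final answer."""
    def __init__(self, specialists: dict):
        self.specialists = specialists

    def run(self, request: str) -> str:
        # A real orchestrator would let the LLM decide routing; here it is hard-coded.
        results = {name: agent.run(request) for name, agent in self.specialists.items()}
        summary = "\n".join(f"{name}: {out}" for name, out in results.items())
        return call_llm(f"Combine these partial results into one itinerary:\n{summary}")

planner = Orchestrator({
    "flights": SpecialistAgent("flight booking assistant"),
    "hotels": SpecialistAgent("hotel booking assistant"),
})
print(planner.run("Plan a two-day business trip to Seattle"))
```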

Operationalize generative AI workloads and scale to hundreds of use cases with Amazon Bedrock – Part 1: GenAIOps

Enterprise organizations are rapidly moving beyond generative AI experiments to production deployments and complex agentic AI solutions, facing new challenges in scaling, security, governance, and operational efficiency. This blog post series introduces generative AI operations (GenAIOps), the application of DevOps principles to generative AI solutions, and demonstrates how to implement it for applications powered by …

Applying data loading best practices for ML training with Amazon S3 clients

Amazon Simple Storage Service (Amazon S3) is a highly elastic service that automatically scales with application demand, offering the high throughput performance required for modern ML workloads. High-performance client connectors such as the Amazon S3 Connector for PyTorch and Mountpoint for Amazon S3 provide native S3 integration in training pipelines without dealing directly with the …
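For a sense of what that native integration looks like, here is a minimal sketch using the Amazon S3 Connector for PyTorch, assuming the s3torchconnector package is installed; the bucket, prefix, and region below are placeholders, and the full post covers the tuning guidance this example omits.

```python
from s3torchconnector import S3IterableDataset

# Stream training objects directly from S3 instead of staging them locally.
dataset = S3IterableDataset.from_prefix(
    "s3://example-training-bucket/train/",  # placeholder bucket and prefix
    region="us-east-1",                     # placeholder region
)

# S3IterableDataset is a PyTorch IterableDataset, so it also plugs into a
# torch.utils.data.DataLoader; here we simply iterate it directly.
for s3_object in dataset:
    key = s3_object.key         # the object's S3 key
    payload = s3_object.read()  # the object's bytes, read on demand
    # ... decode payload and feed it to the training loop
```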
