Reducing hallucinations in LLM agents with a verified semantic cache using Amazon Bedrock Knowledge Bases

Large language models (LLMs) excel at generating human-like text but face a critical challenge: hallucination—producing responses that sound convincing but are factually incorrect. While these models are trained on vast amounts of generic data, they often lack the organization-specific context and up-to-date information needed for accurate responses in business settings. Retrieval Augmented Generation (RAG) techniques help address this by grounding LLMs in relevant data during inference, but these models can still generate non-deterministic outputs and occasionally fabricate information even when given accurate source material. For organizations deploying LLMs in production applications—particularly in critical domains such as healthcare, finance, or legal services—these residual hallucinations pose serious risks, potentially leading to misinformation, liability issues, and loss of user trust.

To address these challenges, we introduce a practical solution that combines the flexibility of LLMs with the reliability of curated, verified answers. Our solution uses two key Amazon Bedrock services: Amazon Bedrock Knowledge Bases, a fully managed service that you can use to store, search, and retrieve organization-specific information for use with LLMs; and Amazon Bedrock Agents, a fully managed service that you can use to build, test, and deploy AI assistants that can understand user requests, break them down into steps, and execute actions. Similar to how a customer service team maintains a bank of carefully crafted answers to frequently asked questions (FAQs), our solution first checks whether a user's question matches curated and verified responses before letting the LLM generate a new answer. This approach helps prevent hallucinations by using trusted information whenever possible, while still allowing the LLM to handle new or unique questions. By implementing this technique, organizations can improve response accuracy, reduce response times, and lower costs. Whether you're new to AI development or an experienced practitioner, this post provides step-by-step guidance and code examples to help you build more reliable AI applications.

Solution overview

Our solution implements a verified semantic cache using the Amazon Bedrock Knowledge Bases Retrieve API to reduce hallucinations in LLM responses while simultaneously improving latency and reducing costs. This read-only semantic cache acts as an intelligent intermediary layer between the user and Amazon Bedrock Agents, storing curated and verified question-answer pairs.

When a user submits a query, the solution first evaluates its semantic similarity with existing verified questions in the knowledge base. For highly similar queries (greater than 80% match), the solution bypasses the LLM completely and returns the curated and verified answer directly. When partial matches (60–80% similarity) are found, the solution uses the verified answers as few-shot examples to guide the LLM’s response, significantly improving accuracy and consistency. For queries with low similarity (less than 60%) or no match, the solution falls back to standard LLM processing, making sure that user questions receive appropriate responses.

This approach offers several key benefits:

Reduced costs: By minimizing unnecessary LLM invocations for frequently answered questions, the solution significantly reduces operational costs at scale.
Improved accuracy: Curated and verified answers minimize the possibility of hallucinations for known user queries, while few-shot prompting enhances accuracy for similar questions.
Lower latency: Direct retrieval of cached answers provides near-instantaneous responses for known queries, improving the overall user experience.

The semantic cache serves as a growing repository of trusted responses, continuously improving the solution’s reliability while maintaining efficiency in handling user queries.

Solution architecture

The solution architecture shown in the preceding figure consists of the following components and workflow. Assume that the question "What date will AWS re:Invent 2024 occur?" is already stored in the verified semantic cache, with the corresponding verified answer "AWS re:Invent 2024 takes place on December 2–6, 2024." Let's walk through how the solution handles a user's question.

1. Query processing:

a. User submits a question “When is re:Invent happening this year?”, which is received by the Invoke Agent function.

b. The function checks the semantic cache (Amazon Bedrock Knowledge Bases) using the Retrieve API (a code sketch of this lookup follows the walkthrough).

c. Amazon Bedrock Knowledge Bases performs a semantic search and finds a similar question with an 85% similarity score.

2. Response paths: (Based on the 85% similarity score in step 1.c, our solution follows the strong match path)

a. Strong match (similarity score greater than 80%):

i. The Invoke Agent function returns the verified answer "AWS re:Invent 2024 takes place on December 2–6, 2024" directly from the Amazon Bedrock knowledge base, providing a deterministic response.

ii. No LLM invocation is needed, and the response is returned in less than 1 second.

b. Partial match (similarity score 60–80%):

i. The Invoke Agent function invokes the Amazon Bedrock agent and provides the cached answer as a few-shot example for the agent through Amazon Bedrock Agents promptSessionAttributes.

ii. If the question was “What’s the schedule for AWS events in December?”, our solution would provide the verified re:Invent dates to guide the Amazon Bedrock agent’s response with additional context.

iii. Providing the Amazon Bedrock agent with a curated and verified example might help increase accuracy.

c. No match (similarity score less than 60%):

i. If the user’s question isn’t similar to any of the curated and verified questions in the cache, the Invoke Agent function invokes the Amazon Bedrock agent without providing it any additional context from cache.

ii. For example, if the question was “What hotels are near re:Invent?”, our solution would invoke the Amazon Bedrock agent directly, and the agent would use the tools at its disposal to formulate a response.

3. Offline knowledge management:

a. Verified question-answer pairs are stored in an Amazon Simple Storage Service (Amazon S3) bucket and must be reviewed and updated periodically to make sure that the cache contains the most recent and accurate information.

b. The S3 bucket is periodically synchronized with the Amazon Bedrock knowledge base. This offline batch process makes sure that the semantic cache remains up-to-date without impacting real-time operations.
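
The cache lookup in steps 1b and 1c maps to a single call to the Amazon Bedrock Knowledge Bases Retrieve API. The following minimal sketch shows one way to express it with boto3; the helper name, the cache_kb_id variable, and the single-result configuration are illustrative assumptions rather than code from the repository, and later steps reuse this helper.

import boto3

# Runtime client used to query Amazon Bedrock knowledge bases
bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

def lookup_semantic_cache(question, cache_kb_id):
    """Return the top cached entry (content, metadata, and score), or None if the cache is empty."""
    response = bedrock_agent_runtime.retrieve(
        knowledgeBaseId=cache_kb_id,
        retrievalQuery={"text": question},
        retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 1}},
    )
    results = response.get("retrievalResults", [])
    # Each result carries a relevance score; the 0.8 and 0.6 thresholds in this post are applied to it
    return results[0] if results else None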

Solution walkthrough

You need to meet the following prerequisites for the walkthrough:

An AWS account
Model access to Anthropic's Claude Sonnet V1 and Amazon Titan Text Embeddings V2
AWS Command Line Interface (AWS CLI) installed and configured with the appropriate credentials

Once you have the prerequisites in place, use the following steps to set up the solution in your AWS account.

Step 0: Set up the necessary infrastructure

Follow the “Getting started” instructions in the README of the Git repository to set up the infrastructure for this solution. All the following code samples are extracted from the Jupyter notebook in this repository.

Step 1: Set up two Amazon Bedrock knowledge bases

This step creates two Amazon Bedrock knowledge bases. The agent knowledge base stores Amazon Bedrock service documentation, while the cache knowledge base contains curated and verified question-answer pairs. This setup uses the AWS SDK for Python (Boto3) to interact with AWS services.

agent_knowledge_base = BedrockKnowledgeBase(
    kb_name=agent_knowledge_base_name,
    kb_description="Knowledge base used by Bedrock Agent",
    data_bucket_name=agent_bucket_name,
    chunking_strategy="FIXED_SIZE",
    suffix=f'{agent_unique_id}-f'
)

cache_knowledge_base = BedrockKnowledgeBase(
    kb_name=cache_knowledge_base_name,
    kb_description="Verified cache for Bedrock Agent System",
    data_bucket_name=cache_bucket_name,
    chunking_strategy="NONE",  # We do not want to chunk our question-answer pairs
    suffix=f'{cache_unique_id}-f'
)

This establishes the foundation for your semantic caching solution, setting up the AWS resources to store the agent’s knowledge and verified cache entries.

Step 2: Populate the agent knowledge base and associate it with an Amazon Bedrock agent

For this walkthrough, you build an Amazon Bedrock agent specialized in answering questions about Amazon Bedrock. You ingest the Amazon Bedrock User Guide PDF into the agent knowledge base as the primary dataset. After ingesting the data, you create the agent with specific instructions:

agent_instruction = """You are the Amazon Bedrock Agent. You have access to a
knowledge base with information about the Amazon Bedrock service on AWS.
Use it to answer questions."""

agent_id = agents_handler.create_agent(
    agent_name,
    agent_description,
    agent_instruction,
    [agent_foundation_model],
    kb_arns=[agent_kb_arn]  # Associate agent with our Agent knowledge base
)
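
The repository's helper classes take care of uploading the User Guide PDF and syncing it into the agent knowledge base. If you want to perform that ingestion step directly with boto3, a minimal sketch might look like the following; the file name, knowledge base ID, and data source ID variables are placeholders, not values from the repository.

import boto3

s3 = boto3.client("s3")
bedrock_agent = boto3.client("bedrock-agent")

# Upload the documentation PDF to the agent's data bucket
s3.upload_file("bedrock-user-guide.pdf", agent_bucket_name, "bedrock-user-guide.pdf")

# Start an ingestion job so the knowledge base chunks, embeds, and indexes the new document
ingestion_job = bedrock_agent.start_ingestion_job(
    knowledgeBaseId=agent_kb_id,
    dataSourceId=agent_data_source_id,
)
print(ingestion_job["ingestionJob"]["status"])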

This setup enables the Amazon Bedrock agent to use the ingested knowledge to provide responses about Amazon Bedrock services. To test it, you can ask a question that isn’t present in the agent’s knowledge base, making the LLM either refuse to answer or hallucinate.

invoke_agent("What are the dates for reinvent 2024?", session_id="test")
# Response: Unfortunately, the dates for the AWS re:Invent 2024 conference have not
# been announced yet by Amazon. The re:Invent conference is typically held in late
# November or early December each year, but the specific dates for 2024 are not
# available at this time. AWS usually announces the dates for their upcoming
# re:Invent event around 6-9 months in advance.

Step 3: Create a cache dataset with known question-answer pairs and populate the cache knowledge base

In this step, you create a raw dataset of verified question-answer pairs that aren’t present in the agent knowledge base. These curated and verified answers serve as our semantic cache to prevent hallucinations on known topics. Good candidates for inclusion in this cache are:

Frequently asked questions (FAQs): Common queries that users often ask, which can be answered consistently and accurately.
Critical questions requiring deterministic answers: Topics where precision is crucial, such as pricing information, service limits, or compliance details.
Time-sensitive information: Recent updates, announcements, or temporary changes that might not be reflected in the main RAG knowledge base.

By carefully curating this cache with high-quality, verified answers to such questions, you can significantly improve the accuracy and reliability of your solution’s responses. For this walkthrough, use the following example pairs for the cache:

Q: 'What are the dates for reinvent 2024?'
A: 'The AWS re:Invent conference was held from December 2-6 in 2024.'

Q: 'What was the biggest new feature announcement for Bedrock Agents during reinvent 2024?'
A: 'During re:Invent 2024, one of the headline new feature announcements for Bedrock Agents was the custom orchestrator. This key feature allows users to implement their own orchestration strategies through AWS Lambda functions, providing granular control over task planning, completion, and verification while enabling real-time adjustments and reusability across multiple agents.'

You then format these pairs as individual text files with corresponding metadata JSON files, upload them to an S3 bucket, and ingest them into your cache knowledge base. This process makes sure that your semantic cache is populated with accurate, curated, and verified information that can be quickly retrieved to answer user queries or guide the agent’s responses.
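
A sketch of that upload step is shown below. The layout is one reasonable choice, not necessarily the one used in the repository: the question is stored as the document body so that semantic search matches on questions only, and the verified answer travels in a .metadata.json sidecar (the metadata file convention that Amazon Bedrock Knowledge Bases supports for Amazon S3 data sources). The file naming and the attribute name are assumptions.

import json
import boto3

s3 = boto3.client("s3")

qa_pairs = [
    {
        "question": "What are the dates for reinvent 2024?",
        "answer": "The AWS re:Invent conference was held from December 2-6 in 2024.",
    },
    # ... remaining verified pairs
]

for i, pair in enumerate(qa_pairs):
    key = f"qa_{i}.txt"
    # Document body holds only the question, so similarity is computed question-to-question
    s3.put_object(Bucket=cache_bucket_name, Key=key, Body=pair["question"])
    # The verified answer rides along as metadata and is returned by the Retrieve API
    metadata = {"metadataAttributes": {"answer": pair["answer"]}}
    s3.put_object(Bucket=cache_bucket_name, Key=f"{key}.metadata.json", Body=json.dumps(metadata))

After uploading, the cache knowledge base is synchronized with the same start_ingestion_job call shown in the earlier sketch.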

Step 4: Implement the verified semantic cache logic

In this step, you implement the core logic of the verified semantic cache solution: a function that integrates the semantic cache with your Amazon Bedrock agent to provide accurate and consistent responses. The function:

Queries the cache knowledge base for entries similar to the user question.
If a high similarity match is found (greater than 80%), it returns the cached answer directly.
For partial matches (60–80%), it uses the cached answer as a few-shot example for the agent.
For low similarity (less than 60%), it falls back to standard agent processing.

This simplified logic forms the core of the semantic caching solution, efficiently using curated and verified information to improve response accuracy and reduce unnecessary LLM invocations.
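
A condensed sketch of that routing function is shown below. It reuses the lookup_semantic_cache helper and the answer-in-metadata layout sketched earlier, and it assumes hypothetical agent_alias_id and cache_kb_id variables; the notebook in the repository contains the full implementation, which may differ in its details.

STRONG_MATCH_THRESHOLD = 0.8
PARTIAL_MATCH_THRESHOLD = 0.6

def invoke_agent_with_verified_cache(question, session_id="demo"):
    entry = lookup_semantic_cache(question, cache_kb_id)
    score = entry.get("score", 0.0) if entry else 0.0
    verified_answer = entry.get("metadata", {}).get("answer") if entry else None

    # Strong match: return the curated and verified answer directly, with no LLM invocation
    if verified_answer and score > STRONG_MATCH_THRESHOLD:
        return verified_answer

    # Partial match: pass the verified answer as a few-shot example through promptSessionAttributes
    session_state = {}
    if verified_answer and score > PARTIAL_MATCH_THRESHOLD:
        session_state["promptSessionAttributes"] = {"verifiedExample": verified_answer}

    # No match (or partial match): invoke the Amazon Bedrock agent and stream back its answer
    response = bedrock_agent_runtime.invoke_agent(
        agentId=agent_id,
        agentAliasId=agent_alias_id,
        sessionId=session_id,
        inputText=question,
        sessionState=session_state,
    )
    return "".join(
        event["chunk"]["bytes"].decode("utf-8")
        for event in response["completion"]
        if "chunk" in event
    )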

Step 5: Evaluate results and performance

This step demonstrates the effectiveness of the verified semantic cache solution by testing it with different scenarios and comparing the results and latency. You’ll use three test cases to showcase the solution’s behavior:

Strong semantic match (greater than 80% similarity)
Partial semantic match (60–80% similarity)
No semantic match (less than 60% similarity)

Here are the results:

Strong semantic match (greater than 80% similarity) provides the exact curated and verified answer in less than 1 second.

%%time
invoke_agent_with_verified_cache("What were some new features announced for Bedrock Agents during reinvent 2024?")

# Output:
# Cache semantic similarity log: Strong match with score 0.9176399
# CPU times: user 20.7 ms, sys: 442 μs, total: 21.1 ms
# Wall time: 440 ms

# During re:Invent 2024, one of the headline new feature announcements for Bedrock
# Agents was the custom orchestrator. This key feature allows users to implement
# their own orchestration strategies through AWS Lambda functions, providing
# granular control over task planning, completion, and verification while enabling
# real-time adjustments and reusability across multiple agents.

Partial semantic match (60–80% similarity) passes the verified answer to the LLM during the invocation. The Amazon Bedrock agent answers the question correctly using the cached answer even though the information is not present in the agent knowledge base.

%%time
invoke_agent_with_verified_cache("What are the newest features for Bedrock Agents?")

# Output:
# Cache semantic similarity log: Partial match with score 0.6443664
# CPU times: user 10.4 ms, sys: 0 ns, total: 10.4 ms
# Wall time: 12.8 s

# One of the newest and most significant features for Amazon Bedrock Agents
# announced during re:Invent 2024 was the custom orchestrator. This feature
# allows users to implement their own orchestration strategies through AWS
# Lambda functions, providing granular control over task planning, completion,
# and verification. It enables real-time adjustments and reusability across
# multiple agents, enhancing the flexibility and power of Bedrock Agents.

No semantic match (less than 60% similarity) invokes the Amazon Bedrock agent as usual. For this query, the LLM will either refuse to provide the information because it’s not present in the agent’s knowledge base, or will hallucinate and provide a response that is plausible but incorrect.

%%time
invoke_agent_with_verified_cache("Tell me about a new feature for Amazon Bedrock Agents")

# Output:
# Cache semantic similarity log: No match with score 0.532105
# CPU times: user 22.3 ms, sys: 579 μs, total: 22.9 ms
# Wall time: 13.6 s

# Amazon Bedrock is a service that provides secure and scalable compute capacity
# for running applications on AWS. As for new features for the Bedrock Agents
# component, I do not have any specific information on recent or upcoming new
# features. However, AWS services are frequently updated with new capabilities,
# so it’s possible there could be new agent features released in the future to
# enhance security, scalability, or integration with other AWS services. Without
# being able to consult the Knowledge Base, I cannot provide details on any
# particular new Bedrock Agent features at this time.

These results demonstrate the effectiveness of the semantic caching solution:

Strong matches provide near-instant, accurate, and deterministic responses without invoking an LLM.
Partial matches guide the LLM agent to provide a more relevant or accurate answer.
No matches fall back to standard LLM agent processing, maintaining flexibility.

The semantic cache significantly reduces latency for known questions and improves accuracy for similar queries, while still allowing the agent to handle unique questions when necessary.

Step 6: Resource cleanup

To avoid incurring unnecessary costs, make sure that the Amazon Bedrock knowledge bases that you created, along with the underlying Amazon OpenSearch Serverless collections, are deleted.
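
The repository provides cleanup helpers, but if you prefer to delete the core resources directly with boto3, a minimal sketch follows; the ID variables are placeholders, and deleting the associated data sources, agents, IAM roles, and S3 buckets is omitted for brevity.

import boto3

bedrock_agent = boto3.client("bedrock-agent")
aoss = boto3.client("opensearchserverless")

# Delete both knowledge bases created for this walkthrough
for kb_id in [agent_kb_id, cache_kb_id]:
    bedrock_agent.delete_knowledge_base(knowledgeBaseId=kb_id)

# Delete the underlying OpenSearch Serverless vector collections, which incur cost while they exist
for collection_id in [agent_collection_id, cache_collection_id]:
    aoss.delete_collection(id=collection_id)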

Production readiness considerations

Before deploying this solution in production, address these key considerations:

Similarity threshold optimization: Experiment with different thresholds to balance cache hit rates and accuracy (a simple evaluation sketch follows this list). This directly impacts the solution's effectiveness in preventing hallucinations while maintaining relevance.
Feedback loop implementation: Create a mechanism to continuously update the verified cache with new, accurate responses. This helps prevent cache staleness and maintains the solution’s integrity as a source of truth for the LLM.
Cache management and update strategy: Regularly refresh the semantic cache with current, frequently asked questions to maintain relevance and improve hit rates. Implement a systematic process for reviewing, validating, and incorporating new entries to help ensure cache quality and alignment with evolving user needs.
Ongoing tuning: Adjust similarity thresholds as your dataset evolves. Treat the semantic cache as a dynamic component, requiring continuous optimization for your specific use case.
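
For threshold tuning specifically, one lightweight approach is to replay a small labeled set of questions against the cache and see how each candidate threshold trades cache hits against correctness. The sketch below is illustrative only: the test set, the exact-match correctness check, and the reuse of lookup_semantic_cache are all assumptions.

# Hypothetical labeled test set: (question, expected verified answer, or None if the cache should not answer)
test_set = [
    ("When is re:Invent happening this year?", "The AWS re:Invent conference was held from December 2-6 in 2024."),
    ("What hotels are near re:Invent?", None),
]

for threshold in [0.6, 0.7, 0.8, 0.9]:
    hits, correct = 0, 0
    for question, expected in test_set:
        entry = lookup_semantic_cache(question, cache_kb_id)
        score = entry.get("score", 0.0) if entry else 0.0
        if score > threshold:
            hits += 1
            answer = entry.get("metadata", {}).get("answer") if entry else None
            if expected is not None and answer == expected:
                correct += 1
    print(f"threshold={threshold}: cache hits={hits}, correct hits={correct}")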

Conclusion

This verified semantic cache approach offers a powerful solution to reduce hallucinations in LLM responses while improving latency and reducing costs. By using Amazon Bedrock Knowledge Bases, you can implement a solution that can efficiently serve curated and verified answers, guide LLM responses with few-shot examples, and gracefully fall back to full LLM processing when needed.

About the Authors

Dheer Toprani is a System Development Engineer within the Amazon Worldwide Returns and ReCommerce Data Services team. He specializes in large language models, cloud infrastructure, and scalable data systems, focusing on building intelligent solutions that enhance automation and data accessibility across Amazon’s operations. Previously, he was a Data & Machine Learning Engineer at AWS, where he worked closely with customers to develop enterprise-scale data infrastructure, including data lakes, analytics dashboards, and ETL pipelines.

Chaithanya Maisagoni is a Senior Software Development Engineer (AI/ML) in Amazon’s Worldwide Returns and ReCommerce organization. He specializes in building scalable machine learning infrastructure, distributed systems, and containerization technologies. His expertise lies in developing robust solutions that enhance monitoring, streamline inference processes, and strengthen audit capabilities to support and optimize Amazon’s global operations.

Rajesh Nedunuri is a Senior Data Engineer within the Amazon Worldwide Returns and ReCommerce Data Services team. He specializes in designing, building, and optimizing large-scale data solutions. At Amazon, he plays a key role in developing scalable data pipelines, improving data quality, and enabling actionable insights for reverse logistics and ReCommerce operations. He is deeply passionate about generative AI and consistently seeks opportunities to implement AI into solving complex customer challenges.

Karam Muppidi is a Senior Engineering Manager at Amazon Retail, where he leads data engineering, infrastructure and analytics for the Worldwide Returns and ReCommerce organization. He has extensive experience developing enterprise-scale data architectures and governance strategies using both proprietary and native AWS platforms, as well as third-party tools. Previously, Karam developed big-data analytics applications and SOX compliance solutions for Amazon’s Fintech and Merchant Technologies divisions.
