Ground truth generation and review best practices for evaluating generative AI question-answering with FMEval

Generative AI question-answering applications are pushing the boundaries of enterprise productivity. These assistants can be powered by various backend architectures including Retrieval Augmented Generation (RAG), agentic workflows, fine-tuned large language models (LLMs), or a combination of these techniques. However, building and deploying trustworthy AI assistants requires a robust ground truth and evaluation framework.

Ground truth data in AI refers to data that is known to be factual, representing the expected use case outcome for the system being modeled. By providing an expected outcome to measure against, ground truth data unlocks the ability to deterministically evaluate system quality. Running deterministic evaluation of generative AI assistants against use case ground truth data enables the creation of custom benchmarks. These benchmarks are essential for tracking performance drift over time and for statistically comparing multiple assistants in accomplishing the same task. Additionally, they enable quantifying performance changes as a function of enhancements to the underlying assistant, all within a controlled setting. With deterministic evaluation processes such as the Factual Knowledge and QA Accuracy metrics of FMEval, ground truth generation and evaluation metric implementation are tightly coupled. To ensure the highest quality measurement of your question answering application against ground truth, the evaluation metric’s implementation must inform ground truth curation.

In this post, we discuss best practices for applying LLMs to generate ground truth for evaluating question-answering assistants with FMEval on an enterprise scale. FMEval is a comprehensive evaluation suite from Amazon SageMaker Clarify that provides standardized implementations of metrics to assess quality and responsibility. To learn more about FMEval, see Evaluate large language models for quality and responsibility. Additionally, see the Generative AI Security Scoping Matrix for guidance on moderating confidential and personally identifiable information (PII) as part of your generative AI solution.

By following these guidelines, data teams can implement high fidelity ground truth generation for question-answering use case evaluation with FMEval. For ground truth curation best practices for question answering evaluations with FMEval that you can use to design FMEval ground truth prompt templates, see Ground truth curation and metric interpretation best practices for evaluating generative AI question answering using FMEval.

Generating ground truth for FMEval question-answering evaluation

One option to get started with ground truth generation is human curation of a small question-answer dataset. The human-curated dataset should be small (based on available bandwidth), high in signal, and ideally prepared by use case subject matter experts (SMEs). The exercise of generating this dataset forces a data alignment exercise early in the evaluation process, raising important questions and conversations among use case stakeholders about which questions are important to measure over time for the business. The outcomes of this exercise are threefold:

Stakeholder alignment on the top N important questions
Stakeholder awareness of the evaluation process
A high-fidelity starter ground truth dataset for the first proof of concept evaluation

While an SME ground truth curation exercise is a strong start, at the scale of an enterprise knowledge base, pure SME generation of ground truth will become prohibitively time and resource intensive. To scale ground truth generation and curation, you can apply a risk-based approach in conjunction with a prompt-based strategy using LLMs. It’s important to note that LLM-generated ground truth isn’t a substitute for use case SME involvement. For example, if ground truth is generated by LLMs before the involvement of SMEs, SMEs will still be needed to identify which questions are fundamental to the business and then align the ground truth with business value as part of a human-in-the-loop process.

To demonstrate, we provide a step-by-step walkthrough using Amazon’s 2023 letter to shareholders as source data.

In keeping with ground truth curation best practices for FMEval question-answering, ground truth is curated as question-answer-fact triplets. The question and answer are curated to suit the ideal question-answering assistant response in terms of content, length, and style. The fact is a minimal representation of the ground truth answer, comprising one or more subject entities of the question.

For example, consider how the following source document chunk from the Amazon 2023 letter to shareholders can be converted to question-answering ground truth.

Dear Shareholders:

Last year at this time, I shared my enthusiasm and optimism for Amazon’s future. Today, I have even more. The reasons are many, but start with the progress we’ve made in our financial results and customer experiences, and extend to our continued innovation and the remarkable opportunities in front of us. In 2023, Amazon’s total revenue grew 12% year-over-year (“YoY”) from $514B to $575B. By segment, North America revenue increased 12% YoY from $316B to $353B, International revenue grew 11% YoY from $118B to $131B, and AWS revenue increased 13% YoY from $80B to $91B. Further, Amazon’s operating income and Free Cash Flow (“FCF”) dramatically improved. Operating income in 2023 improved 201% YoY from $12.2B (an operating margin of 2.4%) to $36.9B (an operating margin of 6.4%).

To convert the source document excerpt into ground truth, we provide a base LLM prompt template. In the template, we instruct the LLM to take a fact-based approach to interpreting the chunk using chain-of-thought logic. For our example, we work with Anthropic’s Claude LLM on Amazon Bedrock. The template is compatible with, and can be modified for, other LLMs, such as LLMs hosted on Amazon SageMaker JumpStart or self-hosted on AWS infrastructure. To modify the prompt for use by other LLMs, a different approach to denoting prompt sections than XML tags might be required. For example, Meta Llama models apply tags such as <s> [INST] and <<SYS>>. For more information, see the Amazon Bedrock documentation on LLM prompt design and the FMEval documentation.

The LLM is assigned a persona to set its point of view for carrying out the task. In the instructions, the LLM identifies facts as entities from the source document chunk. For each fact, a question-answer-fact triplet is assembled based on the fact detected and its surrounding context. In the prompt, we provide detailed examples for controlling the content of ground truth questions. The examples focus on questions on chunk-wise business knowledge while ignoring irrelevant metadata that might be contained in a chunk. You can customize the prompt examples to fit your ground truth use case.

We further instruct the LLM to apply ground truth curation best practices for FMEval, such as generating multiple variations of facts to fit multiple possible unit expressions. Additional curation elements subject to the task at hand—such as brand language and tone—can be introduced into the ground truth generation prompt. With the following template, we verified that Anthropic’s Claude 3.5 Sonnet can generate custom ground truth attributes accommodating FMEval features, such as the <OR> delimiter to denote alternative acceptable answers for a ground truth fact.

“””You are an expert in ground truth curation for generative AI application evaluation on AWS.

Follow the instructions provided in the <instructions> XML tag for generating question answer fact triplets from a source document excerpt.

<instructions>
– Let’s work this out in a step-by-step way to be sure we have the right answer.
– Review the source document excerpt provided in <document> XML tags below
– For each meaningful domain fact in the <document>, extract an unambiguous question-answer-fact set in JSON format including a question and answer pair encapsulating the fact in the form of a short sentence, followed by a minimally expressed fact extracted from the answer.

<domain_knowledge_focus>
– Focus ONLY on substantive domain knowledge contained within the document content
– Ignore all metadata and structural elements including but not limited to:
– Document dates, versions, page numbers
– Section numbers or titles
– Table structure or row/column positions
– List positions or ordering
– Questions must reference specific domain entities rather than generic document elements
</domain_knowledge_focus>

<context_specification_requirements>
Document Source Identification
– Always reference the specific source document and its date/version
– Example: “According to the [Document Name + Date], what is [specific query]?”

Cross-Reference Prevention
– Each question must be answerable from the current document chunk only
– Do not create questions requiring information from multiple documents
– Example: “In this [Document Name], what are [specific requirements]?”

Department/LOB Specification
– Always specify the relevant department, line of business, or organizational unit
– Example: “What are the [Department Name]’s requirements for [specific process]?”

Document Section Targeting
– Reference specific sections when the information location is relevant
– Example: “In Section [X] of [Document Name], what are the steps for [specific process]?”

Role-Based Context
– Specify relevant roles, responsibilities, or authority levels
– Example: “Which [specific roles] are authorized to [specific action]?”

Version Control Elements
– Include relevant version or revision information
– Example: “What changes were implemented in the [Month Year] revision of [Document]?”

Policy/Procedure Numbers
– Include specific policy or procedure reference numbers
– Example: “Under Policy [Number], what are the requirements for [specific action]?”

Regulatory Framework References
– Specify relevant regulatory frameworks or compliance requirements
– Example: “What [Regulation] compliance requirements are specified for [specific process]?”

System/Platform Specification
– Name specific systems, platforms, or tools
– Example: “What steps are required in [System Name] to [specific action]?”

Document Type Classification
– Specify the type of document (SOP, Policy, Manual, etc.)
– Example: “In the [Document Type + Number], where is [specific information] stored?”

Temporal Validity
– Include effective dates or time periods
– Example: “What process is effective from [Date] according to [Document]?”

Geographic Jurisdiction
– Specify relevant geographic regions or jurisdictions
– Example: “What requirements apply to [Region] according to [Document]?”

Business Process Owner
– Identify relevant process owners or responsible parties
– Example: “According to [Document], who owns the process for [specific action]?”

Classification Level
– Include relevant security or confidentiality classifications
– Example: “What are the requirements for [Classification Level] data?”

Stakeholder Scope
– Specify relevant stakeholders or approval authorities
– Example: “Which [stakeholder level] must approve [specific action]?”
</context_specification_requirements>

<question_quality_criteria>
– Questions must be specific enough that a vector database can match them to the relevant document chunk
– Questions should include key identifying terms, names, and context
– Questions should target concrete, actionable information
– Answers should provide complete context without referring back to document elements
</question_quality_criteria>

<output_format>
The question-answer-fact set should each be a short string in JSON format with the keys: “question”, “ground_truth_answer”, “fact”
</output_format>

<best_practices>
– Questions, answers, and facts should not refer to the subject entity as “it” or “they”, and instead refer to it directly by name
– Questions, answers, and facts should be individually unique to the document chunk, such that based on the question a new call to the retriever will address the correct document section when posing the ground truth question
– Facts should be represented in 3 or fewer words describing an entity in the <document>
– If there are units in the fact, the “fact” entry must provide multiple versions of the fact using <OR> as a delimiter. See <unit_variations> for examples.
<unit_variations>
– Dollar Unit Equivalencies: `1,234 million<OR>1.234 billion`
– Date Format Equivalencies: `2024-01-01<OR>January 1st 2024`
– Number Equivalencies: `1<OR>one`
</unit_variations>
</best_practices>

– Start your response immediately with the question-answer-fact set JSON, and separate each extracted JSON record with a newline.
</instructions>

<document>
{context_document}
</document>

Now, extract the question answer pairs and fact from the document excerpt according to your instructions, starting immediately with JSON and no preamble.”””

The generation output is provided as fact-wise JSONLines records in the following format, where elements in square brackets represent values such as those shown in the example table that follows.

{
  "question": "[Question]",
  "ground_truth_answer": "[Ground Truth Answer]",
  "fact": "[Fact]"
}
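The following is a minimal sketch of how a generation step might render the prompt template for one document chunk, invoke it on Amazon Bedrock, and parse the response into these JSONLines records. The model ID, helper function name, and the {context_document} replacement approach are illustrative assumptions rather than the exact pipeline implementation.

```python
import json
import boto3

# Assumed model ID; adjust to the model available in your account and Region.
MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"
bedrock = boto3.client("bedrock-runtime")

def generate_ground_truth(prompt_template: str, chunk_text: str) -> list[dict]:
    """Render the ground truth prompt for one chunk and parse the JSONLines output."""
    # Use replace() rather than format() so other braces in the template are left alone.
    prompt = prompt_template.replace("{context_document}", chunk_text)
    response = bedrock.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 2048, "temperature": 0.0},
    )
    output_text = response["output"]["message"]["content"][0]["text"]

    records = []
    for line in output_text.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Keep only records with the expected question-answer-fact keys.
        if {"question", "ground_truth_answer", "fact"} <= record.keys():
            records.append(record)
    return records
```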

Here are a few examples of generated ground truth:

| Question | Ground Truth Answer | Fact |
| --- | --- | --- |
| What was Amazon’s total revenue growth in 2023? | Amazon’s total revenue grew 12% year-over-year from $514B to $575B in 2023. | 12%<OR>$514B to $575B |
| How much did North America revenue increase in 2023? | North America revenue increased 12% year-over-year from $316B to $353B. | 12%<OR>$316B to $353B |
| What was the growth in International revenue for Amazon in 2023? | International revenue grew 11% year-over-year from $118B to $131B. | 11%<OR>$118B to $131B |
| How much did AWS revenue increase in 2023? | AWS revenue increased 13% year-over-year from $80B to $91B. | 13%<OR>$80B to $91B |
| What was Amazon’s operating income improvement in 2023? | Operating income in 2023 improved 201% year-over-year from $12.2B to $36.9B. | 201%<OR>$12.2B to $36.9B |
| What was Amazon’s operating margin in 2023? | Amazon’s operating margin in 2023 was 6.4%. | 6.4% |
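As a rough illustration of how these generated triplets feed the evaluation itself, the following sketch scores a single model response against a ground truth fact using FMEval’s Factual Knowledge metric and the <OR> delimiter. Treat the import paths and configuration fields as assumptions to verify against the FMEval documentation for your installed version.

```python
from fmeval.eval_algorithms.factual_knowledge import (
    FactualKnowledge,
    FactualKnowledgeConfig,
)

# The <OR> delimiter tells FMEval that any of the listed fact variants counts as correct.
eval_algo = FactualKnowledge(FactualKnowledgeConfig(target_output_delimiter="<OR>"))

scores = eval_algo.evaluate_sample(
    target_output="12%<OR>$514B to $575B",  # ground truth fact from the generated dataset
    model_output="Amazon's total revenue grew 12% year-over-year in 2023.",
)
print(scores)  # list of EvalScore objects; a score of 1.0 indicates the fact was found
```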

Scaling ground truth generation with a pipeline

To automate ground truth generation, we provide a serverless batch pipeline architecture, shown in the following figure. At a high level, the AWS Step Functions pipeline accepts source data in Amazon Simple Storage Service (Amazon S3), and orchestrates AWS Lambda functions for ingestion, chunking, and prompting on Amazon Bedrock to generate the fact-wise JSONLines ground truth.

There are three user inputs to the Step Functions state machine:

A custom name for the ground truth dataset
The input Amazon S3 prefix for the source data
The percentage of records to sample for human review

Additional configurations are set by Lambda environment variables, such as the S3 source bucket and the Amazon Bedrock model ID to invoke for generation.

{
  "dataset_name": "YOUR_DATASET_NAME",
  "input_prefix": "YOUR_INPUT_PREFIX",
  "review_percentage": "REVIEW_PERCENTAGE"
}

After the initial payload is passed, a validation function assembles the global event payload structure in terms of system input and user input.

{
  "system_input": {
    "run_id": "<AWS Step Functions execution ID>",
    "input_bucket": "<Input data Amazon S3 bucket>",
    "output_bucket": "<Output data Amazon S3 bucket>",
    "output_document_chunks_prefix": "<Amazon S3 bucket prefix to store chunks>",
    "chunk_size": "<Document chunk size>",
    "chunk_overlap": "<Number of tokens that will overlap across consecutive chunks>"
  },
  "user_input": {
    "dataset_name": "<Dataset name>",
    "input_prefix": "<Amazon S3 bucket prefix for ground truth generation input data>",
    "review_percentage": "<Percent of records to flag for human review>"
  }
}
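A minimal sketch of that validation function, assuming hypothetical environment variable names for the system configuration, could look like the following.

```python
import os

def lambda_handler(event, context):
    """Merge user input from the Step Functions payload with system configuration."""
    required = {"dataset_name", "input_prefix", "review_percentage"}
    missing = required - event.keys()
    if missing:
        raise ValueError(f"Missing required user inputs: {missing}")

    return {
        "system_input": {
            # Assumed: the state machine passes its execution ID; fall back to the request ID.
            "run_id": event.get("execution_id", context.aws_request_id),
            "input_bucket": os.environ["INPUT_BUCKET"],
            "output_bucket": os.environ["OUTPUT_BUCKET"],
            "output_document_chunks_prefix": os.environ["CHUNKS_PREFIX"],
            "chunk_size": int(os.environ.get("CHUNK_SIZE", "1000")),
            "chunk_overlap": int(os.environ.get("CHUNK_OVERLAP", "100")),
        },
        "user_input": {
            "dataset_name": event["dataset_name"],
            "input_prefix": event["input_prefix"],
            "review_percentage": float(event["review_percentage"]),
        },
    }
```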

After validation, the first distributed map state iterates over the files in the input bucket to start the document ingestion and chunking processes with horizontal scaling. The resulting chunks are stored in an intermediate S3 bucket.
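The ingestion and chunking Lambda invoked by this map state could be sketched as follows; the whitespace-token splitting strategy and the per-file event shape are assumptions you can swap for your preferred tokenizer and ingestion format.

```python
import json
import boto3

s3 = boto3.client("s3")

def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into overlapping chunks of roughly chunk_size whitespace tokens."""
    tokens = text.split()
    chunks, start = [], 0
    while start < len(tokens):
        end = start + chunk_size
        chunks.append(" ".join(tokens[start:end]))
        if end >= len(tokens):
            break
        start = end - chunk_overlap
    return chunks

def lambda_handler(event, context):
    system_input = event["system_input"]
    document_key = event["document_key"]  # assumed per-file input from the distributed map

    body = s3.get_object(Bucket=system_input["input_bucket"], Key=document_key)["Body"].read()
    chunks = chunk_text(
        body.decode("utf-8"),
        int(system_input["chunk_size"]),
        int(system_input["chunk_overlap"]),
    )

    chunk_keys = []
    for i, chunk in enumerate(chunks):
        key = f"{system_input['output_document_chunks_prefix']}/{document_key}/chunk_{i:05d}.json"
        s3.put_object(
            Bucket=system_input["output_bucket"],
            Key=key,
            Body=json.dumps({"chunk_id": i, "text": chunk}),
        )
        chunk_keys.append(key)
    return {"chunk_keys": chunk_keys}
```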

The second distributed map is the generation core of the pipeline. Each chunk generated by the previous map is fed as an input to the ground truth generation prompt on Amazon Bedrock. For each chunk, a JSONLines file containing the question-answer-fact triplets is validated and stored in an S3 bucket at the output prefix.

The following figure shows a view of the data structure and lineage from document paragraphs to the final ground truth chunk across the chunking and generation map states. The numbering between the two figures indicates the data structure present at each point in the pipeline. Finally, the JSONLines files are aggregated in an Amazon SageMaker Processing Job, including the assignment of a random sample for human review based on user input.

The last step of the pipeline is the aggregation step using a SageMaker Processing job. The aggregation step consists of concatenating the JSONLines records generated by every child execution of the generation map into a single ground truth output file. A randomly selected percentage of the records in the output file are sampled and flagged for review as part of a human-in-the-loop process.
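The aggregation and review-sampling logic running in the SageMaker Processing job can be sketched as follows, assuming the per-chunk JSONLines files have already been staged in a local input directory.

```python
import json
import random
from pathlib import Path

def aggregate(input_dir: str, output_file: str, review_percentage: float) -> None:
    """Concatenate per-chunk JSONLines files and flag a random sample for human review."""
    records = []
    for path in sorted(Path(input_dir).glob("**/*.jsonl")):
        with open(path) as f:
            records.extend(json.loads(line) for line in f if line.strip())

    # Flag a random subset of records for the human-in-the-loop review step.
    sample_size = round(len(records) * review_percentage / 100)
    review_indices = set(random.sample(range(len(records)), sample_size))
    for i, record in enumerate(records):
        record["flagged_for_review"] = i in review_indices

    with open(output_file, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
```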

Judging ground truth for FMEval question-answering evaluation

In this section, we discuss two key components of evaluating ground truth quality: human-in-the-loop review and applying an LLM as a judge. Measuring ground truth quality is an essential component of the evaluation lifecycle.

Human-in-the-loop

The level of ground truth human review required is determined by the risk of having incorrect ground truth and its negative implications. Ground truth review by use case SMEs can verify whether critical business logic is appropriately represented by the ground truth. The process of ground truth review by humans is called human-in-the-loop (HITL), and an example of the HITL process is shown in the following figure.

The steps of HITL are:

Classify risk: Performing a risk analysis establishes the severity and likelihood of negative events occurring as a result of incorrect ground truth used for evaluation of a generative AI use case. Based on the outcome of the analysis, assign the ground truth dataset a risk level: Low, Medium, High, or Critical. The table below outlines the relationship between event severity, likelihood, and risk level. See Learn how to assess the risk of AI systems for a deep dive on performing AI risk assessment.
Human review: Based on the assigned risk level, use case expert reviewers examine a proportional amount of the use case ground truth. Organizations can set acceptability thresholds for the percentage of HITL intervention based on their tolerance for risk. Similarly, if a ground truth dataset is promoted from a low-risk to a medium-risk use case, an increased level of HITL intervention will be necessary.
Identify findings: Reviewers can identify any hallucinations relative to source data, challenges with information veracity according to their expertise, or other criteria set by the organization. In this post, we focus on hallucination detection and information veracity.
Action results: Reviewers can take business actions based on their judgement, such as updating or deleting records, or rewriting applicable source documents. Bringing in LLMOps SMEs to apply dataset curation best practices can also be an outcome.

Putting the risk table from Learn how to assess the risk of AI systems into action, the severity and likelihood of risks for a ground truth dataset validating a production chatbot with frequent customer use would be greater than for an internal evaluation dataset used by developers to advance a prototype.

| Severity \ Likelihood | Rare | Unlikely | Possible | Likely | Frequent |
| --- | --- | --- | --- | --- | --- |
| Extreme | Low | Medium | High | Critical | Critical |
| Major | Very low | Low | Medium | High | Critical |
| Moderate | Very low | Low | Medium | Medium | High |
| Low | Very low | Very low | Low | Low | Medium |
| Very Low | Very low | Very low | Very low | Very low | Low |

Next, we walk through the step-by-step process of conducting a human review for hallucination detection and information veracity. Human review is performed by comparing the source document chunk provided to the LLM prompt with the generated question-answer-fact triplets. This view is shown in the following table.

| Source data chunk | Ground truth triplets |
| --- | --- |
| Dear Shareholders: Last year at this time, I shared my enthusiasm and optimism for Amazon’s future. Today, I have even more. The reasons are many, but start with the progress we’ve made in our financial results and customer experiences, and extend to our continued innovation and the remarkable opportunities in front of us. In 2023, Amazon’s total revenue grew 12% year-over-year (“YoY”) from $514B to $575B. By segment, North America revenue increased 12% YoY from $316B to $353B, International revenue grew 11% YoY from $118B to $131B, and AWS revenue increased 13% YoY from $80B to $91B. | {"question": "What was Amazon’s total revenue growth in 2023?", "ground_truth_answer": "Amazon’s total revenue grew 12% year-over-year from $514B to $575B in 2023.", "fact": "12%<OR>$514B to $575B"} {"question": "How much did North America revenue increase in 2023?", "ground_truth_answer": "North America revenue increased 12% year-over-year from $316B to $353B.", "fact": "12%<OR>$316B to $353B"} {"question": "What was the growth in International revenue for Amazon in 2023?", "ground_truth_answer": "International revenue grew 11% year-over-year from $118B to $131B.", "fact": "11%<OR>$118B to $131B"} |

Human reviewers then identify and take action based on findings to correct the system. LLM hallucination is the phenomenon where LLMs generate plausible-sounding but factually incorrect or nonsensical information, presented confidently as factual. Organizations can introduce additional qualities for evaluating and scoring ground truth, as suited to the risk level and use case requirements.

In hallucination detection, reviewers seek to identify text that has been incorrectly generated by the LLM. An example of hallucination and remediation is shown in the following table. A reviewer would notice in the source data that Amazon’s total revenue grew 12% year over year, yet the ground truth answer hallucinated a 15% figure. In remediation, the reviewer can change this back to 12%.

| Source data chunk | Example hallucination | Example hallucination remediation |
| --- | --- | --- |
| In 2023, Amazon’s total revenue grew 12% year-over-year (“YoY”) from $514B to $575B. | {"question": "What was Amazon’s total revenue growth in 2023?", "ground_truth_answer": "Amazon’s total revenue grew 15% year-over-year from $514B to $575B in 2023.", "fact": "12%<OR>$514B to $575B"} | {"question": "What was Amazon’s total revenue growth in 2023?", "ground_truth_answer": "Amazon’s total revenue grew 12% year-over-year from $514B to $575B in 2023.", "fact": "12%<OR>$514B to $575B"} |

In SME review for veracity, reviewers seek to validate whether the ground truth is in fact truthful. For example, the source data used for the ground truth generation prompt might be out of date or incorrect. The following table shows the perspective of an HITL review by a domain SME.

| Source data chunk | Example SME review | Example remediation actions |
| --- | --- | --- |
| Effective June 1st, 2023, AnyCompany is pleased to announce the implementation of “Casual Friday” as part of our updated dress code policy. On Fridays, employees are permitted to wear business casual attire, including neat jeans, polo shirts, and comfortable closed-toe shoes. | “As an HR Specialist, this looks incorrect to me. We did not implement the Casual Friday policy after all at AnyCompany – the source data for this ground truth must be out of date.” | Delete incorrect ground truth; update the source data document; other use case specific actions |

Traditional machine learning applications can also inform the HITL process design. For examples of HITL for traditional machine learning, see Human-in-the-loop review of model explanations with Amazon SageMaker Clarify and Amazon A2I. 

LLM-as-a-judge

When scaling HITL, LLM reviewers can perform hallucination detection and remediation. This idea is known as self-reflective RAG, and can be used to decrease—but not eliminate—the level of human effort in the process for hallucination detection. As a means of scaling LLM-as-a-judge review, Amazon Bedrock now offers the ability to use LLMs as judges and to perform automated reasoning checks with Amazon Bedrock Guardrails for mathematically sound self-validation against predefined policies. For more information about implementation, see New RAG evaluation and LLM-as-a-judge capabilities in Amazon Bedrock and Prevent factual errors from LLM hallucinations with mathematically sound Automated Reasoning checks (preview).

The following figure shows an example high-level diagram of a self-reflective RAG pattern. A generative AI application based on RAG yields responses fed to a judge application. The judge application reflects on whether responses are incomplete, hallucinated, or irrelevant. Based on the judgement, data is routed along the corresponding remediation.
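A simplified sketch of the judge step in such a self-reflective pattern is shown below. The judge prompt, model ID, and routing labels are illustrative assumptions for a custom judge, not the managed Amazon Bedrock LLM-as-a-judge capability.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")
JUDGE_MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # assumed judge model

JUDGE_PROMPT = """You are a strict evaluator. Given a question, retrieved context, and a response,
classify the response as exactly one of: COMPLETE, INCOMPLETE, HALLUCINATED, IRRELEVANT.
Question: {question}
Context: {context}
Response: {response}
Return only the single label."""

def judge_response(question: str, context: str, response: str) -> str:
    """Ask the judge model to classify a RAG response so it can be routed for remediation."""
    prompt = JUDGE_PROMPT.format(question=question, context=context, response=response)
    result = bedrock.converse(
        modelId=JUDGE_MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 10, "temperature": 0.0},
    )
    return result["output"]["message"]["content"][0]["text"].strip()

# Example routing based on the judgement:
# label = judge_response(question, context, response)
# if label == "HALLUCINATED": regenerate with stronger grounding
# elif label == "INCOMPLETE": retrieve additional context and regenerate
```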

The golden rule in implementing HITL or LLM-as-a-judge as part of ground truth generation is to make sure the organization’s review process aligns with the accepted risk level for the ground truth dataset.

Conclusion

In this post, we provided guidance on generating and reviewing ground truth for evaluating question-answering applications using FMEval. We explored best practices for applying LLMs to scale ground truth generation while maintaining quality and accuracy. The serverless batch pipeline architecture we presented offers a scalable solution for automating this process across large enterprise knowledge bases. We provide a ground truth generation prompt that you can use to get started with evaluating knowledge assistants using the FMEval Factual Knowledge and QA Accuracy evaluation metrics.

By following these guidelines, organizations can follow responsible AI best practices for creating high-quality ground truth datasets for deterministic evaluation of question-answering assistants. Use case-specific evaluations supported by well-curated ground truth play a crucial role in developing and deploying AI solutions that meet the highest standards of quality and responsibility.

Whether you’re developing an internal tool, a customer-facing virtual assistant, or exploring the potential of generative AI for your organization, we encourage you to adopt these best practices. Start implementing robust ground truth generation and review processes for your generative AI question-answering evaluations today with FMEval.

About the authors

Samantha Stuart is a Data Scientist with AWS Professional Services, and has delivered for customers across generative AI, MLOps, and ETL engagements. Samantha has a research master’s degree in engineering from the University of Toronto, where she authored several publications on data-centric AI for drug delivery system design. Outside of work, she is most likely spotted playing music, spending time with friends and family, at the yoga studio, or exploring Toronto.

Philippe Duplessis-Guindon is a cloud consultant at AWS, where he has worked on a wide range of generative AI projects. He has touched on most aspects of these projects, from infrastructure and DevOps to software development and AI/ML. After earning his bachelor’s degree in software engineering and a master’s in computer vision and machine learning from Polytechnique Montreal, Philippe joined AWS to put his expertise to work for customers. When he’s not at work, you’re likely to find Philippe outdoors—either rock climbing or going for a run.

Rahul Jani is a Data Architect with AWS Professional Service. He collaborates closely with enterprise customers building modern data platforms, generative AI applications, and MLOps. He is specialized in the design and implementation of big data and analytical applications on the AWS platform. Beyond work, he values quality time with family and embraces opportunities for travel.

Ivan Cui is a Data Science Lead with AWS Professional Services, where he helps customers build and deploy solutions using ML and generative AI on AWS. He has worked with customers across diverse industries, including software, finance, pharmaceutical, healthcare, IoT, and entertainment and media. In his free time, he enjoys reading, spending time with his family, and traveling.
