Fine-tune LLMs with synthetic data for context-based Q&A using Amazon Bedrock

There’s a growing demand from customers to incorporate generative AI into their businesses. Many use cases involve using pre-trained large language models (LLMs) through approaches like Retrieval Augmented Generation (RAG). However, for advanced, domain-specific tasks or those requiring specific formats, model customization techniques such as fine-tuning are sometimes necessary. Amazon Bedrock provides you with the ability to customize leading foundation models (FMs) such as Anthropic’s Claude 3 Haiku and Meta’s Llama 3.1.

Amazon Bedrock is a fully managed service that makes FMs from leading AI startups and Amazon available through an API, so you can choose from a wide range of FMs to find the model that is best suited for your use case. Amazon Bedrock offers a serverless experience, so you can get started quickly, privately customize FMs with your own data, and integrate and deploy them into your applications using AWS tools without having to manage any infrastructure.

Fine-tuning is a supervised training process where labeled prompt and response pairs are used to further train a pre-trained model to improve its performance for a particular use case. One consistent pain point of fine-tuning is the lack of data to effectively customize these models. Gathering relevant data is difficult, and maintaining its quality is another hurdle. Furthermore, fine-tuning LLMs requires substantial resource commitment. In such scenarios, synthetic data generation offers a promising solution. You can create synthetic training data using a larger language model and use it to fine-tune a smaller model, which has the benefit of a quicker turnaround time.

In this post, we explore how to use Amazon Bedrock to generate synthetic training data to fine-tune an LLM. Additionally, we provide concrete evaluation results that showcase the power of synthetic data in fine-tuning when data is scarce.

Solution overview

The solution comprises two main steps:

Generate synthetic data using the Amazon Bedrock InvokeModel API.
Fine-tune using an Amazon Bedrock custom model.

For synthetic data generation, we use a larger language model (such as Anthropic’s Claude 3 Sonnet on Amazon Bedrock) as the teacher model, and a smaller language model (such as Anthropic’s Claude Instant 1.2 or Claude 3 Haiku on Amazon Bedrock) as the student model for fine-tuning. We use the larger teacher model to generate new data based on its knowledge, which is then used to train the smaller student model. This concept is similar to knowledge distillation used in deep learning, except that we’re using the teacher model to generate a new dataset from its knowledge rather than directly modifying the architecture of the student model.

The following diagram illustrates the overall flow of the solution.

Finally, we share our experiment results, where we compare the performance of the model fine-tuned with synthetic data to the baseline (not fine-tuned) model and to a model fine-tuned with an equal amount of original training data.

Prerequisites

To generate synthetic data and fine-tune models using Amazon Bedrock, you first need to create an AWS Identity and Access Management (IAM) service role with the appropriate permissions. This role is used by Amazon Bedrock to access the necessary resources on your behalf.

For instructions on creating the service role, refer to Create a service role for model customization. Also, make sure the role has the permission for the bedrock:InvokeModel action.

If you’re running this code using an Amazon SageMaker notebook instance, edit the IAM role that’s attached to the notebook (for example, AmazonSageMaker-ExecutionRole-XXX) instead of creating a new role. Follow Create a service role for model customization to modify the trust relationship and add the S3 bucket permission. Additionally, on the role’s Permissions tab, create the following inline policies:

Policy name: bedrock-customization

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:ListModelCustomizationJobs",
                "bedrock:DeleteCustomModel",
                "bedrock:CreateModelCustomizationJob",
                "bedrock:StopModelCustomizationJob",
                "bedrock:ListCustomModels",
                "bedrock:GetCustomModel",
                "bedrock:GetModelCustomizationJob"
            ],
            "Resource": "*"
        }
    ]
}

Policy name: iam-pass-role

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": [
                "${sagemaker-execution-role-arn}"
            ]
        }
    ]
}

The final permission policies for the SageMaker execution role should include AmazonSageMaker-ExecutionPolicy, AmazonSageMakerFullAccess, bedrock-customization, and iam-pass-role.

Generate synthetic data using the Amazon Bedrock InvokeModel API

We use the Amazon Bedrock InvokeModel API to generate synthetic data for fine-tuning. You can use the API to programmatically send an inference (text generation) request to the model of your choice. All you need is a well-crafted prompt tailored for data synthesis. We used the following sample prompt for our use case:

PROMPT = """
You are an AI assistant who is an expert in Amazon services. Your task is to understand a system that takes in a list of documents, and based on that, answers a question by providing citations for the documents that it referred the answer from.

Your job is to generate three new Question/Answer pairs, emulating the tone, style, and grammar of the original data provided.

Here is the original data:
Input Documents and Question: {document}\n\nQuestion: {question}
Output Answer: {answer}

Strictly return a jsonl with the keys (question, answer, topic). Every topic should be different. The answers should be in the exact same format as the original. The question and the answer should be different in content from the original data provided, and all questions should be diverse and different from each other. Do not answer in any other format. The response should be parsable as a jsonl.
"""

The goal of our use case was to fine-tune a model to generate a relevant and coherent answer based on a given reference document and a question. RAG is a popular technique used for such Q&A tasks; however, one significant challenge with RAG is the potential for retrieving unrelated or irrelevant documents, which can lead to inaccurate responses. You can apply fine-tuning to guide the model to better focus on the relevance of the documents to the question instead of using the provided documents without context to answer the question.

Our dataset includes Q&A pairs with reference documents regarding AWS services. Each sample has up to five reference documents as context, and a single-line question follows. The following table shows an example.

document

Context:

Document 1:

Step 1: Prepare to work with AWS CodeStar projects

In this step, you create an AWS CodeStar service role and an Amazon EC2 key pair, so that you can begin creating and working with AWS CodeStar projects. If you have used AWS CodeStar before, skip ahead to Step 2

Step 2: Create a Project in AWS CodeStar.

For this step, follow the instructions in Setting Up AWS CodeStar in the AWS CodeStar User Guide. Do not create a new AWS account, IAM user, or IAM group as part of those instructions. Use the ones you created or identified in Team Setup for AWS Cloud9. When you finish following those instructions, return to this topic.

Document 2:

Setting Up AWS CodeStar

Before you can start using AWS CodeStar, you must complete the following steps.

Topics:

Step 1: Create an account

Step 2: Create the AWS CodeStar Service Role

Step 3: Configure the User’s IAM Permissions

Step 4: Create an Amazon EC2 Key Pair for AWS CodeStar Projects

Step 5: Open the AWS CodeStar Console

Next Steps

Document 3:

How Do I Get Started with AWS CodeStar?

To get started with AWS CodeStar:

Prepare to use AWS CodeStar by following the steps in Setting Up AWS CodeStar.

Experiment with AWS CodeStar by following the steps in the Getting Started with AWS CodeStar tutorial.

Share your project with other developers by following the steps in Add Team Members to an AWS CodeStar Project.

Integrate your favorite IDE by following the steps in Use an IDE with AWS CodeStar.

Document 4:

Step 2: Create a project in AWS CodeStar

In this step, you create a project in AWS CodeStar.

If you already have a project in AWS CodeStar you want to use, skip ahead to Step 3: Create an AWS Cloud9 Development Environment and Connect It to the Project.

For this step, follow the instructions in Create a Project in AWS CodeStar in the AWS CodeStar User Guide. In the AWS CodeStar create project wizard, when you get to the Setup tools page or Connect to your source repository page, choose Skip, and then return to this topic.

Document 5:

Create a Project in AWS CodeStar

You use the AWS CodeStar console to create a project. If you use a project template, it sets up the required resources for you. The template also includes sample code that you can use to start coding.

To create a project, sign in to the AWS Management Console with an IAM user that has the AWSCodeStarFullAccess policy or equivalent permissions. For more information, see Setting Up AWS CodeStar.

Note: You must complete the steps in Setting Up AWS CodeStar before you can complete the procedures in this topic.

Topics

Create a Project in AWS CodeStar (Console)

Create a Project in AWS CodeStar (AWS CLI)

Answer the following question to the best of your ability. Use and paraphrase the information in the context, if it answers the question. Otherwise, rely on your knowledge and memory.

question
How do I get started with AWS CodeStar?

answer

To get started with AWS CodeStar, follow these steps:

1. Prepare to work with AWS CodeStar projects:

a. Create an AWS CodeStar service role by following the instructions in “Setting Up AWS CodeStar” (Document 2) in the AWS CodeStar User Guide [1].

b. Create an Amazon EC2 key pair for AWS CodeStar projects, as described in the same guide [1].

2. Create a project in AWS CodeStar:

a. Sign in to the AWS Management Console with an IAM user that has the AWSCodeStarFullAccess policy or equivalent permissions [2].

b. In the AWS CodeStar console, select one of the available project templates, which will automatically provision all the required resources for your project [3].

c. Once your project is provisioned, you can see your running application from the Application endpoints tile [3].

3. Connect to the AWS CodeCommit source repository for your project:

a. Use the steps provided in the CodeStar console to connect to the AWS CodeCommit source repository for your project [3].

b. Begin coding and making changes to your project’s source code.

4. Track and manage changes in the release process:

a. Use the project dashboard to track and manage changes in the release process [3].

b. View the most recent project activity on the dashboard.

5. Share your project with other developers:

a. Follow the steps in “Add Team Members to an AWS CodeStar Project” to share your project with other developers [3].

6. Integrate your favorite IDE:

a. Follow the steps in “Use an IDE with AWS CodeStar” to integrate your preferred Integrated Development Environment (IDE) with AWS CodeStar [3].

[1] https://docs.aws.amazon.com/codestar/latest/userguide/setting-up.html

[2] https://docs.aws.amazon.com/codestar/latest/userguide/create-project.html

[3] https://docs.aws.amazon.com/codestar/latest/userguide/getting-started.html

For data synthesis, we asked the model to generate three new Q&A pairs per reference document. However, you can adjust the number as needed. The crucial part is to make the model think deeply about a variety of topics. Because the purpose of generating synthetic data is to enrich the training dataset, it’s more beneficial to have the model look at different parts of the documents and create Q&A pairs with different topics than the original.

The following example shows how to generate synthetic data with the Amazon Bedrock InvokeModel API. We tested the preceding prompt with Anthropic’s Claude 3 Sonnet. If you want to test a different model, retrieve the corresponding model ID from Amazon Bedrock model IDs, and replace the modelId variable in the function.

import boto3
import json

bedrock = boto3.client(service_name="bedrock-runtime")

def generate_synthetic_data(document, question, answer):

    values = {
        "document": document,
        "question": question,
        "answer": answer
    }

    body = {
        "messages": [{
            "role": "user", "content": PROMPT.format(**values)
        }],
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 2048,
        "temperature": 0.5
    }

    response = bedrock.invoke_model(
        body=json.dumps(body),
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        accept="application/json",
        contentType="application/json"
    )

    response_body = json.loads(response.get('body').read())

    return response_body['content'][0]['text']

The preceding function returns three JSONL records as a single string, with question, answer, and topic as keys. The following parse_llm_output function loads that string and uses regular expressions to retrieve the generated questions and answers. Then, the create_synthetic_samples function combines those two functionalities to produce the final synthetic training samples.

import re
import pandas as pd

def parse_llm_output(jsonl_string):

    question_pattern = re.compile(r'"question":\s*"([^"]+)"')
    answer_pattern = re.compile(r'"answer":\s*"(.*?)"\s*,\s*"topic"')
    questions = question_pattern.findall(jsonl_string)
    answers = answer_pattern.findall(jsonl_string)

    return questions, answers

def create_synthetic_samples(row: pd.Series) -> pd.DataFrame:

    jsonl_string = generate_synthetic_data(row['document'], row['question'], row['answer'])
    questions, answers = parse_llm_output(jsonl_string)

    return pd.DataFrame({
        "document": [row['document']] * len(questions),
        "question": questions,
        "answer": answers
    })

def to_customization_format(row):

    msg = {
        "messages": [
            {"role": "user", "content": f"{row['document']}\n\nQuestion: {row['question']}"},
            {"role": "assistant", "content": row['answer']}
        ]
    }

    return msg

The following script combines all of the preceding functions and gives you the final training set with both original and synthetic samples. We convert the samples into the format required by the customization job using the to_customization_format function and save them as train.jsonl. Assume the input data is a CSV file with three columns: document, question, and answer.

import json
import pandas as pd

# Load original training samples
original_train = pd.read_csv(input_df_path)

# Create synthetic training samples
synthetic_train = pd.concat(original_train.apply(create_synthetic_samples, axis=1).tolist())

# Combine original and synthetic samples
final_train_df = pd.concat([original_train, synthetic_train])

# Convert to the format required by the customization job
final_train = final_train_df.apply(to_customization_format, axis=1).tolist()

# Write to JSONL file
with open('train.jsonl', 'w') as file:
    for item in final_train:
        json.dump(item, file)
        file.write('\n')
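
Before uploading, you can optionally confirm that every line of train.jsonl parses as JSON in the expected two-turn messages format. The following is a minimal sanity-check sketch:

import json

# Quick validation: every line should parse as JSON with a two-turn messages list
with open('train.jsonl') as file:
    for i, line in enumerate(file):
        record = json.loads(line)
        assert "messages" in record and len(record["messages"]) == 2, f"Bad record at line {i}"

print("train.jsonl looks well formed")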

Fine-tune using an Amazon Bedrock custom model

Now that you have the synthetic data generated by the teacher model along with your original data, it’s time to train the student model. We fine-tune the student model using the Amazon Bedrock custom model functionality.

Model customization is the process of providing training data to an FM to improve its performance for specific use cases. Amazon Bedrock offers three model customization methods as of this writing:

Fine-tuning
Continued pre-training
Distillation (preview)

You can create your own custom model using any of these methods through the Amazon Bedrock console or API. For more information on supported models and AWS Regions with various customization methods, please see User guide for model customization. In this section, we focus on how to fine-tune a model using the API.
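
Before submitting a job, you can optionally verify which base models support fine-tuning in your Region through the API. The following is a minimal sketch using the list_foundation_models API with the byCustomizationType filter; the returned list depends on your Region and model access.

import boto3

# Bedrock control-plane client (not bedrock-runtime)
bedrock = boto3.client(service_name="bedrock")

# List foundation models that support the FINE_TUNING customization type
response = bedrock.list_foundation_models(byCustomizationType="FINE_TUNING")

for summary in response["modelSummaries"]:
    print(summary["modelId"])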

To create a fine-tuning job in Amazon Bedrock, complete the following prerequisite steps:

Create an Amazon Simple Storage Service (Amazon S3) bucket for your training data and another one for your output data (the names must be unique).
Upload the train.jsonl file to the training data bucket (a minimal upload sketch follows this list).
Make sure that you have created an IAM role, as described in the Prerequisites section.
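
The following is a minimal sketch of the bucket and upload steps using boto3; the bucket names are placeholders that you should replace with your own globally unique names.

import boto3

s3 = boto3.client("s3")

# Placeholder bucket names; replace with your own globally unique names
training_bucket = "training-bucket-example"
output_bucket = "output-bucket-example"

# Create the buckets (add a CreateBucketConfiguration if you work outside us-east-1)
s3.create_bucket(Bucket=training_bucket)
s3.create_bucket(Bucket=output_bucket)

# Upload the training data produced earlier
s3.upload_file("train.jsonl", training_bucket, "train.jsonl")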

When these steps are complete, run the following code to submit a new fine-tuning job. In our use case, the student model was Anthropic’s Claude Instant 1.2. At the time of writing, Anthropic’s Claude 3 Haiku is generally available, and we recommend following the rest of the code using Anthropic’s Claude 3 Haiku. For the release announcement, see Fine-tuning for Anthropic’s Claude 3 Haiku in Amazon Bedrock is now generally available.

If you want to try different models, you must check the model provider’s terms of service yourself. Many providers restrict using their models to train competing models. For the latest model support information, see Supported Regions and models for model customization, and replace baseModelIdentifier accordingly. Different models have different hyperparameters. For more information, see Custom model hyperparameters.

import boto3
import json
import time

bedrock = boto3.client(service_name='bedrock')

# Set parameters
customizationType = "FINE_TUNING"
baseModelIdentifier = "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-instant-v1:2:100k"
roleArn = "${customization-role-arn}"
jobName = "${customization-job-name}"
customModelName = "${customization-model-name}"
hyperParameters = {
    "epochCount": "1",
    "batchSize": "96",
    "learningRateMultiplier": "0.5",
}
trainingDataConfig = {"s3Uri": "s3://${training-bucket}/train.jsonl"}
outputDataConfig = {"s3Uri": "s3://${output-bucket}/myOutputData"}

# Create job
response_ft = bedrock.create_model_customization_job(
    jobName=jobName,
    customModelName=customModelName,
    roleArn=roleArn,
    baseModelIdentifier=baseModelIdentifier,
    customizationType=customizationType,
    hyperParameters=hyperParameters,
    trainingDataConfig=trainingDataConfig,
    outputDataConfig=outputDataConfig
)

jobArn = response_ft.get('jobArn')

# Check job status
while True:
    status = bedrock.get_model_customization_job(jobIdentifier=jobArn).get('status')
    if status != 'InProgress':
        print(status)
        break
    else:
        print(status)
        time.sleep(30)

When the status changes to Completed, your fine-tuned student model is ready for use. To run an inference with this custom model, you need to purchase provisioned throughput. A flexible No commitment option is available for custom models, which can be turned off when not in use and billed by the hour. A cost estimate is provided on the console prior to purchasing provisioned throughput.

On the Amazon Bedrock console, choose Custom models in the navigation pane. Select the model you fine-tuned and choose Purchase provisioned throughput.

The model name and type are automatically selected for you. Select No commitment for Commitment term. After you make this selection, the estimated cost is shown. If you’re okay with the pricing, choose Confirm purchase.
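
If you prefer to script this purchase instead of using the console, the following is a minimal sketch using the create_provisioned_model_throughput API; omitting commitmentDuration selects the no-commitment, hourly-billed option, and the model ID and name shown are placeholders for your own custom model.

import boto3

bedrock = boto3.client(service_name="bedrock")

# Purchase no-commitment provisioned throughput for the custom model
# (omitting commitmentDuration selects the no-commitment, hourly-billed option)
response = bedrock.create_provisioned_model_throughput(
    modelUnits=1,
    provisionedModelName="my-fine-tuned-model-pt",  # placeholder name
    modelId="${custom-model-arn}"                   # ARN of your fine-tuned custom model
)

provisioned_model_arn = response["provisionedModelArn"]
print(provisioned_model_arn)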

When the provisioned throughput becomes available, retrieve the ARN of the provisioned custom model and run inference:

import boto3
import json

bedrock = boto3.client(service_name="bedrock-runtime")

def run_student_model(document, question):

    # Use the same prompt format as the fine-tuning data:
    # the reference documents followed by the question
    user_content = f"{document}\n\nQuestion: {question}"

    body = {
        "messages": [{
            "role": "user", "content": user_content
        }],
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 2048,
        "temperature": 0.5
    }

    response = bedrock.invoke_model(
        body=json.dumps(body),
        modelId="${provisioned_model_arn}",
        accept="application/json",
        contentType="application/json"
    )

    response_body = json.loads(response.get('body').read())

    return response_body['content'][0]['text']
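
A minimal usage sketch, assuming a hypothetical test_df DataFrame with the same document and question columns as the training data:

# Generate an answer for one held-out test sample
# (test_df is a hypothetical DataFrame with document/question columns)
sample = test_df.iloc[0]
answer = run_student_model(sample['document'], sample['question'])
print(answer)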

Evaluate

In this section, we share our experiment results to provide data points on how the synthetic data generated by a teacher model can improve the performance of a student model. For evaluation methods, we used an LLM-as-a-judge approach, where a judge model compares responses from two different models and picks a better response. Additionally, we conducted a manual evaluation on a small subset to assess whether the LLM-as-a-judge and human judges have aligned preferences.

We carried out controlled experiments comparing the following four models. The 1,500 synthetic training samples for the fourth model were generated by Anthropic's Claude 3 Sonnet, with three synthetic samples created per original reference document (3 samples * 500 original reference documents = 1,500 synthetic samples).

Instant base model: Anthropic's Claude Instant without any customization
Fine-tuned 500 original: Anthropic's Claude Instant fine-tuned with 500 original training samples
Fine-tuned 2,000 original: Anthropic's Claude Instant fine-tuned with 2,000 original training samples
Fine-tuned with synthetic: Anthropic's Claude Instant fine-tuned with 500 original training samples plus 1,500 synthetic training samples

LLM-as-a-judge results

LLM output evaluation is an important step in developing generative AI applications, but it is expensive and takes considerable time if done manually. An alternative solution to systematically evaluate output quality in large volume is the LLM-as-a-judge approach, where an LLM is used to evaluate another LLM’s responses.

For our use case, we used Anthropic’s Claude 3 Sonnet and Meta Llama 3 70B as the judges. We asked the LLM judges to compare outputs from two different models and choose one over the other or state a tie. The following chart summarizes the judges’ decisions. Each number represents the percentage of times when the respective model was selected as providing a better answer, excluding tie cases. The test set contained 343 samples.

As shown in the preceding chart, the Anthropic Claude 3 Sonnet judge preferred the response from the fine-tuned model with synthetic examples over Anthropic's Claude Instant base model (84.8% preference) and over the fine-tuned model with 500 original samples (72.3% preference). However, against the fine-tuned model with 2,000 original examples, the synthetic-data model was selected only 32.3% of the time, so the model trained on 2,000 original examples was preferred. This aligns with the expectation that when a large amount of high-quality original data is available, it's better to use that data, because it accurately reflects the target data distribution.

The Meta Llama judge reached a similar conclusion. As shown in the preceding chart, it preferred the response from the fine-tuned model with synthetic samples over Anthropic's Claude Instant base model (75.6% preference) and over the fine-tuned model with 500 original examples (76.4% preference), but the fine-tuned model with 2,000 original examples was the ultimate winner.

Human evaluation results

To complement the LLM-as-a-judge results, we conducted a manual evaluation with two human judges, who performed the same pairwise comparison task as the LLM judges on a subset of 20 examples. The following chart summarizes the results.

As shown in the preceding chart, the two human evaluators reached a similar conclusion, reinforcing the LLM-as-a-judge results. The fine-tuned model with synthetic examples produced outputs that were preferred over those of Anthropic's Claude Instant base model and the fine-tuned model with the 500 original examples; however, it didn't outperform the fine-tuned model with the 2,000 original examples.

These comparative evaluation results from both the LLM judges and human judges strongly demonstrate the power and potential of using data synthesis when training data is scarce. Moreover, by using high-quality data from the teacher model, we can effectively train the student model, which is lightweight and cost-effective for deployment in a production environment.

Amazon Bedrock evaluations

Running LLM-as-a-judge and human evaluation has become much easier with Amazon Bedrock. Model evaluation on Amazon Bedrock allows you to evaluate, compare, and select the best FMs for your use case. Human evaluation workflows can use your own employees or an AWS-managed team as reviewers. For more information on how to set up a human evaluation workflow, see Creating your first model evaluation that uses human workers. The latest feature, LLM-as-a-judge, is now in preview and allows you to assess multiple quality dimensions including correctness, helpfulness, and responsible AI criteria such as answer refusal and harmfulness. For step-by-step instructions, see New RAG evaluation and LLM-as-a-judge capabilities in Amazon Bedrock.

Clean up

Make sure to delete the following resources to avoid incurring cost (a minimal cleanup sketch follows the list):

Provisioned throughput for the custom model
The training_bucket and output_bucket S3 buckets
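
The following is a minimal cleanup sketch, assuming the provisioned model ARN placeholder from earlier and the example bucket names; S3 buckets must be emptied before they can be deleted.

import boto3

bedrock = boto3.client(service_name="bedrock")
s3 = boto3.resource("s3")

# Delete the provisioned throughput for the custom model
bedrock.delete_provisioned_model_throughput(
    provisionedModelId="${provisioned_model_arn}"
)

# Empty and delete the S3 buckets (placeholder names from the prerequisites)
for bucket_name in ["training-bucket-example", "output-bucket-example"]:
    bucket = s3.Bucket(bucket_name)
    bucket.objects.all().delete()
    bucket.delete()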

Conclusion

In this post, we explored how to use Amazon Bedrock to generate synthetic training data using a large teacher language model and fine-tune a smaller student model with synthetic data. We provided instructions on generating synthetic data using the Amazon Bedrock InvokeModel API and fine-tuning the student model using an Amazon Bedrock custom model. Our evaluation results, based on both an LLM-as-a-judge approach and human evaluation, demonstrated the effectiveness of synthetic data in improving the student model’s performance when original training data is limited.

Although fine-tuning with a large amount of high-quality original data remains the ideal approach, our findings highlight the promising potential of synthetic data generation as a viable solution when dealing with data scarcity. This technique can enable more efficient and cost-effective model customization for domain-specific or specialized use cases.

If you’re interested in working with the AWS Generative AI Innovation Center and learning more about LLM customization and other generative AI use cases, visit Generative AI Innovation Center.

About the Authors

Sujeong Cha is a Deep Learning Architect at the AWS Generative AI Innovation Center, where she specializes in model customization and optimization. She has extensive hands-on experience in solving customers’ business use cases by utilizing generative AI as well as traditional AI/ML solutions. Sujeong holds a M.S. degree in Data Science from New York University.

Arijit Ghosh Chowdhury is a Scientist with the AWS Generative AI Innovation Center, where he works on model customization and optimization. In his role, he works on applied research in fine-tuning and model evaluations to enable GenAI for various industries. He has a Master’s degree in Computer Science from the University of Illinois at Urbana Champaign, where his research focused on question answering, search and domain adaptation.

Sungmin Hong is a Senior Applied Scientist at Amazon Generative AI Innovation Center where he helps expedite the variety of use cases of AWS customers. Before joining Amazon, Sungmin was a postdoctoral research fellow at Harvard Medical School. He holds Ph.D. in Computer Science from New York University. Outside of work, Sungmin enjoys hiking, reading and cooking.

Yiyue Qian is an Applied Scientist II at the AWS Generative AI Innovation Center, where she develops generative AI solutions for AWS customers. Her expertise encompasses designing and implementing innovative AI-driven and deep learning techniques, focusing on natural language processing, computer vision, multi-modal learning, and graph learning. Yiyue holds a Ph.D. in Computer Science from the University of Notre Dame, where her research centered on advanced machine learning and deep learning methodologies. Outside of work, she enjoys sports, hiking, and traveling.

Wei-Chih Chen is a Machine Learning Engineer at the AWS Generative AI Innovation Center, where he works on model customization and optimization for LLMs. He also builds tools to help his team tackle various aspects of the LLM development life cycle, including fine-tuning, benchmarking, and load testing, accelerating the adoption of diverse use cases for AWS customers. He holds an M.S. degree in Computer Science from UC Davis.

Hannah Marlowe is a Senior Manager of Model Customization at the AWS Generative AI Innovation Center. Her team specializes in helping customers develop differentiating generative AI solutions using their unique and proprietary data to achieve key business outcomes. She holds a Ph.D. in Physics from the University of Iowa, with a focus on astronomical X-ray analysis and instrumentation development. Outside of work, she can be found hiking, mountain biking, and skiing around the mountains in Colorado.
