Best practices and lessons for fine-tuning Anthropic’s Claude 3 Haiku on Amazon Bedrock

Fine-tuning is a powerful approach in natural language processing (NLP) and generative AI, allowing businesses to tailor pre-trained large language models (LLMs) for specific tasks. This process involves updating the model’s weights to improve its performance on targeted applications. By fine-tuning, the LLM can adapt its knowledge base to specific data and tasks, resulting in enhanced task-specific capabilities. To achieve optimal results, having a clean, high-quality dataset is of paramount importance. A well-curated dataset forms the foundation for successful fine-tuning. Additionally, careful adjustment of hyperparameters such as learning rate multiplier and batch size plays a crucial role in optimizing the model’s adaptation to the target task.

The capabilities in Amazon Bedrock for fine-tuning LLMs offer substantial benefits for enterprises. This feature enables companies to optimize models like Anthropic’s Claude 3 Haiku on Amazon Bedrock for custom use cases, potentially achieving performance levels comparable to or even surpassing more advanced models such as Anthropic’s Claude 3 Opus or Anthropic’s Claude 3.5 Sonnet. The result is a significant improvement in task-specific performance, while potentially reducing costs and latency. This approach offers a versatile solution to satisfy your goals for performance and response time, allowing businesses to balance capability, domain knowledge, and efficiency in your AI-powered applications.

In this post, we explore the best practices and lessons learned for fine-tuning Anthropic’s Claude 3 Haiku on Amazon Bedrock. We discuss the important components of fine-tuning, including use case definition, data preparation, model customization, and performance evaluation. This post dives deep into key aspects such as hyperparameter optimization, data cleaning techniques, and the effectiveness of fine-tuning compared to base models. We also provide insights on how to achieve optimal results for different dataset sizes and use cases, backed by experimental data and performance metrics.

As part of this post, we first introduce general best practices for fine-tuning Anthropic’s Claude 3 Haiku on Amazon Bedrock, and then present specific examples with the TAT- QA dataset (Tabular And Textual dataset for Question Answering).

Recommended use cases for fine-tuning

The use cases that are the most well-suited for fine-tuning Anthropic’s Claude 3 Haiku include the following:

Classification – For example, when you have 10,000 labeled examples and want Anthropic’s Claude 3 Haiku to do well at this task.
Structured outputs – For example, when you have 10,000 labeled examples specific to your use case and need Anthropic’s Claude 3 Haiku to accurately identify them.
Tools and APIs – For example, when you need to teach Anthropic’s Claude 3 Haiku how to use your APIs well.
Particular tone or language – For example, when you need Anthropic’s Claude 3 Haiku to respond with a particular tone or language specific to your brand.

Fine-tuning Anthropic’s Claude 3 Haiku has demonstrated superior performance compared to few-shot prompt engineering on base Anthropic’s Claude 3 Haiku, Anthropic’s Claude 3 Sonnet, and Anthropic’s Claude 3.5 Sonnet across various tasks. These tasks include summarization, classification, information retrieval, open-book Q&A, and custom language generation such as SQL. However, achieving optimal performance with fine-tuning requires effort and adherence to best practices.

To better illustrate the effectiveness of fine-tuning compared to other approaches, the following table provides a comprehensive overview of various problem types, examples, and their likelihood of success when using fine-tuning versus prompting with Retrieval Augmented Generation (RAG). This comparison can help you understand when and how to apply these different techniques effectively.

Problem
Examples
Likelihood of Success with Fine-tuning
Likelihood of Success with Prompting + RAG

Make the model follow a specific format or tone
Instruct the model to use a specific JSON schema or talk like the organization’s customer service reps
Very High
High

Teach the model a new skill
Teach the model how to call APIs, fill out proprietary documents, or classify customer support tickets
High
Medium

Teach the model a new skill, and hope it learns similar skills
Teach the model to summarize contract documents, in order to learn how to write better contract documents
Low
Medium

Teach the model new knowledge, and expect it to use that knowledge for general tasks
Teach the model the organizations’ acronyms or more music facts
Low
Medium

Prerequisites

Before diving into the best practices and optimizing fine-tuning LLMs on Amazon Bedrock, familiarize yourself with the general process and how-to outlined in Fine-tune Anthropic’s Claude 3 Haiku in Amazon Bedrock to boost model accuracy and quality. The post provides essential background information and context for the fine-tuning process, including step-by-step guidance on fine-tuning Anthropic’s Claude 3 Haiku on Amazon Bedrock both through the Amazon Bedrock console and Amazon Bedrock API.

LLM fine-tuning lifecycle

The process of fine-tuning an LLM like Anthropic’s Claude 3 Haiku on Amazon Bedrock typically follows these key stages:

Use case definition – Clearly define the specific task or knowledge domain for fine-tuning
Data preparation – Gather and clean high-quality datasets relevant to the use case
Data formatting – Structure the data following best practices, including semantic blocks and system prompts where appropriate
Model customization – Configure the fine-tuning job on Amazon Bedrock, setting parameters like learning rate and batch size, enabling features like early stopping to prevent overfitting
Training and monitoring – Run the training job and monitor the status of training job
Performance evaluation – Assess the fine-tuned model’s performance against relevant metrics, comparing it to base models
Iteration and deployment – Based on the result, refine the process if needed, then deploy the model for production

Throughout this journey, depending on the business case, you may choose to combine fine-tuning with techniques like prompt engineering for optimal results. The process is inherently iterative, allowing for continuous improvement as new data or requirements emerge.

Use case and dataset

The TAT-QA dataset is related to a use case for question answering on a hybrid of tabular and textual content in finance where tabular data is organized in table formats such as HTML, JSON, Markdown, and LaTeX. We focus on the task of answering questions about the table. The evaluation metric is the F1 score that measures the word-to-word matching of the extracted content between the generated output and the ground truth answer. The TAT-QA dataset has been divided into train (28,832 rows), dev (3,632 rows), and test (3,572 rows).

The following screenshot provides a snapshot of the TAT-QA data, which comprises a table with tabular and textual financial data. Following this financial data table, a detailed question-answer set is presented to demonstrate the complexity and depth of analysis possible with the TAT-QA dataset. This comprehensive table is from the paper TAT-QA: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance, and it includes several key components:

Reasoning types – Each question is categorized by the type of reasoning required
Questions – A variety of questions that test different aspects of understanding and interpreting the financial data
Answers – The correct responses to each question, showcasing the precision required in financial analysis
Scale – Where applicable, the unit of measurement for the answer
Derivation – For some questions, the calculation or logic used to arrive at the answer is provided

The following screenshot shows a formatted version of the data as JSONL and is passed to Anthropic’s Claude 3 Haiku for fine-tuning training data. The preceding table has been structured in JSONL format with system, user role (which contains the data and the question), and assistant role (which has answers). The table is enclosed within the XML tag <table><table>, helping Anthropic’s Claude 3 Haiku parse the prompt with the data from the table. For the model fine-tuning and performance evaluation, we randomly selected 10,000 examples from the TAT-QA dataset to fine-tune the model, and randomly picked 3,572 records from the remainder of the dataset as testing data.

Best practices for data cleaning and data validation

When fine-tuning the Anthropic’s Claude 3 Haiku model, the quality of training data is paramount and serves as the primary determinant of the output quality, surpassing the importance of any other step in the fine-tuning process. Our experiments have consistently shown that high-quality datasets, even if smaller in size, yield better results than a larger but less refined one. This “quality over quantity” approach should guide the entire data preparation process. Data cleaning and validation are essential steps in maintaining the quality of the training set. The following are two effective methods:

Human evaluation – This method involves subject matter experts (SMEs) manually reviewing each data point for quality and relevance. Though time-consuming, it provides unparalleled insight into the nuances of the specific tasks.
LLM as a judge – For large datasets, using Anthropic’s Claude models as a judge can be more efficient. For example, you can use Anthropic’s Claude 3.5 Sonnet as a judge to decide whether each provided training record meets the high quality requirement. The following is an example prompt template:

{‘prompt’: {
‘system’: “You are a reliable and impartial expert judge in question/answering data assessment. “,
‘messages’: [
{‘role’: ‘user’, ‘content’: [{‘type’: ‘text’, ‘text’: ‘Your task is to take a question, an answer, and a context which may include multiple documents, and provide a judgment on whether the answer to the question is correct or not. This decision should be based either on the provided context or your general knowledge and memory. If the answer contradicts the information in context, it’s incorrect. A correct answer is ideally derived from the given context. If no context is given, a correct answer should be factually true and directly and unambiguously address the question.nnProvide a short step-by-step reasoning with a maximum of 4 sentences within the <reason></reason> xml tags and provide a single correct or incorrect response within the <judgement></judgement> xml tags.n <context>n…n</context>n<question>n…n</question>n<answer>n…n</answer>n’}]}]}}

The following is a sample output from Anthropic’s Claude 3.5 Sonnet:

{‘id’: ‘job_id’,
‘type’: ‘message’,
‘role’: ‘assistant’,
‘model’: ‘claude-3-5-sonnet-20240620’,
‘content’: [{‘type’: ‘text’,
‘text’: ‘<reason>n1. I’ll check the table for information… </reason>nn<judgement>correct</judgement>’}],
‘stop_reason’: ‘end_turn’,
‘stop_sequence’: None,
‘usage’: {‘input_tokens’: 923, ‘output_tokens’: 90}}

This LLM-as-a-judge approach is effective for large datasets, allowing for efficient and consistent quality assessment across a wide range of examples. It can help identify and filter out low-quality or irrelevant data points, making sure only the most suitable examples are used for fine-tuning.

The format of your training data is equally important. Although it’s optional, it’s highly recommended to include a system prompt that clearly defines the model’s role and tasks. In addition, including rationales within XML tags can provide valuable context for the model and facilitate extraction of key information. Prompt optimization is one of the key factors in improving model performance. Following established guidelines, such as those provided by Anthropic, can significantly enhance results. This might include structuring prompts with semantic blocks within XML tags, both in training samples and at inference time.

By adhering to these best practices in data cleaning, validation, and formatting, you can create a high-quality dataset that forms the foundation for successful fine-tuning. In the world of model training, quality outweighs quantity, and a well-prepared dataset is key to unlocking the full potential of fine-tuning Anthropic’s Claude 3 Haiku.

Best practices for performing model customization training jobs

When fine-tuning Anthropic’s Claude 3 Haiku on Amazon Bedrock, it’s crucial to optimize your training parameters to achieve the best possible performance. Our experiments have revealed several key insights that can guide you in effectively setting up your customization training jobs.

One of the most critical aspects of fine-tuning is selecting the right hyperparameters, particularly learning rate multiplier and batch size (see the appendix in this post for definitions). Our experiment results have shown that these two factors can significantly impact the model’s performance, with improvements ranging from 2–10% across different tasks. For the learning rate multiplier, the value ranges between 0.1–2.0, with a default value of 1.0. We suggest starting with the default value and potentially adjusting this value based on your evaluation result. Batch size is another important parameter, and its optimal value can vary depending on your dataset size. Based on our hyperparameter tuning experiments across different use cases, the API allows a range of 4–256, with a default of 32. However, we’ve observed that dynamically adjusting the batch size based on your dataset size can lead to better results:

For datasets with 1,000 or more examples, aim for a batch size between 32–64
For datasets between 500–1,000 examples, a batch size between 16–32 is generally suitable
For smaller datasets with fewer than 500 examples, consider a batch size between 4–16

The following chart illustrates how model performance improves as the size of the training dataset increases, as well as the change of optimal parameters, using the TAT-QA dataset. Each data point is annotated with the optimal learning rate multiplier (LRM), batch size (BS), and number of epochs (Epoch) used to achieve the best performance with the dataset size. We can observe that larger datasets tend to benefit from higher learning rates and batch sizes, whereas smaller datasets require more training epochs. The red dashed line is the baseline Anthropic’s Claude 3 Haiku performance without fine-tuning efforts.

By following these guidelines, you can configure an Anthropic’s Claude 3 Haiku fine-tuning job with a higher chance of success. However, remember that these are general recommendations and the optimal settings may vary depending on your specific use case and dataset characteristics.

In scenarios with large amounts of data (1,000–10,000 examples), the learning rate tends to have a more significant impact on performance. Conversely, for smaller datasets (32–100 examples), the batch size becomes the dominant factor.

Performance evaluations

The fine-tuned Anthropic’s Claude 3 Haiku model demonstrated substantial performance improvements over base models when evaluated on the financial Q&A task, highlighting the effectiveness of the fine-tuning process on specialized data. Based on the evaluation results, we found the following:

Fine-tuned Anthropic’s Claude 3 Haiku performed better than Anthropic’s Claude 3 Haiku, Anthropic’s Claude 3 Sonnet, and Anthropic’s Claude 3.5 Sonnet for TAT-QA dataset across the target use case of question answering on financial text and tabular content.
For the performance evaluation metric F1 score (see the appendix for definition), fine-tuned Anthropic’s Claude 3 Haiku achieved a score of 91.2%, which is a 24.60% improvement over the Anthropic’s Claude 3 Haiku base model’s score of 73.2%. Fine-tuned Anthropic’s Claude 3 Haiku also achieved a 19.6% improvement over the Anthropic’s Claude 3 Sonnet base model’s performance, which obtained an F1 score of 76.3%. Fine-tuned Anthropic’s Claude 3 Haiku even achieved better performance over the Anthropic’s Claude 3.5 Sonnet base model.

The following table provides a detailed comparison of the performance metrics for the fine-tuned Claude 3 Haiku model against various base models, illustrating the significant improvements achieved through fine-tuning.

.
.
.
.
.
Fine-Tuned Model Performance
Base Model Performance
Improvement: Fine-Tuned Anthropic’s Claude 3 Haiku vs. Base Models

Target Use Case
Task Type
Fine-Tuning Data Size
Test Data Size
Eval Metric
Anthropic’s Claude 3 Haiku
Anthropic’s Claude 3 Haiku (Base Model)
Anthropic’s Claude 3 Sonnet
Anthropic’s Claude 3.5 Sonnet
vs. Anthropic’s Claude 3 Haiku Base
vs. Anthropic’s Claude 3 Sonnet Base
vs. Anthropic’s Claude 3.5 Sonnet Base

TAT-QA
Q&A on financial text and tabular content
10,000
3,572
F1 score
91.2%
73.2%
76.3%
83.0%
24.6%
19.6%
9.9%

Few-shot examples improve performance not only on the base model, but also on fine-tuned models, especially when the fine-tuning data is small.

Fine-tuning also demonstrated significant benefits in reducing token usage. On the TAT-QA HTML test set (893 examples), the fine-tuned Anthropic’s Claude 3 Haiku model reduced the average output token count by 35% compared to the base model, as shown in the following table.

Model
Average Output Token
% Reduced
Median
% Reduced
Standard Deviation
Minimum Token
Maximum Token

Anthropic’s Claude 3 Haiku Base
34
–
28
–
27
13
245

Anthropic’s Claude 3 Haiku Fine-Tuned
22
35%
17
39%
14
13
179

We use the following figures to illustrate the token count distribution for both the base Anthropic’s Claude 3 Haiku and fine-tuned Anthropic’s Claude 3 Haiku models. The left graph shows the distribution for the base model, and the right graph displays the distribution for the fine-tuned model. These histograms demonstrate a shift towards more concise output in the fine-tuned model, with a notable reduction in the frequency of longer token sequences.

To further illustrate this improvement, consider the following example from the test set:

Question: “How did the company adopt Topic 606?”
Ground truth answer: “the modified retrospective method”
Base Anthropic’s Claude 3 Haiku response: “The company adopted the provisions of Topic 606 in fiscal 2019 utilizing the modified retrospective method”
Fine-tuned Anthropic’s Claude 3 Haiku response: “the modified retrospective method”

As evident from this example, the fine-tuned model produces a more concise and precise answer, matching the ground truth exactly, whereas the base model includes additional, unnecessary information. This reduction in token usage, combined with improved accuracy, can lead to enhanced efficiency and reduced costs in production deployments.

Conclusion

Fine-tuning Anthropic’s Claude 3 Haiku on Amazon Bedrock offers significant performance improvements for specialized tasks. Our experiments demonstrate that careful attention to data quality, hyperparameter optimization, and best practices in the fine-tuning process can yield substantial gains over base models. Key takeaways include the following:

The importance of high-quality, task-specific datasets, even if smaller in size
Optimal hyperparameter settings vary based on dataset size and task complexity
Fine-tuned models consistently outperform base models across various metrics
The process is iterative, allowing for continuous improvement as new data or requirements emerge

Although fine-tuning provides impressive results, combining it with other techniques like prompt engineering may lead to even better outcomes. As LLM technology continues to evolve, mastering fine-tuning techniques will be crucial for organizations looking to use these powerful models for specific use cases and tasks.

Now you’re ready to fine-tune Anthropic’s Claude 3 Haiku on Amazon Bedrock for your use case. We look forward to seeing what you build when you put this new technology to work for your business.

Appendix

We used the following hyperparameters as part of our fine-tuning:

Learning rate multiplier – Learning rate multiplier is one of the most critical hyperparameters in LLM fine-tuning. It influences the learning rate at which model parameters are updated after each batch.
Batch size – Batch size is the number of training examples processed in one iteration. It directly impacts GPU memory consumption and training dynamics.
Epoch – One epoch means the model has seen every example in the dataset one time. The number of epochs is a crucial hyperparameter that affects model performance and training efficiency.

For our evaluation, we used the F1 score, which is an evaluation metric to assess the performance of LLMs and traditional ML models.

To compute the F1 score for LLM evaluation, we need to define precision and recall at the token level. Precision measures the proportion of generated tokens that match the reference tokens, and recall measures the proportion of reference tokens that are captured by the generated tokens. The F1 score ranges from 0–100, with 100 being the best possible score and 0 being the lowest. However, interpretation can vary depending on the specific task and requirements.

We calculate these metrics as follows:

Precision = (Number of matching tokens in generated text) / (Total number of tokens in generated text)
Recall = (Number of matching tokens in generated text) / (Total number of tokens in reference text)
F1 = (2 * (Precision * Recall) / (Precision + Recall)) * 100

For example, let’s say the LLM generates the sentence “The cat sits on the mat in the sun” and the reference sentence is “The cat sits on the soft mat under the warm sun.” The precision would be 6/9 (6 matching tokens out of 9 generated tokens), and the recall would be 6/11 (6 matching tokens out of 11 reference tokens).

Precision = 6/9 ≈ 0.667
Recall = 6/11 ≈ 0.545
F1 score = (2 * (0.667 * 0.545) / (0.667 + 0.545)) * 100 ≈ 59.90

About the Authors

Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.

Sovik Kumar Nath is an AI/ML and Generative AI Senior Solutions Architect with AWS. He has extensive experience designing end-to-end machine learning and business analytics solutions in finance, operations, marketing, healthcare, supply chain management, and IoT. He has double master’s degrees from the University of South Florida and University of Fribourg, Switzerland, and a bachelor’s degree from the Indian Institute of Technology, Kharagpur. Outside of work, Sovik enjoys traveling, and adventures.

Jennifer Zhu is a Senior Applied Scientist at AWS Bedrock, where she helps building and scaling generative AI applications with foundation models. Jennifer holds a PhD degree from Cornell University, and a master degree from University of San Francisco. Outside of work, she enjoys reading books and watching tennis games.

Fang Liu is a principal machine learning engineer at Amazon Web Services, where he has extensive experience in building AI/ML products using cutting-edge technologies. He has worked on notable projects such as Amazon Transcribe and Amazon Bedrock. Fang Liu holds a master’s degree in computer science from Tsinghua University.

Yanjun Qi is a Senior Applied Science Manager at the Amazon Bedrock Science. She innovates and applies machine learning to help AWS customers speed up their AI and cloud adoption.