Advanced fine-tuning methods on Amazon SageMaker AI

This post provides the theoretical foundation and practical insights needed to navigate the complexities of large language model (LLM) development on Amazon SageMaker AI, helping organizations make optimal choices for their specific use cases, resource constraints, and business objectives.

We also address the three fundamental aspects of LLM development: the core lifecycle stages, the spectrum of fine-tuning methodologies, and the critical alignment techniques that support responsible AI deployment. We explore how Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA and QLoRA have democratized model adaptation, so organizations of all sizes can customize large models to their specific needs. Additionally, we examine alignment approaches such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), which help make sure these powerful systems behave in accordance with human values and organizational requirements. Finally, we cover optimization techniques: knowledge distillation, which enables efficient training through a teacher/student approach in which a smaller model learns from a larger one, and mixed precision training and gradient accumulation, which optimize memory usage and batch processing so that large models can be trained with limited computational resources.

Throughout the post, we focus on practical implementation while addressing the critical considerations of cost, performance, and operational efficiency. We begin with pre-training, the foundational phase where models gain their broad language understanding. Then we examine continued pre-training, a method to adapt models to specific domains or tasks. Finally, we discuss fine-tuning, the process that hones these models for particular applications. Each stage plays a vital role in shaping LLMs into the sophisticated tools we use today, and understanding these processes is key to grasping the full potential and limitations of modern AI language models.

If you’re just getting started with large language models or looking to get more out of your current LLM projects, we’ll walk you through everything you need to know about fine-tuning methods on Amazon SageMaker AI.

Pre-training

Pre-training represents the foundation of LLM development. During this phase, models learn general language understanding and generation capabilities through exposure to massive amounts of text data. This process typically involves training from scratch on diverse datasets, often consisting of hundreds of billions of tokens drawn from books, articles, code repositories, webpages, and other public sources.

Pre-training teaches the model broad linguistic and semantic patterns, such as grammar, context, world knowledge, reasoning, and token prediction, using self-supervised learning techniques like masked language modeling (for example, BERT) or causal language modeling (for example, GPT). At this stage, the model is not tailored to any specific downstream task but rather builds a general-purpose language representation that can be adapted later using fine-tuning or PEFT methods.
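
To make the training objective concrete, the following is a minimal sketch of causal language modeling with Hugging Face Transformers; the small GPT-2 checkpoint is illustrative only and stands in for whatever decoder-only model you would actually pre-train from scratch:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative only: the small GPT-2 checkpoint stands in for whatever
# decoder-only model you would actually pre-train from scratch.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "Pre-training teaches a model to predict the next token in a sequence."
inputs = tokenizer(text, return_tensors="pt")

# Causal language modeling: the labels are the input ids themselves; the model
# shifts them internally, so the loss measures next-token prediction quality.
outputs = model(**inputs, labels=inputs["input_ids"])
print(f"Self-supervised LM loss: {outputs.loss.item():.3f}")
```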

Pre-training is highly resource-intensive, requiring substantial compute (often across thousands of GPUs or AWS Trainium chips), large-scale distributed training frameworks, and careful data curation to balance performance with bias, safety, and accuracy concerns.

Continued pre-training

Continued pre-training (also known as domain-adaptive pre-training or intermediate pre-training) is the process of taking a pre-trained language model and further training it on domain-specific or task-relevant corpora before fine-tuning. Unlike full pre-training from scratch, this approach builds on the existing capabilities of a general-purpose model, allowing it to internalize new patterns, vocabulary, or context relevant to a specific domain.

This step is particularly useful when models must handle specialized terminology or unique syntax, as in fields like law, medicine, or finance. This approach is also essential when organizations need to align AI outputs with their internal documentation standards and proprietary knowledge bases. Additionally, it serves as an effective solution for addressing gaps in language or cultural representation by allowing focused training on underrepresented dialects, languages, or regional content.

To learn more, refer to the following resources:

Alignment methods for LLMs

The alignment of LLMs represents a crucial step in making sure these powerful systems behave in accordance with human values and preferences. AWS provides comprehensive support for implementing various alignment techniques, each offering distinct approaches to achieving this goal. The following are the key approaches.

Reinforcement Learning from Human Feedback

Reinforcement Learning from Human Feedback (RLHF) is one of the most established approaches to model alignment. This method transforms human preferences into a learned reward signal that guides model behavior. The RLHF process consists of three distinct phases. First, we collect comparison data, where human annotators choose between different model outputs for the same prompt. This data forms the foundation for training a reward model, which learns to predict human preferences. Finally, we fine-tune the language model using Proximal Policy Optimization (PPO), optimizing it to maximize the predicted reward.
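
To make the reward modeling phase concrete, the following is a minimal PyTorch sketch of the pairwise (Bradley-Terry) loss commonly used to train a reward model on comparison data; the toy scores are illustrative, and the PPO phase itself is typically handled by libraries such as Hugging Face TRL:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise (Bradley-Terry) loss: push the score of the human-preferred
    response above the score of the rejected one."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy example: scalar scores the reward model assigned to each response in a pair.
chosen = torch.tensor([1.2, 0.4, 2.1])
rejected = torch.tensor([0.3, 0.9, 1.5])
print(reward_model_loss(chosen, rejected))  # smaller when chosen responses score higher
```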

Constitutional AI

Constitutional AI represents an innovative approach to alignment that reduces dependence on human feedback by enabling models to critique and improve their own outputs. This method involves training models to internalize specific principles or rules, then using these principles to guide generation and self-improvement. The reinforcement learning phase is similar to RLHF, except that pairs of responses are generated and evaluated by an AI model, as opposed to a human.

To learn more, refer to the following resources:

Direct Preference Optimization

Direct Preference Optimization (DPO) is an alternative to RLHF, offering a more straightforward path to model alignment. DPO eliminates the need for explicit reward modeling and complex RL training loops, instead directly optimizing the model’s policy to align with human preferences through a modified supervised learning approach.

The key innovation of DPO lies in its formulation of preference learning as a classification problem. Given pairs of responses where one is preferred over the other, DPO trains the model to assign higher probability to preferred responses. This approach maintains theoretical connections to RLHF while significantly simplifying the implementation process. When implementing alignment methods, the effectiveness of DPO heavily depends on the quality, volume, and diversity of the preference dataset. Organizations must establish robust processes for collecting and validating human feedback while mitigating potential biases in the preference labels.
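
The following is a minimal PyTorch sketch of the DPO objective as a classification-style loss over preference pairs; the beta value and the toy log-probabilities are illustrative only:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO objective: a classification-style loss over preference pairs,
    using the frozen reference model's log-probabilities as a baseline."""
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()

# Toy example: summed per-sequence log-probabilities for three preference pairs.
loss = dpo_loss(
    policy_chosen_logps=torch.tensor([-12.0, -8.5, -20.1]),
    policy_rejected_logps=torch.tensor([-13.2, -9.1, -19.8]),
    ref_chosen_logps=torch.tensor([-12.5, -8.7, -20.5]),
    ref_rejected_logps=torch.tensor([-13.0, -9.0, -20.0]),
)
print(loss)
```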

For more information about DPO, see Align Meta Llama 3 to human preferences with DPO, Amazon SageMaker Studio, and Amazon SageMaker Ground Truth.

Fine-tuning methods on AWS

Fine-tuning transforms a pre-trained model into one that excels at specific tasks or domains. This phase involves training the model on carefully curated datasets that represent the target use case. Fine-tuning can range from updating all model parameters to more efficient approaches that modify only a small subset of parameters. Amazon SageMaker HyperPod offers fine-tuning capabilities for supported foundation models (FMs), and Amazon SageMaker Model Training offers the flexibility to run custom fine-tuning implementations and train models at scale without the need to manage infrastructure.

At its core, fine-tuning is a transfer learning process where a model’s existing knowledge is refined and redirected toward specific tasks or domains. This process involves carefully balancing the preservation of the model’s general capabilities while incorporating new, specialized knowledge.

Supervised Fine-Tuning

Supervised Fine-Tuning (SFT) involves updating model parameters using a curated dataset of input-output pairs that reflect the desired behavior. SFT enables precise behavioral control and is particularly effective when the model needs to follow specific instructions, maintain tone, or deliver consistent output formats, making it ideal for applications requiring high reliability and compliance.

In regulated industries like healthcare or finance, SFT is often used after continued pre-training, which exposes the model to large volumes of domain-specific text to build contextual understanding. Although continued pre-training helps the model internalize specialized language (such as clinical or legal terms), SFT teaches it how to perform specific tasks such as generating discharge summaries, filling documentation templates, or complying with institutional guidelines. Both steps are typically essential: continued pre-training makes sure the model understands the domain, and SFT makes sure it behaves as required.

However, because it updates the full model, SFT requires more compute resources and careful dataset construction. The dataset preparation process requires careful curation and validation to make sure the model learns the intended patterns and avoids undesirable biases.
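
The following is a rough sketch of how an SFT job could be launched as a managed training job with the SageMaker Python SDK’s Hugging Face estimator. The training script, S3 path, IAM role, model ID, and framework versions are placeholders; choose a framework version combination that is supported in your Region:

```python
from sagemaker.huggingface import HuggingFace

# Placeholders: train.py, the S3 URI, IAM role, model ID, and framework versions
# are hypothetical; pick a version combination supported in your Region.
estimator = HuggingFace(
    entry_point="train.py",            # your SFT script (for example, a Hugging Face Trainer loop)
    source_dir="./scripts",
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    instance_type="ml.g5.2xlarge",
    instance_count=1,
    transformers_version="4.36",
    pytorch_version="2.1",
    py_version="py310",
    hyperparameters={
        "model_id": "meta-llama/Llama-2-7b-hf",
        "epochs": 3,
        "per_device_train_batch_size": 4,
        "learning_rate": 2e-5,
    },
)

# Launches a managed training job; the "train" channel is mounted inside the container.
estimator.fit({"train": "s3://my-bucket/sft-dataset/"})
```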

For more details about SFT, refer to the following resources:

Parameter-Efficient Fine-Tuning

Parameter-Efficient Fine-Tuning (PEFT) represents a significant advancement in model adaptation, helping organizations customize large models while dramatically reducing computational requirements and costs. The following table summarizes the different types of PEFT.

| PEFT Type | AWS Service | How It Works | Benefits |
| --- | --- | --- | --- |
| LoRA (Low-Rank Adaptation) | SageMaker Training (custom implementation) | Instead of updating all model parameters, LoRA injects trainable rank decomposition matrices into transformer layers, reducing trainable parameters | Memory efficient, cost-efficient, opens up possibility of adapting larger models |
| QLoRA (Quantized LoRA) | SageMaker Training (custom implementation) | Combines model quantization with LoRA, loading the base model in 4-bit precision while adapting it with trainable LoRA parameters | Further reduces memory requirements compared to standard LoRA |
| Prompt tuning (additive) | SageMaker Training (custom implementation) | Prepends a small set of learnable prompt tokens to the input embeddings; only these tokens are trained | Lightweight and fast tuning, good for task-specific adaptation with minimal resources |
| P-tuning (additive) | SageMaker Training (custom implementation) | Uses a deep prompt (tunable embedding vector passed through an MLP) instead of discrete tokens, enhancing expressiveness of prompts | More expressive than prompt tuning, effective in low-resource settings |
| Prefix tuning (additive) | SageMaker Training (custom implementation) | Prepends trainable continuous vectors (prefixes) to the attention keys and values in every transformer layer, leaving the base model frozen | Effective for long-context tasks, avoids full model fine-tuning, and reduces compute needs |

The selection of a PEFT method significantly impacts the success of model adaptation. Each technique presents distinct advantages that make it particularly suitable for specific scenarios. In the following sections, we provide a comprehensive analysis of when to employ different PEFT approaches.

Low-Rank Adaptation

Low-Rank Adaptation (LoRA) excels in scenarios requiring substantial task-specific adaptation while maintaining reasonable computational efficiency. It’s particularly effective in the following use cases:

  • Domain adaptation for enterprise applications – When adapting models to specialized industry vocabularies and conventions, such as legal, medical, or financial domains, LoRA provides sufficient capacity for learning domain-specific patterns while keeping training costs manageable. For instance, a healthcare provider might use LoRA to adapt a base model to medical terminology and clinical documentation standards.
  • Multi-language adaptation – Organizations extending their models to new languages find LoRA particularly effective. It allows the model to learn language-specific nuances while preserving the base model’s general knowledge. For example, a global ecommerce platform might employ LoRA to adapt their customer service model to different regional languages and cultural contexts.
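
The following is a minimal sketch of a typical LoRA setup using the Hugging Face PEFT library; the base model and target modules are illustrative and depend on your model’s architecture:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative base model; target_modules depends on the model architecture.
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16
)

lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices
    lora_alpha=32,                         # scaling factor for the LoRA updates
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```

Because only the adapter weights are trained, the resulting artifact is small enough to version and swap per domain, language, or customer.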

To learn more, refer to the following resources:

Prompt tuning

Prompt tuning is ideal in scenarios requiring lightweight, switchable task adaptations. With prompt tuning, you can store multiple prompt vectors for different tasks without modifying the model itself. A primary use case could be when different customers require slightly different versions of the same basic functionality: prompt tuning allows efficient switching between customer-specific behaviors without loading multiple model versions. It’s useful in the following scenarios:

  • Personalized customer interactions – Companies offering a software as a service (SaaS) platform with customer support or virtual assistants can use prompt tuning to personalize response behavior for different clients without retraining the model. Each client’s brand tone or service nuance can be encoded in prompt vectors.
  • Task switching in multi-tenant systems – In systems where multiple natural language processing (NLP) tasks (for example, summarization, sentiment analysis, classification) need to be served from a single model, prompt tuning enables rapid task switching with minimal overhead.
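
The following is a minimal sketch of prompt tuning with the Hugging Face PEFT library; the base model and initialization text are hypothetical, and only the virtual prompt tokens are trained:

```python
from transformers import AutoModelForCausalLM
from peft import PromptTuningConfig, PromptTuningInit, get_peft_model

# Hypothetical initialization text; only the virtual prompt tokens are trained,
# so each tenant or task can keep its own small set of prompt vectors.
config = PromptTuningConfig(
    task_type="CAUSAL_LM",
    num_virtual_tokens=16,
    prompt_tuning_init=PromptTuningInit.TEXT,
    prompt_tuning_init_text="Respond as a friendly retail support agent:",
    tokenizer_name_or_path="bigscience/bloomz-560m",
)

model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-560m")
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the prompt embeddings are trainable
```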

For more information, see Prompt tuning for causal language modeling.

P-tuning

P-tuning extends prompt tuning by representing prompts as continuous embeddings passed through a small trainable neural network (typically an MLP). Unlike prompt tuning, which directly learns token embeddings, P-tuning enables more expressive and non-linear prompt representations, making it suitable for complex tasks and smaller models. It’s useful in the following use cases:

  • Low-resource domain generalization – A common use case includes low-resource settings where labeled data is limited, yet the task requires nuanced prompt conditioning to steer model behavior. For example, organizations operating in low-data regimes (such as niche scientific research or regional dialect processing) can use P-tuning to extract better task-specific performance without the need for large fine-tuning datasets.
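
The following is a minimal sketch of P-tuning with the Hugging Face PEFT library’s prompt encoder configuration; the model and hyperparameters are illustrative:

```python
from transformers import AutoModelForCausalLM
from peft import PromptEncoderConfig, get_peft_model

# The prompt encoder (a small MLP) produces the continuous prompt embeddings,
# giving a more expressive prompt than directly learned token embeddings.
config = PromptEncoderConfig(
    task_type="CAUSAL_LM",
    num_virtual_tokens=20,
    encoder_hidden_size=128,   # hidden size of the MLP prompt encoder
)

model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-560m")
model = get_peft_model(model, config)
model.print_trainable_parameters()
```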

To learn more, see P-tuning.

Prefix tuning

Prefix tuning prepends trainable continuous vectors, also called prefixes, to the key-value pairs in each attention layer of a transformer, while keeping the base model frozen. This provides control over the model’s behavior without altering its internal weights. Prefix tuning excels in tasks that benefit from conditioning across long contexts, such as document-level summarization or dialogue modeling. It provides a powerful compromise between performance and efficiency, especially when serving multiple tasks or clients from a single frozen base model. Consider the following use case:

  • Dialogue systems – Companies building dialogue systems with varied tones (for example, friendly vs. formal) can use prefix tuning to control the persona and coherence across multi-turn interactions without altering the base model.
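
The following is a minimal sketch of prefix tuning with the Hugging Face PEFT library; the sequence-to-sequence base model and prefix length are illustrative:

```python
from transformers import AutoModelForSeq2SeqLM
from peft import PrefixTuningConfig, get_peft_model

# Trainable prefixes are injected into the attention keys and values of every
# layer; the base model's own weights stay frozen.
config = PrefixTuningConfig(
    task_type="SEQ_2_SEQ_LM",
    num_virtual_tokens=30,
)

model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
model = get_peft_model(model, config)
model.print_trainable_parameters()
```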

For more details, see Prefix tuning for conditional generation.

LLM optimization

Optimizing LLMs is a critical aspect of their development lifecycle, enabling more efficient training, reduced computational costs, and improved deployment flexibility. AWS provides a comprehensive suite of tools and techniques for implementing these optimizations effectively.

Quantization

Quantization is a process of mapping a large set of input values to a smaller set of output values. In digital signal processing and computing, it involves converting continuous values to discrete values and reducing the precision of numbers (for example, from 32-bit to 8-bit). In machine learning (ML), quantization is particularly important for deploying models on resource-constrained devices, because it can significantly reduce model size while maintaining acceptable performance. One of the most widely used techniques is Quantized Low-Rank Adaptation (QLoRA).

QLoRA is an efficient fine-tuning technique for LLMs that combines quantization and LoRA approaches. It loads the base model in 4-bit precision to reduce memory usage during training and employs double quantization for further memory reduction. The technique integrates LoRA by adding trainable rank decomposition matrices and keeping adapter parameters in 16-bit precision, enabling PEFT. QLoRA offers significant benefits, including up to 75% reduced memory usage, the ability to fine-tune large models on consumer GPUs, performance comparable to full fine-tuning, and cost-effective training of LLMs. This has made it particularly popular in the open-source AI community because it makes working with LLMs more accessible to developers with limited computational resources.
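
The following is a minimal sketch of a QLoRA setup using Hugging Face Transformers, bitsandbytes, and PEFT; the model ID and target modules are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization with double quantization for the frozen base weights.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# The LoRA adapters stay in higher precision and are the only trainable parameters.
lora_config = LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05, task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```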

To learn more, refer to the following resources:

Knowledge distillation

Knowledge distillation is a groundbreaking model compression technique in the world of AI, where a smaller student model learns to emulate the sophisticated behavior of a larger teacher model. This innovative approach has revolutionized the way we deploy AI solutions in real-world applications, particularly where computational resources are limited. By learning not only from ground truth labels but also from the teacher model’s probability distributions, the student model can achieve remarkable performance while maintaining a significantly smaller footprint. This makes it invaluable for various practical applications, from powering AI features on mobile devices to enabling edge computing solutions and Internet of Things (IoT) implementations. The key feature of distillation lies in its ability to democratize AI deployment—making sophisticated AI capabilities accessible across different platforms without compromising too much on performance. With knowledge distillation, you can run real-time speech recognition on smartphones, implement computer vision systems in resource-constrained environments, optimize NLP tasks for faster inference, and more.
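
The following is a minimal PyTorch sketch of a common distillation objective that blends the usual cross-entropy loss with a temperature-scaled KL term toward the teacher’s outputs; the temperature and weighting values are illustrative:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend cross-entropy on ground-truth labels with a KL term that pulls the
    student's softened distribution toward the teacher's."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # The KL term is scaled by T^2 to keep gradient magnitudes comparable.
    kd = F.kl_div(soft_student, soft_targets, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

# Toy example: a batch of 4 samples over a 10-class output space.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student_logits, teacher_logits, labels))
```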

For more information about knowledge distillation, refer to the following resources:

Mixed precision training

Mixed precision training is a cutting-edge optimization technique in deep learning that balances computational efficiency with model accuracy. By intelligently combining different numerical precisions—primarily 32-bit (FP32) and 16-bit (FP16) floating-point formats—this approach revolutionizes how we train complex AI models. Its key feature is selective precision usage: maintaining critical operations in FP32 for stability while using FP16 for less sensitive calculations, resulting in a balance of performance and accuracy. This technique has become a game changer in the AI industry, enabling up to three times faster training speeds, a significantly reduced memory footprint, and lower power consumption. It’s particularly valuable for training resource-intensive models like LLMs and complex computer vision systems. For organizations using cloud computing and GPU-accelerated workloads, mixed precision training offers a practical solution to optimize hardware utilization while maintaining model quality. This approach has effectively democratized the training of large-scale AI models, making it more accessible and cost-effective for businesses and researchers alike.
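
The following is a minimal PyTorch sketch of mixed precision training with automatic mixed precision (AMP) and loss scaling; the toy model and data stand in for a real training setup and assume a CUDA device is available:

```python
import torch
from torch import nn

# Toy model and data so the loop runs end to end; in practice these are your
# LLM, optimizer, and data pipeline. Requires a CUDA device.
device = "cuda"
model = nn.Linear(1024, 1024).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
batches = [torch.randn(32, 1024, device=device) for _ in range(10)]

scaler = torch.cuda.amp.GradScaler()  # keeps small FP16 gradients from underflowing

for batch in batches:
    optimizer.zero_grad()
    # Operations inside autocast run in FP16 where safe and FP32 where precision matters.
    with torch.cuda.amp.autocast():
        loss = model(batch).pow(2).mean()
    scaler.scale(loss).backward()  # scale the loss before backpropagation
    scaler.step(optimizer)         # unscales gradients, then applies the update
    scaler.update()                # adjusts the scale factor for the next iteration
```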

To learn more, refer to the following resources:

Gradient accumulation

Gradient accumulation is a powerful technique in deep learning that addresses the challenges of training large models with limited computational resources. Developers can simulate larger batch sizes by accumulating gradients over multiple smaller forward and backward passes before performing a weight update. Think of it as breaking down a large batch into smaller, more manageable mini batches while maintaining the effective training dynamics of the larger batch size. This method has become particularly valuable in scenarios where memory constraints would typically prevent training with optimal batch sizes, such as when working with LLMs or high-resolution image processing networks. By accumulating gradients across several iterations, developers can achieve the benefits of larger batch training—including more stable updates and potentially faster convergence—without requiring the enormous memory footprint typically associated with such approaches. This technique has democratized the training of sophisticated AI models, making it possible for researchers and developers with limited GPU resources to work on cutting-edge deep learning projects that would otherwise be out of reach.

For more information, see the following resources:
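
The following is a minimal PyTorch sketch of the accumulation loop described above; the toy model and micro-batch sizes are illustrative:

```python
import torch
from torch import nn

# Toy setup; replace with your model, optimizer, and dataloader.
model = nn.Linear(512, 512)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
micro_batches = [torch.randn(8, 512) for _ in range(16)]

accumulation_steps = 4  # effective batch size = 8 x 4 = 32

optimizer.zero_grad()
for step, batch in enumerate(micro_batches):
    loss = model(batch).pow(2).mean()
    # Scale each micro-batch loss so the accumulated gradient matches one large batch.
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()      # apply the update only after accumulating gradients
        optimizer.zero_grad()
```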

Conclusion

When fine-tuning ML models on AWS, you can choose the right tool for your specific needs. AWS provides a comprehensive suite of tools for data scientists, ML engineers, and business users to achieve their ML goals. AWS has built solutions to support various levels of ML sophistication, from simple SageMaker training jobs for FM fine-tuning to the power of SageMaker HyperPod for cutting-edge research.

We invite you to explore these options, starting with what suits your current needs, and evolve your approach as those needs change. Your journey with AWS is just beginning, and we’re here to support you every step of the way.


About the authors

Ilan Gleiser is a Principal GenAI Specialist at AWS on the WWSO Frameworks team, focusing on developing scalable generative AI architectures and optimizing foundation model training and inference. With a rich background in AI and machine learning, Ilan has published over 30 blog posts and delivered more than 100 machine learning and HPC prototypes globally over the last 5 years. Ilan holds a master’s degree in mathematical economics.

Prashanth Ramaswamy is a Senior Deep Learning Architect at the AWS Generative AI Innovation Center, where he specializes in model customization and optimization. In his role, he works on fine-tuning, benchmarking, and optimizing models by using generative AI as well as traditional AI/ML solutions. He focuses on collaborating with Amazon customers to identify promising use cases and accelerate the impact of AI solutions to achieve key business outcomes.

Deeksha Razdan is an Applied Scientist at the AWS Generative AI Innovation Center, where she specializes in model customization and optimization. Her work revolves around conducting research and developing generative AI solutions for various industries. She holds a master’s in computer science from UMass Amherst. Outside of work, Deeksha enjoys being in nature.
