This blog post introduces the new Amazon Nova model evaluation features in Amazon SageMaker AI. This release adds custom metrics support, LLM-based preference testing, log probability capture, metadata analysis, and multi-node scaling for large evaluations.
The new features include:
- Custom metrics use bring your own metrics (BYOM) functions so you can define the evaluation criteria for your use case.
- Nova LLM-as-a-Judge handles subjective evaluations through pairwise A/B comparisons, reporting win/tie/loss ratios and Bradley-Terry scores with explanations for each judgment.
- Token-level log probabilities reveal model confidence, useful for calibration and routing decisions.
- Metadata passthrough keeps per-row fields for analysis by customer segment, domain, difficulty, or priority level without extra processing.
- Multi-node execution distributes workloads while maintaining stable aggregation, scaling evaluation datasets from thousands to millions of examples.
In SageMaker AI, teams can define model evaluations using JSONL files in Amazon Simple Storage Service (Amazon S3), then execute them as SageMaker training jobs with control over pre- and post-processing workflows. Results are delivered as structured JSONL with per-example and aggregated metrics plus detailed metadata. Teams can then integrate results with analytics tools like Amazon Athena and AWS Glue, or route them directly into existing observability stacks, with consistent results.

The rest of this post introduces the new features and then demonstrates step-by-step how to set up evaluations, run judge experiments, capture and analyze log probabilities, use metadata for analysis, and configure multi-node runs in an IT support ticket classification example.
Features for model evaluation using Amazon SageMaker AI
When choosing which models to bring into production, a sound evaluation methodology requires testing multiple models, including versions customized in SageMaker AI. To do so effectively, teams need identical test conditions: the same prompts, metrics, and evaluation logic passed to every model. This ensures that score differences reflect model performance, not evaluation methods.
Amazon Nova models that are customized in SageMaker AI now inherit the same evaluation infrastructure as base models, making comparisons fair. Results land as structured JSONL in Amazon S3, ready for Athena queries or routing to your observability stack. Let's discuss some of the new features available for model evaluation.
Bring your own metrics (BYOM)
Standard metrics might not always fit your specific requirements. The custom metrics feature uses AWS Lambda functions to handle data preprocessing, output post-processing, and metric calculation. For instance, a customer service bot might need empathy and brand consistency metrics, while a medical assistant might require clinical accuracy measures. With custom metrics, you can test what matters for your domain.
With this feature, pre- and post-processor functions are encapsulated in a Lambda function: the pre-processor runs before inference to normalize formats or inject context, and the post-processor calculates your custom metrics after the model responds. Finally, the results are aggregated using your choice of min, max, average, or sum, offering greater flexibility when different test examples carry varying importance.
Multimodal LLM-as-a-judge evaluation
LLM-as-a-judge automates preference testing for text as well as multimodal tasks using Amazon Nova LLM-as-a-Judge models for response comparison. The system implements pairwise evaluation: for each prompt, it compares baseline and challenger responses, running the comparison in both forward and backward passes to detect positional bias. The output includes Bradley-Terry probabilities (the likelihood one response is preferred over another) with bootstrap-sampled confidence intervals, giving statistical confidence in preference results.
Nova LLM-as-a-Judge models are purposefully customized for judging-style evaluation tasks. Each judgment includes a natural language rationale explaining why the judge preferred one response over the other, helping with targeted improvements rather than blind optimization. Nova LLM-as-a-Judge evaluates complex reasoning tasks like support ticket classification, where nuanced understanding matters more than simple keyword matching.
The tie detection is equally valuable, identifying where models have reached parity. Combined with standard error metrics, you can determine whether performance differences are statistically meaningful or within noise margins; this is important when deciding if a model update justifies deployment.
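To make these statistics concrete, the following minimal sketch shows how a Bradley-Terry preference probability and a bootstrap confidence interval can be derived from per-prompt win/tie/loss outcomes. This is an illustration of the underlying statistics, not the evaluation container's internal implementation; the tie_weight convention and function names are our own.

```python
import random

def bradley_terry_probability(wins_a: int, wins_b: int, ties: int, tie_weight: float = 0.5) -> float:
    """Probability that model A's responses are preferred over model B's.
    Ties are split between the two models (a common convention)."""
    a = wins_a + tie_weight * ties
    b = wins_b + tie_weight * ties
    return a / (a + b)

def bootstrap_ci(outcomes: list, n_boot: int = 1000, alpha: float = 0.05, seed: int = 0):
    """Bootstrap confidence interval for the preference probability.
    `outcomes` holds one per-prompt result: "a", "b", or "tie"."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        sample = [rng.choice(outcomes) for _ in outcomes]
        estimates.append(
            bradley_terry_probability(sample.count("a"), sample.count("b"), sample.count("tie"))
        )
    estimates.sort()
    return estimates[int(n_boot * alpha / 2)], estimates[int(n_boot * (1 - alpha / 2)) - 1]

# Example: 62 wins for the challenger, 30 for the baseline, 8 ties.
outcomes = ["a"] * 62 + ["b"] * 30 + ["tie"] * 8
print(bradley_terry_probability(62, 30, 8))   # ~0.66 preference probability
print(bootstrap_ci(outcomes))                 # 95% confidence interval
```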
Use log probability for model evaluation
Log probabilities show model confidence for each generated token, revealing insights into model uncertainty and prediction quality. Log probabilities support calibration studies, confidence routing, and hallucination detection beyond basic accuracy. Token-level confidence helps identify uncertain predictions for more reliable systems.
The Nova evaluation container for SageMaker AI model evaluation now captures token-level log probabilities during inference for uncertainty-aware evaluation workflows. The feature integrates with evaluation pipelines and provides the foundation for advanced diagnostic capabilities. You can correlate model confidence with actual performance, implement quality gates based on uncertainty thresholds, and detect potential issues before they impact production systems. Add log probability capture by adding the top_logprobs parameter to your evaluation configuration:
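The fragment below is a minimal sketch of that setting, shown as a Python dictionary mirroring the recipe's inference section; the surrounding recipe structure is illustrative.

```python
# Illustrative fragment of the evaluation recipe, expressed as a Python dict;
# in the recipe file this setting lives under the inference section.
inference_config = {
    "inference": {
        "top_logprobs": 5,  # return log probabilities for the top 5 candidate tokens (integer 0-20)
    }
}
```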
When combined with the metadata passthrough feature discussed in the next section, log probabilities help with stratified confidence analysis across different data segments and use cases. This combination provides detailed insights into model behavior, so teams can understand not just where models fail, but why they fail and how confident they are in their predictions, giving them more control over calibration.
Pass metadata information when using model evaluation
Custom datasets now support a metadata field when you prepare the evaluation dataset. Metadata helps compare results across different models and datasets. The metadata field accepts any string for tagging and analysis and is carried through with the input data and evaluation results. With the addition of the metadata field, the overall schema per data point in the JSONL file becomes the following:
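The record below is a sketch of that shape: the system/query/response fields follow the gen_qa format used later in this post, and the metadata values shown (difficulty, priority, domain) are illustrative. Confirm the exact field names and whether metadata is a flat string or a structured object against the Nova evaluation documentation.

```python
import json

# Illustrative shape of one data point in the evaluation JSONL file.
record = {
    "system": "<system prompt for the task>",
    "query": "<user query, for example the support ticket text>",
    "response": "<reference (ground truth) response>",
    "metadata": {                      # string tags carried through to the results
        "difficulty": "medium",
        "priority": "high",
        "domain": "network",
    },
}
print(json.dumps(record))  # one JSON object per line in the .jsonl file
```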
Enable multi-node evaluation
The evaluation container supports multi-node evaluation for faster processing. To enable it, set the replicas parameter to a value greater than one.
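As a minimal sketch, the relevant setting looks like the following, shown as a Python dict; the exact placement of replicas within the recipe may vary by recipe version.

```python
# Illustrative recipe fragment: values greater than 1 distribute the
# evaluation workload across multiple nodes.
run_config = {
    "replicas": 2,
}
```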
Case study: IT support ticket classification assistant
The following case study demonstrates several of these new features using IT support ticket classification. In this use case, models classify tickets as hardware, software, network, or access issues while explaining their reasoning. This tests both accuracy and explanation quality, and shows custom metrics, metadata passthrough, log probability analysis, and multi-node scaling in practice.

Dataset overview
The support ticket classification dataset contains IT support tickets spanning different priority levels and technical domains, each with structured metadata for detailed analysis. Each evaluation example includes a support ticket query, the system context, and a structured reference response containing the category, the reasoning based on ticket content, and a natural language description. Ground truth responses include thoughtful explanations like, "Based on the error message mentioning network timeout and the user's description of intermittent connectivity, this appears to be a network infrastructure issue requiring escalation to the network team." The dataset includes metadata tags for difficulty level (easy/medium/hard based on technical complexity), priority (low/medium/high), and domain category, demonstrating how metadata passthrough works for stratified analysis without post-processing joins.
Prerequisites
Before you run the notebook, make sure the provisioned environment has the following:
- An AWS account
- AWS Identity and Access Management (IAM) permissions to create a Lambda function, run SageMaker training jobs in that AWS account, and read from and write to an S3 bucket
- A development environment with the SageMaker Python SDK and the Nova custom evaluation SDK (nova_custom_evaluation_sdk)
Step 1: Prepare the prompt
For our support ticket classification task, we need to assess not only whether the model identifies the correct category, but also whether it provides coherent reasoning and adheres to structured output formats, giving the complete picture required in production systems. To craft the prompt, we follow Nova prompting best practices.
System prompt design: We establish the model's role and expected behavior through a focused system prompt:
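A prompt along the following lines captures that intent. It is an illustrative sketch, not the exact prompt from the accompanying notebook.

```python
# Illustrative system prompt for the support ticket classification task.
SYSTEM_PROMPT = """You are an experienced IT support engineer.
Classify each support ticket into exactly one category: hardware, software,
network, or access. Base your decision only on evidence in the ticket,
explain your reasoning, and respond with valid JSON containing the keys
"class", "thought", and "description"."""
```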
This prompt sets clear expectations: the model should act as a domain expert, base decisions on evidence in the ticket, and prioritize accuracy. By framing the task as expert analysis rather than casual observation, we encourage more thoughtful, detailed responses.
Query structure: The query template requests both classification and justification:
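A sketch of such a template follows; the {ticket_text} placeholder name is our own illustrative choice.

```python
# Illustrative query template; {ticket_text} is filled in per evaluation example.
QUERY_TEMPLATE = """Support ticket:
{ticket_text}

Classify this ticket and justify your answer. Return JSON with the keys
"class" (one of hardware, software, network, access), "thought" (your
reasoning), and "description" (a short natural language summary)."""
```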
The explicit request for reasoning is important—it forces the model to articulate its decision-making process, helping with evaluation of explanation quality alongside classification accuracy. This mirrors real-world requirements where model decisions often need to be interpretable for stakeholders or regulatory compliance.
Structured response format: We define the expected output as JSON with three components:
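An example of the expected output shape (values are illustrative) is:

```python
# Example of the structured response the model is asked to return.
EXAMPLE_RESPONSE = {
    "class": "network",
    "thought": "The ticket mentions a network timeout and intermittent "
               "connectivity, which points to network infrastructure.",
    "description": "Intermittent connectivity issue likely caused by network "
                   "infrastructure; escalate to the network team.",
}
```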
This structure supports the three-dimensional evaluation strategy we will discuss later in this post:
- class field – Classification accuracy metrics (precision, recall, F1)
- thought field – Reasoning coherence evaluation
- description field – Natural language quality assessment
By defining the response as parseable JSON, we help with automated metric calculation through our custom Lambda functions while maintaining human-readable explanations for model decisions. This prompt architecture transforms evaluation from simple right/wrong classification into a complete assessment of model capabilities. Production AI systems need to be accurate, explainable, and reliable in their output formatting, and our prompt design explicitly tests all three dimensions. The structured format also facilitates the metadata-driven stratified analysis we'll use in later steps, where we can correlate reasoning quality with confidence scores and difficulty levels across different ticket categories.
Step 2: Prepare the dataset for evaluation with metadata
In this step, we'll prepare our support ticket dataset with metadata support to help with stratified analysis across different categories and difficulty levels. The metadata passthrough feature keeps custom fields intact for detailed performance analysis without post-hoc joins. Let's review an example dataset.
Dataset schema with metadata
For our support ticket classification evaluation, we’ll use the enhanced gen_qa format with structured metadata:
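Here is an illustrative record for the support ticket task in this format. The field values are made up for demonstration, and the metadata shape should be confirmed against the Nova evaluation documentation.

```python
import json

# One illustrative evaluation record in the gen_qa format with metadata.
example = {
    "system": "You are an experienced IT support engineer. Classify each "
              "ticket as hardware, software, network, or access and respond "
              "with JSON containing class, thought, and description.",
    "query": "Support ticket: VPN drops every few minutes; the error says "
             "'network timeout'. Classify this ticket and justify your answer.",
    "response": json.dumps({
        "class": "network",
        "thought": "Intermittent VPN drops with a network timeout error point "
                   "to a network infrastructure issue.",
        "description": "Likely a network infrastructure problem; escalate to "
                       "the network team.",
    }),
    "metadata": {"difficulty": "medium", "priority": "high", "domain": "network"},
}
print(json.dumps(example))  # one line of the evaluation .jsonl file
```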
Next, let's examine how to automatically generate structured metadata for each evaluation example. The metadata enrichment process analyzes the content to classify task types, assess difficulty levels, and identify domains, creating the foundation for stratified analysis in later steps. By embedding this contextual information directly into our dataset, the Nova evaluation pipeline keeps these insights intact, so we can understand model performance across different segments without requiring complex post-processing joins.
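A minimal, heuristic sketch of such an enrichment step is shown below. The keyword rules and thresholds are our own illustrative choices, not the notebook's exact logic.

```python
# Heuristic metadata enrichment: tag each ticket with domain, difficulty,
# and priority based on simple rules over the ticket text (illustrative only).
def enrich_with_metadata(ticket_text: str) -> dict:
    text = ticket_text.lower()
    if any(k in text for k in ("vpn", "timeout", "dns", "wifi")):
        domain = "network"
    elif any(k in text for k in ("password", "login", "permission")):
        domain = "access"
    elif any(k in text for k in ("laptop", "monitor", "printer")):
        domain = "hardware"
    else:
        domain = "software"

    n_words = len(text.split())
    difficulty = "hard" if n_words > 80 else "medium" if n_words > 30 else "easy"
    priority = "high" if ("outage" in text or "urgent" in text) else "medium"
    return {"domain": domain, "difficulty": difficulty, "priority": priority}
```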
Once our dataset is enriched with metadata, we need to export it in the JSONL format required by the Nova evaluation container.
The following export function formats our prepared examples with embedded metadata so that they are ready for the evaluation pipeline, maintaining the exact schema structure needed for the Amazon SageMaker processing workflow:
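A sketch of such an export helper is shown below, assuming the prepared examples already follow the system/query/response/metadata shape described earlier.

```python
import json

# Write one JSON object per line, producing the JSONL file expected by the
# evaluation container.
def export_to_jsonl(examples: list[dict], path: str) -> None:
    with open(path, "w", encoding="utf-8") as f:
        for example in examples:
            f.write(json.dumps(example, ensure_ascii=False) + "\n")

# Example usage (variable names are illustrative):
# export_to_jsonl(prepared_examples, "support_ticket_eval.jsonl")
```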
Step 3: Prepare custom metrics to evaluate custom models
After preparing and verifying that your data adheres to the required schema, the next important step is to develop evaluation metrics code to assess your custom model's performance. Use the Nova evaluation container and the bring your own metrics (BYOM) workflow to control your model evaluation pipeline with custom metrics and data workflows.
Introduction to BYOM workflow
With the BYOM feature, you can tailor your model evaluation workflow to your specific needs with fully customizable pre-processing, post-processing, and metrics capabilities. BYOM gives you control over the evaluation process, helping you to fine-tune and improve your model’s performance metrics according to your project’s unique requirements.
Key tasks for this classification problem
- Define tasks and metrics: In this use case, model evaluation requires three tasks:
- Class prediction accuracy: This will assess how accurately the model predicts the correct class for given inputs. For this we will use standard metrics such as accuracy, precision, recall, and F1 score to quantify performance.
- Schema adherence: Next, we also want to ensure that the model’s outputs conform to the specified schema. This step is important for maintaining consistency and compatibility with downstream applications. For this we will use validation techniques to verify that the output format matches the required schema.
- Thought process coherence: Next, we also want to evaluate the coherence and reasoning behind the model’s decisions. This involves analyzing the model’s thought process to help validate predictions are logically sound. Techniques such as attention mechanisms, interpretability tools, and model explanations can provide insights into the model’s decision-making process.
The BYOM feature for evaluating custom models requires building a Lambda function.
- Configure a custom layer on your Lambda function. In the GitHub release, find and download the pre-built nova-custom-eval-layer.zip file.
- Upload the custom Lambda layer; see the sketch after this list for an example using boto3.
- Add the published layer and AWSLambdaPowertoolsPythonV3-python312-arm64 (or a similar AWS layer matching your Python version and runtime) to your Lambda function to ensure all necessary dependencies are installed.
- For development of the Lambda function, import two key dependencies: one that provides the preprocessor and postprocessor decorators and one to build the lambda_handler; see the sketch after this list.
- Add the preprocessor and postprocessor logic.
- Preprocessor logic: Implement functions that manipulate the data before it is passed to the inference server. This can include prompt manipulations or other data preprocessing steps. The pre-processor expects an event dictionary (dict), a sequence of key-value pairs, as input; an example is included in the sketch after this list.
- Postprocessor logic: Implement functions that process the inference results. This can involve parsing fields, adding custom validations, or calculating specific metrics. The postprocessor also expects an event dict as input; its format is illustrated in the sketch after this list.
- Define the Lambda handler, where you wire in the pre-processor and post-processor logic, before and after inference respectively, as shown in the following sketch.
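First, a sketch of publishing the downloaded layer with boto3. The layer name, runtime, and architecture shown are assumptions; match them to your Lambda function's configuration.

```python
import boto3

# Publish the pre-built layer downloaded from the GitHub release as a Lambda
# layer (layer name, runtime, and architecture are illustrative).
lambda_client = boto3.client("lambda")
with open("nova-custom-eval-layer.zip", "rb") as f:
    layer = lambda_client.publish_layer_version(
        LayerName="nova-custom-eval-layer",
        Content={"ZipFile": f.read()},
        CompatibleRuntimes=["python3.12"],
        CompatibleArchitectures=["arm64"],
    )
print(layer["LayerVersionArn"])  # attach this ARN to your Lambda function
```

Next, a sketch of the Lambda function itself. The import path and decorator names from nova_custom_evaluation_sdk, as well as the event keys (query, model_output, response, stage), are assumptions for illustration; consult the SDK documentation for the actual interface.

```python
import json

# Assumed import: the SDK provides preprocessor and postprocessor decorators.
from nova_custom_evaluation_sdk import preprocessor, postprocessor

@preprocessor
def prepare_request(event: dict) -> dict:
    # Normalize the prompt before it is sent to the inference server.
    event["query"] = event["query"].strip()
    return event

@postprocessor
def score_response(event: dict) -> dict:
    # Parse the model output and compute custom metrics: schema adherence
    # and exact-match class accuracy (event keys are illustrative).
    try:
        predicted = json.loads(event["model_output"])
        reference = json.loads(event["response"])
        schema_ok = all(k in predicted for k in ("class", "thought", "description"))
        correct = float(predicted.get("class") == reference.get("class"))
    except (json.JSONDecodeError, KeyError, TypeError):
        schema_ok, correct = False, 0.0
    event["metrics"] = {"schema_adherence": float(schema_ok), "class_accuracy": correct}
    return event

def lambda_handler(event, context):
    # Route to pre- or post-processing depending on which stage of the
    # evaluation pipeline invoked the function ("stage" key is illustrative).
    if event.get("stage") == "preprocess":
        return prepare_request(event)
    return score_response(event)
```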

Step 4: Launch the evaluation job with custom metrics
Now that you have built your custom processors and encoded your evaluation metrics, you can choose a recipe and make the necessary adjustments so that the preceding BYOM logic gets executed. For this, start from the bring your own data recipe in the public repository and make the following code changes.
- Make sure that the processor key is added to the recipe with the correct details (see the sketch after this list):
- lambda-arn: The Amazon Resource Name (ARN) of the custom Lambda function that handles pre-processing and post-processing
- preprocessing: Whether to add custom pre-processing operations
- post-processing: Whether to add custom post-processing operations
- aggregation: The built-in aggregation function to apply: min, max, average, or sum
- Launch a training job with an evaluation container:
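The processor section might look like the following, shown as a Python dict mirroring the recipe YAML; the ARN is a placeholder and the exact nesting may differ in your recipe version.

```python
# Illustrative processor section for the bring-your-own-data recipe.
processor_config = {
    "processor": {
        "lambda-arn": "arn:aws:lambda:us-east-1:111122223333:function:nova-byom-eval",  # placeholder
        "preprocessing": True,     # run the custom pre-processing step
        "post-processing": True,   # run the custom post-processing step
        "aggregation": "average",  # one of min, max, average, sum
    }
}
```

And here is a minimal sketch of launching the evaluation as a SageMaker training job with the SageMaker Python SDK's generic Estimator. The container image URI, role ARN, instance type, S3 paths, and channel names are placeholders; the accompanying notebook and Nova evaluation documentation show the exact job configuration.

```python
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
estimator = Estimator(
    image_uri="<nova-evaluation-container-image-uri>",             # placeholder
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",  # placeholder
    instance_count=1,
    instance_type="ml.g5.12xlarge",                                # illustrative
    sagemaker_session=session,
)
estimator.fit(
    inputs={
        "train": "s3://your-bucket/eval/support_ticket_eval.jsonl",  # evaluation dataset
        "recipe": "s3://your-bucket/eval/nova_eval_byom.yaml",       # edited recipe (placeholder channel)
    },
    wait=False,
)
```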
Step 5: Use metadata and log probabilities to calibrate the accuracy
You can also include log probability as an inference configuration variable to conduct logit-based evaluations. For this, pass top_logprobs under inference in the recipe:
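For example, the recipe can be updated programmatically with PyYAML; the file name and the assumption that inference is a top-level section are illustrative.

```python
import yaml

# Add top_logprobs under the recipe's inference section (paths illustrative).
with open("recipes/nova_eval_byom.yaml") as f:
    recipe = yaml.safe_load(f)

recipe.setdefault("inference", {})["top_logprobs"] = 5  # integer from 0 to 20

with open("recipes/nova_eval_byom.yaml", "w") as f:
    yaml.safe_dump(recipe, f)
```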
The top_logprobs parameter indicates the number of most likely tokens to return at each token position, each with an associated log probability. The value must be an integer from 0 to 20.
Once the job runs successfully and you have the results, you can find the log probabilities under the pred_logprobs field. This field contains the considered output tokens and the log probability of each output token returned in the message content. You can now use these log probabilities to calibrate your classification task: by converting them to probabilities, you can adjust predictions and treat the probabilities as confidence scores.
Step 6: Failure analysis on low confidence prediction
After calibrating our model using metadata and log probabilities, we can now identify and analyze failure patterns in low-confidence predictions. This analysis helps us understand where our model struggles and guides targeted improvements.
Loading results with log probabilities
Now, let's examine how we combine the inference outputs with the token-level log probability data from the Amazon Nova evaluation pipeline. This helps us perform confidence-aware failure analysis by merging the prediction results with token-level uncertainty information.
Generate a confidence score from the log probabilities by converting them to probabilities and using the score of the first token in the classification response. We use only the first token because subsequent tokens in the classification simply complete the class label. This step enables downstream quality gates: we can route low-confidence predictions to human review, gain a view into model uncertainty to check whether the model is "guessing," prevent hallucinations from reaching users, and later run stratified analysis.
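A sketch of this computation follows. The results file name, the assumed structure of pred_logprobs (a list of per-token entries with a logprob value), and the 0.7 threshold are illustrative assumptions.

```python
import json
import math

# Derive a confidence score from the first generated token's log probability.
def first_token_confidence(record: dict) -> float:
    first = record["pred_logprobs"][0]   # first generated token (assumed structure)
    return math.exp(first["logprob"])    # convert log probability to probability

results = []
with open("eval_results.jsonl") as f:    # placeholder results file
    for line in f:
        rec = json.loads(line)
        rec["confidence"] = first_token_confidence(rec)
        results.append(rec)

low_confidence = [r for r in results if r["confidence"] < 0.7]  # illustrative threshold
print(f"{len(low_confidence)} predictions routed to human review")
```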
Initial analysis
Next, we perform stratified failure analysis, which combines confidence scores with metadata categories to identify specific failure patterns. This multi-dimensional analysis reveals failure modes across different task types, difficulty levels, and domains. The analysis first filters predictions below the confidence threshold, then examines them across metadata categories to pinpoint where the model struggles most. We also analyze content patterns in failed predictions, looking for uncertainty language and categorizing error types (JSON format issues, length problems, or content errors) before generating insights that tell teams exactly what to fix.
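The following sketch shows one way to break failure rates down by metadata with pandas, building on the results and confidence scores from the previous snippet; the column names and threshold are assumptions.

```python
import pandas as pd

# Stratified failure analysis over the passthrough metadata (illustrative).
df = pd.DataFrame(results)  # `results` from the previous snippet
df["difficulty"] = df["metadata"].apply(lambda m: m.get("difficulty"))
df["domain"] = df["metadata"].apply(lambda m: m.get("domain"))
df["failed"] = df["confidence"] < 0.7          # same illustrative threshold

summary = (
    df.groupby(["domain", "difficulty"])["failed"]
      .agg(failure_rate="mean", count="size")
      .sort_values("failure_rate", ascending=False)
)
print(summary)
```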
Preview initial results
Now let's review the initial results and what was parsed out of the model responses.
Step 7: Scale the evaluations on multi-node prediction
After identifying failure patterns, we need to scale our evaluation to larger datasets for testing. Nova evaluation containers now support multi-node evaluation to improve throughput and speed by configuring the number of replicas needed in the recipe.
The Nova evaluation container handles multi-node scaling automatically when you specify more than one replica in your evaluation recipe. Multi-node scaling distributes the workload across multiple nodes while maintaining the same evaluation quality and metadata passthrough capabilities.
Result aggregation and performance analysis
The Nova evaluation container automatically handles result aggregation from multiple replicas, and we can analyze the scaling effectiveness and apply metadata-based analysis to the distributed evaluation.
Multi-node evaluation uses the Nova evaluation container’s built-in capabilities through the replicas parameter, distributing workloads automatically and aggregating results while keeping all metadata-based stratified analysis capabilities. The container handles the complexity of distributed processing, helping teams to scale from thousands to millions of examples by increasing the replica count.
Conclusion
This example demonstrated the basics of Nova model evaluation, showing the capabilities of the new feature releases for the Nova evaluation container. We showed how using custom metrics (BYOM) with domain-specific assessments can drive deep insights. We then explained how to extract and use log probabilities to reveal model uncertainty, easing the implementation of quality gates and confidence-based routing. We showed how the metadata passthrough capability supports downstream stratified analysis, pinpointing where models struggle and where to focus improvements. Finally, we outlined a simple approach to scale these strategies with multi-node evaluation capabilities. Including these features in your evaluation pipeline can help you make informed decisions about which models to adopt and where customization should be applied.
Get started now with the Nova evaluation demo notebook, which has detailed, executable code for each step above, from dataset preparation through failure analysis, giving you a baseline to modify so you can evaluate your own use case.
Check out the Amazon Nova Samples repository for complete code examples across a variety of use cases.
About the authors
Tony Santiago is a Worldwide Partner Solutions Architect at AWS, dedicated to scaling generative AI adoption across Global Systems Integrators. He specializes in solution building, technical go-to-market alignment, and capability development—enabling tens of thousands of builders at GSI partners to deliver AI-powered solutions for their customers. Drawing on more than 20 years of global technology experience and a decade with AWS, Tony champions practical technologies that drive measurable business outcomes. Outside of work, he’s passionate about learning new things and spending time with family.
Akhil Ramaswamy is a Worldwide Specialist Solutions Architect at AWS, specializing in advanced model customization and inference on SageMaker AI. He partners with global enterprises across various industries to solve complex business problems using the AWS generative AI stack. With expertise in building production-grade agentic systems, Akhil focuses on developing scalable go-to-market solutions that help enterprises drive innovation while maximizing ROI. Outside of work, you can find him traveling, working out, or enjoying a nice book.
Anupam Dewan is a Senior Solutions Architect working in Amazon Nova team with a passion for generative AI and its real-world applications. He focuses on building, enabling, and benchmarking AI applications for GenAI customers in Amazon. With a background in AI/ML, data science, and analytics, Anupam helps customers learn and make Amazon Nova work for their GenAI use cases to deliver business results. Outside of work, you can find him hiking or enjoying nature.


