Amazon SageMaker Ground Truth significantly reduces the cost and time required for labeling data by integrating human annotators with machine learning to automate the labeling process. You can use SageMaker Ground Truth to create labeling jobs, which are workflows where data objects (such as images, videos, or documents) need to be annotated by human workers. These labeling jobs are distributed among a workteam—a group of workers assigned to perform the annotations. To access the data objects they need to label, workers are provided with Amazon S3 presigned URLs.
A presigned URL is a temporary URL that grants time-limited access to an Amazon Simple Storage Service (Amazon S3) object. In the context of SageMaker Ground Truth, these presigned URLs are generated using the grant_read_access Liquid filter and embedded into the task templates. Workers can then use these URLs to directly access the necessary files, such as images or documents, in their web browsers for annotation purposes.
While presigned URLs offer a convenient way to grant temporary access to S3 objects, sharing these URLs with people outside of the workteam can lead to unintended access of those objects. To mitigate this risk and enhance the security of SageMaker Ground Truth labeling tasks, we have introduced a new feature that adds an additional layer of security by restricting access to the presigned URLs to the worker’s IP address or virtual private cloud (VPC) endpoint from which they access the labeling task. In this blog post, we show you how to enable this feature, allowing you to enhance your data security as needed, and outline the success criteria for this feature, including the scenarios where it will be most beneficial.
Prerequisites
Before you get started configuring IP-restricted presigned URLs, the following resources can help you understand the background concepts:
Amazon S3 presigned URL: This documentation covers the use of Amazon S3 presigned URLs, which provide temporary access to objects. Understanding how presigned URLs work will be beneficial.
Use Amazon SageMaker Ground Truth to label data: This guide explains how to use SageMaker Ground Truth for data labeling tasks, including setting up workteams and workforces. Familiarity with these concepts will be helpful when configuring IP restrictions for your workteams.
Introducing IP-restricted presigned URLs
Working closely with our customers, we recognized the need for an enhanced security posture and stricter access controls on presigned URLs. To address this, we introduced a new feature that uses the AWS global condition context keys aws:SourceIp and aws:VpcSourceIp to let you restrict presigned URL access to specific IP addresses or VPC endpoints. By incorporating AWS Identity and Access Management (IAM) policy constraints, you can now make presigned URLs accessible only from an IP address or VPC endpoint of your choice. This IP-based access control effectively locks down the presigned URL to the worker’s location, mitigating the risk of unauthorized access or unintended sharing.
Benefits of the new feature
This update brings several significant security benefits to SageMaker Ground Truth:
Enhanced data privacy: IP restrictions make presigned URLs accessible only from customer-approved locations, such as corporate VPNs, workers’ home networks, or designated VPC endpoints. Although the presigned URLs are pre-authenticated, this feature adds an additional layer of security by verifying the access location and locking the URL to that location until the task is completed.
Reduced risk of unauthorized access: Enforcing IP-based access controls minimizes the risk of data being accessed from unauthorized locations and mitigates the risk of data sharing outside the worker’s approved access network. This is particularly important when dealing with sensitive or confidential data.
Flexible security options: You can apply these restrictions in either VPC or non-VPC settings, allowing you to tailor security measures to your organization’s specific needs.
Auditing and compliance: By locking down presigned URLs to specific IP addresses or VPC endpoints, you can more easily track and audit access to your organization’s data, helping achieve compliance with internal policies and external regulations.
Seamless integration: This new feature seamlessly integrates with existing SageMaker Ground Truth workflows, providing enhanced security without disrupting established labeling processes or requiring significant changes to existing infrastructure.
By introducing IP-restricted presigned URLs, SageMaker Ground Truth gives you greater control over data access, so sensitive information remains accessible only to authorized workers within approved locations.
Configuring IP-restricted presigned URLs for SageMaker Ground Truth
The new IP restriction feature for presigned URLs in SageMaker Ground Truth can be enabled through the SageMaker API or the AWS Command Line Interface (AWS CLI). Before we go into the configuration of this new feature, let’s look at how you can create and update workteams today using the AWS CLI. You can also perform these operations through the SageMaker API using the AWS SDK.
Here’s an example of creating a new workteam using the create-workteam command:
aws sagemaker create-workteam \
  --description "A team for image labeling tasks" \
  --workforce-name "default" \
  --workteam-name "MyWorkteam" \
  --member-definitions '{
    "CognitoMemberDefinition": {
      "ClientId": "exampleclientid",
      "UserGroup": "sagemaker-groundtruth-user-group",
      "UserPool": "us-west-2_examplepool"
    }
  }'
To update an existing workteam, you use the update-workteam command:
aws sagemaker update-workteam \
  --workteam-name "MyWorkteam" \
  --description "Updated description for image labeling tasks"
Note that these examples only show a subset of the available parameters for the create-workteam and update-workteam APIs. You can find detailed documentation and examples in the SageMaker Ground Truth Developer Guide.
Enabling IP restrictions for presigned URLs
With the new IP restriction feature, you can now configure IP-based access constraints specific to each workteam when creating a new workteam or modifying an existing one. Here’s how you can enable these restrictions:
When creating or updating a workteam, you can specify a WorkerAccessConfiguration object, which defines access constraints for the workers in that workteam.
Within the WorkerAccessConfiguration, you can include an S3Presign object, which allows you to set access configurations for the presigned URLs used by the workers. Currently, only IamPolicyConstraints can be added to the S3Presign object.
SageMaker Ground Truth provides two Liquid filters that you can use in your custom worker task templates to generate presigned URLs:
grant_read_access: This filter generates a presigned URL for the specified S3 object, granting temporary read access. The command will look like:
s3_presign: This new filter serves the same purpose as grant_read_access but makes it clear that the generated URL is subject to the S3Presign configuration defined for the workteam. The command will look like:
The S3Presign object supports IamPolicyConstraints, where you can enable or disable the SourceIp and VpcSourceIp constraints:
SourceIp: When enabled, workers can access presigned URLs only from the specified IP addresses or ranges.
VpcSourceIp: When enabled, workers can access presigned URLs only from the specified VPC endpoints within your AWS account.
You can call the SageMaker ListWorkteams or DescribeWorkteam APIs to view workteams’ metadata, including the WorkerAccessConfiguration.
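As a rough illustration of reading this metadata through the AWS SDK, the following Python sketch pulls the IamPolicyConstraints out of a DescribeWorkteam response. It assumes a Boto3-style SageMaker client whose describe_workteam call returns the WorkerAccessConfiguration under the Workteam key. The helper name get_presign_constraints and the stub client are hypothetical and exist only so the sketch runs without AWS credentials.

```python
def get_presign_constraints(sagemaker_client, workteam_name):
    """Return the IamPolicyConstraints of a workteam, or {} if none are set.

    sagemaker_client is assumed to behave like boto3.client("sagemaker").
    """
    response = sagemaker_client.describe_workteam(WorkteamName=workteam_name)
    access_config = response["Workteam"].get("WorkerAccessConfiguration", {})
    return access_config.get("S3Presign", {}).get("IamPolicyConstraints", {})


class StubSageMakerClient:
    """Hypothetical stand-in for a real client so the example runs offline."""

    def describe_workteam(self, WorkteamName):
        return {
            "Workteam": {
                "WorkteamName": WorkteamName,
                "WorkerAccessConfiguration": {
                    "S3Presign": {
                        "IamPolicyConstraints": {
                            "SourceIp": "Enabled",
                            "VpcSourceIp": "Disabled",
                        }
                    }
                },
            }
        }


constraints = get_presign_constraints(StubSageMakerClient(), "exampleworkteam")
print(constraints)
```

With a real client, you would pass boto3.client("sagemaker") in place of the stub. A workteam without any configured constraints yields an empty dictionary.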
Let’s say you want to create or update a workteam so that presigned URLs are restricted to the public IP address of the worker who originally accessed the labeling task.
Create workteam:
aws sagemaker create-workteam \
  --description "An example workteam with S3 presigned URLs restricted" \
  --workforce-name "default" \
  --workteam-name "exampleworkteam" \
  --member-definitions '{
    "CognitoMemberDefinition": {
      "ClientId": "exampleclientid",
      "UserGroup": "sagemaker-groundtruth-user-group",
      "UserPool": "us-west-2_examplepool"
    }
  }' \
  --worker-access-configuration '{
    "S3Presign": {
      "IamPolicyConstraints": {
        "SourceIp": "Enabled",
        "VpcSourceIp": "Disabled"
      }
    }
  }'
Update workteam:
aws sagemaker update-workteam \
  --workteam-name "existingworkteam" \
  --worker-access-configuration '{
    "S3Presign": {
      "IamPolicyConstraints": {
        "SourceIp": "Enabled",
        "VpcSourceIp": "Disabled"
      }
    }
  }'
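The same update can be sketched with the AWS SDK for Python. The snippet below builds the WorkerAccessConfiguration structure used in the CLI examples and passes it to an UpdateWorkteam call. The helper names are hypothetical, and the client is assumed to behave like boto3.client("sagemaker"); treat this as a minimal sketch rather than a complete implementation.

```python
import json


def build_worker_access_configuration(source_ip=False, vpc_source_ip=False):
    """Build the WorkerAccessConfiguration structure with the chosen constraints."""

    def flag(enabled):
        return "Enabled" if enabled else "Disabled"

    return {
        "S3Presign": {
            "IamPolicyConstraints": {
                "SourceIp": flag(source_ip),
                "VpcSourceIp": flag(vpc_source_ip),
            }
        }
    }


def restrict_workteam_to_source_ip(sagemaker_client, workteam_name):
    """Enable the SourceIp constraint on an existing workteam.

    sagemaker_client is assumed to behave like boto3.client("sagemaker").
    """
    return sagemaker_client.update_workteam(
        WorkteamName=workteam_name,
        WorkerAccessConfiguration=build_worker_access_configuration(source_ip=True),
    )


# Print the JSON that corresponds to --worker-access-configuration in the CLI.
print(json.dumps(build_worker_access_configuration(source_ip=True), indent=2))
```

Keeping the configuration in a small builder function makes it easy to reuse the same structure for both the create and update calls.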
Success criteria
While the IP-restricted presigned URLs feature provides enhanced security, there are scenarios where it might not be suitable. Understanding these limitations can help you make an informed decision about using the feature and verify that it aligns with your organization’s security needs and network configurations.
IP-restricted presigned URLs are effective in scenarios where there’s a consistent IP address used by the worker accessing SageMaker Ground Truth and the S3 object. For example, if a worker accesses labeling tasks from a stable public IP address, such as an office network with a fixed IP address, the IP restriction will provide access with enhanced security. Similarly, when a worker accesses both SageMaker Ground Truth and S3 objects through the same VPC endpoint, the IP restriction will verify that the presigned URL is only accessible from within this VPC. In both scenarios, the consistent IP address enables the IP-based access controls to function correctly, providing an additional layer of security.
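To make the consistency requirement concrete, the following Python sketch models the check conceptually. It is not how SageMaker Ground Truth or Amazon S3 implement the restriction. The function names are hypothetical, and the addresses come from ranges reserved for documentation.

```python
import ipaddress


def source_ip_matches(request_ip: str, allowed: str) -> bool:
    """Return True if request_ip is inside the allowed address or CIDR range."""
    network = ipaddress.ip_network(allowed, strict=False)
    return ipaddress.ip_address(request_ip) in network


def can_fetch_object(task_access_ip: str, s3_request_ip: str) -> bool:
    """Model a presigned URL bound to the IP that opened the labeling task."""
    return source_ip_matches(s3_request_ip, task_access_ip)


# Office network with a fixed public IP: both requests share one address.
print(can_fetch_object("203.0.113.10", "203.0.113.10"))

# Split-tunnel VPN: the S3 request leaves from a different address.
print(can_fetch_object("203.0.113.10", "198.51.100.7"))
```

The first call models the stable office network and succeeds. The second models a worker whose S3 request originates from a different address than the one used to open the task, so the request would be denied.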
Scenarios where IP-restricted presigned URLs aren’t effective
Scenario: Asymmetric VPC endpoints
Description: SageMaker Ground Truth is accessed through a public internet connection while Amazon S3 is accessed through a VPC endpoint, or vice versa.
Example: A worker accesses SageMaker Ground Truth through the public internet but S3 through a VPC endpoint.
Exit criteria: Verify that both SageMaker Ground Truth and S3 are accessed either entirely through the public internet or entirely through the same VPC endpoint.

Scenario: Network Address Translation (NAT) layers
Description: NAT layers can alter the source IP address of requests, causing IP mismatches. Issues can arise from dynamically assigned IP addresses or asymmetric configurations.
Examples:
  N-to-M IP translation, where multiple internal IP addresses are translated to multiple public IP addresses.
  A NAT gateway with multiple public IP addresses assigned to it, which can cause requests to appear from different IP addresses.
  Shared IP addresses, where multiple users’ traffic is routed through a single public IP address, making it difficult to enforce IP-based restrictions effectively.
Exit criteria: Verify that the NAT gateway is configured to preserve the source IP address. Validate the NAT configuration for consistency when accessing both SageMaker Ground Truth and S3 resources.

Scenario: Use of VPNs
Description: VPNs change the outgoing IP address, leading to potential access issues with IP-restricted presigned URLs.
Example: If a worker uses a split-tunnel VPN that routes requests to Ground Truth and S3 through different IP addresses, access might be denied.
Exit criteria: Disable the VPN or use a full-tunnel VPN that offers a consistent IP address for all requests.
Interface endpoints aren’t supported by the grant_read_access feature because of their inability to resolve public DNS names. This limitation is orthogonal to the IP restrictions and should be considered when configuring your network setup for accessing S3 objects with presigned URLs. In such cases, use the S3 Gateway endpoint when accessing S3 to verify compatibility with the public DNS names generated by grant_read_access.
Using S3 access logs for debugging
To debug issues related to IP-restricted presigned URLs, S3 access logs can provide valuable insights. By enabling access logging for your S3 bucket, you can track every request made to your S3 objects, including the IP addresses from which the requests originate. This can help you identify:
Mismatches between expected and actual IP addresses
Dynamic IP addresses or VPNs causing access issues
Unauthorized access from unexpected locations
To debug using S3 access logs, follow these steps:
Enable S3 access logging: Configure your bucket to deliver access logs to another bucket or a logging service such as Amazon CloudWatch Logs.
Review log files: Analyze the log files to identify patterns or anomalies in IP addresses, request timestamps, and error codes.
Look for IP address changes: If you observe frequent changes in IP addresses within the logs, it might indicate that the worker’s IP address is dynamic or altered by a VPN or proxy.
Check for NAT layer modifications: See if NAT layers are modifying the source IP address by checking the x-forwarded-for header in the log files.
Verify authorized access: Confirm that requests are coming from approved and consistent IP addresses by checking the Remote IP field in the log files.
By following these steps and analyzing the S3 access logs, you can validate that the presigned URLs are accessed only from approved and consistent IP addresses.
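The log review steps above can be partly automated. The following Python sketch parses the leading fields of S3 server access log records, groups the remote IP addresses seen per object key, and lists requests that returned HTTP 403. The regular expression covers only the fields up to the error code, and the sample records are synthetic; both are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Matches the leading fields of an S3 server access log record.
LOG_PATTERN = re.compile(
    r'^(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^\]]+)\] (?P<remote_ip>\S+) '
    r'(?P<requester>\S+) (?P<request_id>\S+) (?P<operation>\S+) (?P<key>\S+) '
    r'"(?P<request_uri>[^"]*)" (?P<status>\d{3}) (?P<error_code>\S+)'
)


def parse_line(line):
    """Return the parsed fields as a dict, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def remote_ips_by_key(lines):
    """Map each object key to the set of remote IPs that issued GET requests."""
    ips = defaultdict(set)
    for line in lines:
        record = parse_line(line)
        if record and record["operation"] == "REST.GET.OBJECT":
            ips[record["key"]].add(record["remote_ip"])
    return dict(ips)


def denied_requests(lines):
    """List (remote_ip, key, error_code) for requests that returned HTTP 403."""
    denied = []
    for line in lines:
        record = parse_line(line)
        if record and record["status"] == "403":
            denied.append((record["remote_ip"], record["key"], record["error_code"]))
    return denied


# Synthetic records: one allowed request and one denied from another address.
SAMPLE_LOG_LINES = [
    'ownerid example-bucket [06/Feb/2019:00:00:38 +0000] 192.0.2.3 - REQ1 '
    'REST.GET.OBJECT images/cat.jpg "GET /images/cat.jpg HTTP/1.1" 200 - '
    '2662992 2662992 70 10 "-" "Mozilla/5.0" -',
    'ownerid example-bucket [06/Feb/2019:00:01:02 +0000] 198.51.100.7 - REQ2 '
    'REST.GET.OBJECT images/cat.jpg "GET /images/cat.jpg HTTP/1.1" 403 '
    'AccessDenied 243 - 7 - "-" "Mozilla/5.0" -',
]

print(remote_ips_by_key(SAMPLE_LOG_LINES))
print(denied_requests(SAMPLE_LOG_LINES))
```

An object key that appears with more than one remote IP, together with a 403 for one of those addresses, is the kind of pattern that suggests a dynamic IP address, VPN, or NAT layer is interfering with the restriction.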
Conclusion
The introduction of IP-restricted presigned URLs in Amazon SageMaker Ground Truth significantly enhances the security of data accessed through the service. By allowing you to restrict access to specific IP addresses or VPC endpoints, this feature helps facilitate more fine-tuned control of presigned URLs. It provides organizations with added protection for their sensitive data, offering a valuable option for those with stringent security requirements. We encourage you to explore this new security feature to protect your organization’s data and enhance the overall security of your labeling workflows. To get started with SageMaker Ground Truth, visit Getting Started. To implement IP restrictions on presigned URLs as part of your workteam setup, refer to the CreateWorkteam and UpdateWorkteam API documentation. Follow the guidance provided in this blog to configure these security measures effectively. For more information or assistance, contact your AWS account team or visit the SageMaker community forums.
About the Authors
Sundar Raghavan is an AI/ML Specialist Solutions Architect at AWS, helping customers build scalable and cost-efficient AI/ML pipelines with Human in the Loop services. In his free time, Sundar loves traveling, sports and enjoying outdoor activities with his family.
Michael Borde is a lead software engineer at Amazon AI, where he has been for seven years. He previously studied mathematics and computer science at the University of Chicago. Michael is passionate about cloud computing, distributed systems design, and digital privacy & security. After work, you can often find Michael putzing around the local powerlifting gym in Capitol Hill.
Jacky Shum is a Software Engineer at AWS in the SageMaker Ground Truth team. He works to help AWS customers leverage machine learning applications, including prior work on ML-based fraud detection with Amazon Fraud Detector.
Rohith Kodukula is a Software Development Engineer on the SageMaker Ground Truth team. In his free time he enjoys staying active and reading up on anything that he finds mildly interesting (most things really).
Abhinay Sandeboina is an Engineering Manager at AWS Human In The Loop (HIL). He has been at AWS for over 2 years and his teams are responsible for managing ML platform services. He has a decade of experience in software/ML engineering building infrastructure platforms at scale. Prior to AWS, he worked in various engineering management roles at Zillow and Capital One.