Unlocking video understanding with TwelveLabs Marengo on Amazon Bedrock

Media and entertainment, advertising, education, and enterprise training content combines visual, audio, and motion elements to tell stories and convey information, making it far more complex than text, where individual words have clear meanings. This complexity creates unique challenges for AI systems that need to understand video. Video content is multidimensional, combining visual elements (scenes, objects, actions), temporal dynamics (motion, transitions), audio components (dialogue, music, sound effects), and text overlays (subtitles, captions). As a result, organizations struggle to search through video archives, locate specific scenes, categorize content automatically, and extract insights from their media assets for effective decision-making.

The TwelveLabs Marengo model addresses this problem with a multi-vector architecture that creates separate embeddings for different content modalities. Instead of forcing all information into one vector, the model generates specialized representations. This approach preserves the rich, multifaceted nature of video data, enabling more accurate analysis across visual, temporal, and audio dimensions.

Amazon Bedrock has expanded its capabilities to support the TwelveLabs Marengo Embed 3.0 model with real-time text and image processing through synchronous inference. With this integration, businesses can implement fast video search using natural language queries and support interactive product discovery through image similarity matching.

In this post, we’ll show how the TwelveLabs Marengo embedding model, available on Amazon Bedrock, enhances video understanding through multimodal AI. We’ll build a video semantic search and analysis solution that uses embeddings from the Marengo model with Amazon OpenSearch Serverless as the vector database, delivering search that goes beyond simple metadata matching to enable intelligent content discovery.

Understanding video embeddings

Embeddings are dense vector representations that capture the semantic meaning of data in a high-dimensional space. Think of them as numerical fingerprints that encode the essence of content in a way machines can understand and compare. For text, embeddings might capture that “king” and “queen” are related concepts, or that “Paris” and “France” have a geographical relationship. For images, embeddings can understand that a golden retriever and labrador are both dogs, even if they look different. The following heat map shows the semantic similarity scores between these sentence fragments: “two people having a conversation,” “a man and a woman talking,” and “cats and dogs are lovely animals.”
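To make semantic similarity concrete, the following minimal sketch compares embedding vectors with cosine similarity using NumPy. The vectors shown are hypothetical placeholders for illustration only; in practice you would use embeddings returned by the Marengo model, as shown later in this post.

import numpy as np

def cosine_similarity(a, b):
    """Return the cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings for illustration only
emb_conversation = [0.12, 0.87, 0.33]   # "two people having a conversation"
emb_talking      = [0.10, 0.85, 0.31]   # "a man and a woman talking"
emb_pets         = [0.91, 0.05, 0.44]   # "cats and dogs are lovely animals"

print(cosine_similarity(emb_conversation, emb_talking))  # high score: related meaning
print(cosine_similarity(emb_conversation, emb_pets))     # low score: unrelated meaning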

Video embeddings challenges

Video presents unique challenges because it’s inherently multimodal:

  • Visual information: Objects, scenes, people, actions, and visual aesthetics
  • Audio information: Speech, music, sound effects, and ambient noise
  • Textual information: Captions, on-screen text, and transcribed speech

Traditional single-vector approaches compress all of this rich information into one representation, often losing important nuances. TwelveLabs Marengo takes a different approach that addresses this challenge effectively.

TwelveLabs Marengo: A multimodal embedding model

The Marengo 3.0 model generates multiple specialized vectors, each capturing different aspects of the video content. A typical movie or TV show combines visual and auditory elements to create a unified storytelling experience. Marengo’s multi-vector architecture provides significant advantages for understanding this complex video content. Each vector captures a specific modality, avoiding information loss from compressing diverse data types into single representations. This enables flexible searches targeting specific content aspects—visual-only, audio-only, or combined queries. Specialized vectors deliver superior accuracy in complex multimodal scenarios while maintaining efficient scalability for large enterprise video datasets.

Solution overview: Marengo model capabilities

In the following sections, we’ll demonstrate Marengo’s embedding technology through code samples. The examples illustrate how Marengo processes different types of content and delivers accurate search results. The complete code sample can be found in this GitHub repository.

Prerequisites

Before we begin, verify you have the following:

  • An AWS account with Amazon Bedrock access in an AWS Region where the TwelveLabs Marengo Embed 3.0 model is available, with model access enabled
  • An Amazon S3 bucket for the input video and the model output
  • Permissions to create and use Amazon OpenSearch Serverless collections
  • A Python environment with the boto3 and opensearch-py packages installed

Sample video

Netflix Open Content is a collection of open source content available under the Creative Commons Attribution 4.0 International license. We will use one of its videos, Meridian, to demonstrate the TwelveLabs Marengo model on Amazon Bedrock.

Create a video embedding

Amazon Bedrock uses an asynchronous API for Marengo video embedding generation. The following Python code snippet shows an example of invoking the API with a video stored in an S3 bucket. Refer to the documentation for the complete supported functionality.

import boto3

# Create an Amazon Bedrock Runtime client and set the Marengo model ID
bedrock_client = boto3.client("bedrock-runtime")
model_id = "us.twelvelabs.marengo-embed-3-0-v1:0"

video_s3_uri = "<s3 bucket location for the video>"        # Replace with your S3 URI
aws_account_id = "<the AWS account owner for the bucket>"  # Replace with the bucket owner account ID
s3_bucket_name = "<s3 bucket name>"                        # Replace with the output S3 bucket name
s3_output_prefix = "<output prefix>"                       # Replace with the output prefix

# Start an asynchronous invocation that reads the video from S3
# and writes the generated embeddings to the output S3 location
response = bedrock_client.start_async_invoke(
    modelId=model_id,
    modelInput={
        "inputType": "video",
        "video": {
            "mediaSource": {
                "s3Location": {
                    "uri": video_s3_uri,
                    "bucketOwner": aws_account_id
                }
            }
        }
    },
    outputDataConfig={
        "s3OutputDataConfig": {
            "s3Uri": f"s3://{s3_bucket_name}/{s3_output_prefix}"
        }
    }
)

The preceding example produces 280 individual embeddings from a single video, one for each segment, enabling precise temporal search and analysis. The multi-vector output for the video can contain the following embedding types:

[
  {"embedding": [0.053192138671875, ...], "embeddingOption": "visual", "embeddingScope": "clip", "startSec": 0.0, "endSec": 4.3},
  {"embedding": [0.053192138645645, ...], "embeddingOption": "transcription", "embeddingScope": "clip", "startSec": 3.9, "endSec": 6.5},
  {"embedding": [0.323555443524, ...], "embeddingOption": "audio", "embeddingScope": "clip", "startSec": 4.9, "endSec": 7.5}
]
  • visual – visual embeddings of the video
  • transcription – embeddings of the transcribed text
  • audio – embeddings of the audio in the video
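Because start_async_invoke returns immediately, you retrieve the embeddings once the job finishes. The following is a minimal sketch (not part of the original snippet) that polls the invocation status with get_async_invoke and then loads the result from the output S3 location; it assumes the job writes the embedding list to an output.json object under the configured prefix, so verify the exact output layout in the Amazon Bedrock documentation.

import json
import time

s3_client = boto3.client("s3")

# The invocation ARN returned by start_async_invoke in the previous snippet
invocation_arn = response["invocationArn"]

# Poll until the asynchronous job completes or fails
while True:
    job = bedrock_client.get_async_invoke(invocationArn=invocation_arn)
    if job["status"] in ("Completed", "Failed"):
        break
    time.sleep(10)

video_embeddings = []
if job["status"] == "Completed":
    # Assumption: the result is written as an output.json object somewhere under
    # the configured S3 output prefix (for example, in an invocation-specific subfolder)
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=s3_bucket_name, Prefix=s3_output_prefix):
        for item in page.get("Contents", []):
            if item["Key"].endswith("output.json"):
                obj = s3_client.get_object(Bucket=s3_bucket_name, Key=item["Key"])
                video_embeddings = json.loads(obj["Body"].read())["data"]

print(f"Retrieved {len(video_embeddings)} embeddings")

The video_embeddings list loaded here has the structure shown above and is what we index into OpenSearch later in this post.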

When processing audio or video content, you can configure how long each clip segment should be for embedding creation. By default, video clips are divided automatically at natural scene changes (shot boundaries). Audio clips are split into even segments that are as close to 10 seconds as possible; for example, a 50-second audio file becomes five 10-second segments, while a 16-second file becomes two 8-second segments. By default, a single Marengo video embedding request generates visual, audio, and transcription embeddings. You can change the default settings to output only specific embedding types. Use the following code snippet to generate embeddings for a video with configurable options through the Amazon Bedrock API:

response = bedrock_client.start_async_invoke(
    modelId=model_id,
    modelInput={
        "inputType": "video",
        "video": {
            "mediaSource": {
                # Provide exactly one of base64String or s3Location
                "s3Location": {
                    "uri": "s3://amzn-s3-demo-bucket/video/clip.mp4",
                    "bucketOwner": "123456789012"
                }
            },
            "startSec": 0,
            "endSec": 6,
            "segmentation": {
                # Use exactly one method: "dynamic" (shot boundaries) or "fixed"
                "method": "dynamic",
                "dynamic": {
                    "minDurationSec": 4
                }
                # For fixed-length segments instead:
                # "method": "fixed",
                # "fixed": {"durationSec": 6}
            },
            "embeddingOption": [
                "visual",
                "audio",
                "transcription"
            ],  # optional, default is all
            "embeddingScope": [
                "clip",
                "asset"
            ]  # optional, one or both
        },
        "inferenceId": "some inference id"  # optional identifier
    },
    outputDataConfig={
        "s3OutputDataConfig": {
            "s3Uri": f"s3://{s3_bucket_name}/{s3_output_prefix}"
        }
    }
)

Vector database: Amazon OpenSearch Serverless

In our example, we’ll use Amazon OpenSearch Serverless as the vector database for storing the text, image, audio, and video embeddings generated with the Marengo model. As a vector database, OpenSearch Serverless lets you quickly find similar content using semantic search without managing servers or infrastructure. The following code snippet demonstrates how to create an Amazon OpenSearch Serverless collection:

from pprint import pprint

# collection_name: the name of your vector search collection
boto3_session = boto3.Session()
aoss_client = boto3_session.client('opensearchserverless')

try:
    # Create a vector search collection for storing the embeddings
    collection = aoss_client.create_collection(
        name=collection_name, type='VECTORSEARCH'
    )
    collection_id = collection['createCollectionDetail']['id']
    collection_arn = collection['createCollectionDetail']['arn']
except aoss_client.exceptions.ConflictException:
    # The collection already exists; reuse it
    collection = aoss_client.batch_get_collection(
        names=[collection_name]
    )['collectionDetails'][0]
    pprint(collection)
    collection_id = collection['id']
    collection_arn = collection['arn']
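Before the create_collection call above can succeed, an encryption policy for the collection name typically needs to exist, and you also need network and data access policies before you can index or query data. The following is a minimal sketch, assuming an AWS owned KMS key, public network access, and a single IAM principal (principal_arn is a placeholder you would replace); adjust the policies to your own security requirements.

import json

# Hypothetical placeholder for the IAM role or user that will access the collection
principal_arn = "arn:aws:iam::123456789012:role/MarengoDemoRole"

# Encryption policy (required before the collection can be created); policy names must be lowercase
aoss_client.create_security_policy(
    name=f"{collection_name}-enc",
    type="encryption",
    policy=json.dumps({
        "Rules": [{"ResourceType": "collection", "Resource": [f"collection/{collection_name}"]}],
        "AWSOwnedKey": True
    })
)

# Network policy (here: allow public access for the demo)
aoss_client.create_security_policy(
    name=f"{collection_name}-net",
    type="network",
    policy=json.dumps([{
        "Rules": [{"ResourceType": "collection", "Resource": [f"collection/{collection_name}"]}],
        "AllowFromPublic": True
    }])
)

# Data access policy granting collection and index permissions to the principal
aoss_client.create_access_policy(
    name=f"{collection_name}-access",
    type="data",
    policy=json.dumps([{
        "Rules": [
            {"ResourceType": "collection", "Resource": [f"collection/{collection_name}"],
             "Permission": ["aoss:*"]},
            {"ResourceType": "index", "Resource": [f"index/{collection_name}/*"],
             "Permission": ["aoss:*"]}
        ],
        "Principal": [principal_arn]
    }])
)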

Once the OpenSearch Serverless collection is created, we’ll create an index that contains properties, including a vector field:

from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

# OpenSearch vector fields use the knn_vector type
index_mapping = {
    "settings": {
        "index.knn": True
    },
    "mappings": {
        "properties": {
            "video_id": {"type": "keyword"},
            "segment_id": {"type": "integer"},
            "start_time": {"type": "float"},
            "end_time": {"type": "float"},
            "embedding_option": {"type": "keyword"},
            "embedding": {
                "type": "knn_vector",
                "dimension": 1024,
                "method": {
                    "name": "hnsw",
                    "engine": "nmslib",
                    "space_type": "cosinesimil"
                }
            },
            "metadata": {"type": "object"}
        }
    }
}

# Sign requests with SigV4 credentials for the OpenSearch Serverless (aoss) service
credentials = boto3.Session().get_credentials()
awsauth = AWSV4SignerAuth(credentials, region_name, 'aoss')  # region_name: your AWS Region

# host is the collection endpoint (collection['collectionEndpoint']) without the https:// prefix
oss_client = OpenSearch(
    hosts=[{'host': host, 'port': 443}],
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
    timeout=300
)

response = oss_client.indices.create(index=index_name, body=index_mapping)

Index Marengo embeddings

The following code snippet demonstrates how to ingest the embedding output from the Marengo model into the OpenSearch index:

import json

# Build one document per video segment returned by the Marengo model
documents = []
for i, segment in enumerate(video_embeddings):
    document = {
        "embedding": segment["embedding"],
        "start_time": segment["startSec"],
        "end_time": segment["endSec"],
        "video_id": video_id,
        "segment_id": i,
        "embedding_option": segment.get("embeddingOption", "visual")
    }
    documents.append(document)

# Build the bulk request: an action line followed by the document for each entry
bulk_data = []
for doc in documents:
    bulk_data.append({"index": {"_index": index_name}})
    bulk_data.append(doc)

# Convert to newline-delimited JSON (bulk format)
bulk_body = "\n".join(json.dumps(item) for item in bulk_data) + "\n"
response = oss_client.bulk(body=bulk_body, index=index_name)
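The bulk API reports per-document failures in its response body rather than raising an exception, so it is worth checking the errors flag. A minimal check might look like the following.

# Flag any documents that failed to index
if response.get("errors"):
    failed = [item["index"] for item in response["items"] if item["index"].get("error")]
    print(f"{len(failed)} documents failed to index")
else:
    print(f"Indexed {len(documents)} segments successfully")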

Cross-modal semantic search

With Marengo’s multi-vector design, you can search across modalities in ways that are impossible with single-vector models. By creating separate but aligned embeddings for visual, audio, motion, and contextual elements, you can search videos using an input type of your choice. For example, the text query "jazz music playing" returns video clips of musicians performing, jazz audio tracks, and concert hall scenes.

The following examples demonstrate Marengo’s search capabilities across different modalities.

Text search

Here’s a code snippet that demonstrates the cross-modal semantic search capability using text:

text_query = "a person smoking in a room"
top_k = 5  # number of nearest segments to return

# Generate a text embedding with the Marengo model (synchronous inference)
modelInput = {
    "inputType": "text",
    "text": {
        "inputText": text_query
    }
}
response = bedrock_client.invoke_model(
    modelId="us.twelvelabs.marengo-embed-3-0-v1:0",
    body=json.dumps(modelInput)
)

result = json.loads(response["body"].read())
query_embedding = result["data"][0]["embedding"]

# Search the OpenSearch index with a k-NN query
search_body = {
    "query": {
        "knn": {
            "embedding": {
                "vector": query_embedding,
                "k": top_k
            }
        }
    },
    "size": top_k,
    "_source": ["start_time", "end_time", "video_id", "segment_id"]
}

response = oss_client.search(index=index_name, body=search_body)

print(f"\n✅ Found {len(response['hits']['hits'])} matching segments:")

results = []
for hit in response['hits']['hits']:
    result = {
        "score": hit["_score"],
        "video_id": hit["_source"]["video_id"],
        "segment_id": hit["_source"]["segment_id"],
        "start_time": hit["_source"]["start_time"],
        "end_time": hit["_source"]["end_time"]
    }
    results.append(result)

The top search result for the text query "a person smoking in a room" is the following video clip:
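Because each indexed document stores its embedding_option, you can also restrict a search to a single modality. The following is a minimal sketch (not part of the original sample) that combines the k-NN clause with a term filter so only visual embeddings are matched; depending on your OpenSearch version, you may prefer the knn query's built-in filter parameter, and note that filtering after the k-NN match can return fewer than k results.

# Restrict the k-NN search to segments whose embeddings came from the visual modality
filtered_search_body = {
    "query": {
        "bool": {
            "must": [
                {
                    "knn": {
                        "embedding": {
                            "vector": query_embedding,
                            "k": top_k
                        }
                    }
                }
            ],
            "filter": [
                {"term": {"embedding_option": "visual"}}
            ]
        }
    },
    "size": top_k,
    "_source": ["start_time", "end_time", "video_id", "segment_id", "embedding_option"]
}

response = oss_client.search(index=index_name, body=filtered_search_body)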

Image search

The following code snippet demonstrates the cross-modal semantic search capability for a given image:

# S3 location of the query image (s3_images_path and image_path_basename are placeholders)
s3_image_uri = f"s3://{s3_bucket_name}/{s3_images_path}/{image_path_basename}"

# Generate an image embedding with the Marengo model (synchronous inference)
modelInput = {
    "inputType": "image",
    "image": {
        "mediaSource": {
            "s3Location": {
                "uri": s3_image_uri,
                "bucketOwner": aws_account_id
            }
        }
    }
}
response = bedrock_client.invoke_model(
    modelId=model_id,
    body=json.dumps(modelInput)
)

result = json.loads(response["body"].read())
query_embedding = result["data"][0]["embedding"]

# Search the OpenSearch index with a k-NN query
search_body = {
    "query": {
        "knn": {
            "embedding": {
                "vector": query_embedding,
                "k": top_k
            }
        }
    },
    "size": top_k,
    "_source": ["start_time", "end_time", "video_id", "segment_id"]
}

response = oss_client.search(index=index_name, body=search_body)

print(f"\n✅ Found {len(response['hits']['hits'])} matching segments:")

results = []
for hit in response['hits']['hits']:
    result = {
        "score": hit["_score"],
        "video_id": hit["_source"]["video_id"],
        "segment_id": hit["_source"]["segment_id"],
        "start_time": hit["_source"]["start_time"],
        "end_time": hit["_source"]["end_time"]
    }
    results.append(result)

The top search result from the image above yields the following video clip:

In addition to semantic search over the video using text and images, the Marengo model can also search videos using audio embeddings that focus on dialogue and speech. These audio search capabilities help users find videos based on specific speakers, dialogue content, or spoken topics, creating a comprehensive video search experience that combines text, image, and audio for video understanding.
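The sample code in this post does not include an audio query, but the pattern mirrors the video example. The following is a minimal sketch that assumes audio, like video, is processed through the asynchronous API with the same mediaSource structure (s3_audio_uri is a placeholder); verify the exact request shape against the Amazon Bedrock model documentation.

# Hypothetical S3 location of a query audio clip
s3_audio_uri = f"s3://{s3_bucket_name}/audio/query-clip.mp3"

# Assumption: audio follows the asynchronous invocation path used for video
response = bedrock_client.start_async_invoke(
    modelId=model_id,
    modelInput={
        "inputType": "audio",
        "audio": {
            "mediaSource": {
                "s3Location": {
                    "uri": s3_audio_uri,
                    "bucketOwner": aws_account_id
                }
            }
        }
    },
    outputDataConfig={
        "s3OutputDataConfig": {
            "s3Uri": f"s3://{s3_bucket_name}/{s3_output_prefix}"
        }
    }
)
# After the job completes, read the embedding from the S3 output (as shown earlier)
# and run the same k-NN search against the OpenSearch index.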

Conclusion

The combination of TwelveLabs Marengo and Amazon Bedrock opens up exciting new possibilities for video understanding through its multi-vector, multimodal approach. Throughout this post, we explored practical examples like image-to-video search with temporal precision and detailed text-to-video matching. With a single Amazon Bedrock API call, we transformed one video file into hundreds of searchable segments that respond to text, visual, and audio queries. These capabilities create opportunities for natural language content discovery, streamlined media asset management, and other applications that can help organizations better understand and utilize their video content at scale.

As video continues to dominate digital experiences, models like Marengo provide a solid foundation for building more intelligent video analysis systems. Check out the sample code and discover how multimodal video understanding can transform your applications.


About the authors

Wei Teh is a Machine Learning Solutions Architect at AWS. He is passionate about helping customers achieve their business objectives using cutting-edge machine learning solutions. Outside of work, he enjoys outdoor activities like camping, fishing, and hiking with his family.

Lana Zhang is a Senior Specialist Solutions Architect for Generative AI at AWS within the Worldwide Specialist Organization. She specializes in AI/ML, with a focus on use cases such as AI voice assistants and multimodal understanding. She works closely with customers across diverse industries, including media and entertainment, gaming, sports, advertising, financial services, and healthcare, to help them transform their business solutions through AI.

Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.
