CLaMR: Contextualized Late-Interaction for Multimodal Content Retrieval
David Wan, Han Wang, Elias Stengel-Eskin, Jaemin Cho, Mohit Bansal

TL;DR
CLaMR is a multimodal video retriever that jointly encodes video frames, transcribed speech, on-screen text, and metadata with a unified backbone and learns to select the modalities relevant to each query. Trained on a large synthetic dataset, it outperforms existing retrievers and improves downstream long-video question answering.
Contribution
Introduces CLaMR, a late-interaction multimodal retriever with a unified backbone and modality-aware training, enabling dynamic modality selection for improved video retrieval.
Findings
CLaMR outperforms existing retrievers by 25.6 nDCG@10 on MultiVENT 2.0++.
CLaMR achieves a 35.4-point nDCG@10 improvement over multi-modality baselines.
Demonstrates utility in long-video question answering with notable performance boosts.
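For readers unfamiliar with the metric behind these numbers, the following is a minimal sketch of nDCG@10, not the paper's evaluation code. It assumes the linear-gain variant of DCG and assumes that `relevances` lists the graded relevance labels of all judged items in ranked order; the function names are illustrative. Gains such as 25.6 are presumably reported on a 0 to 100 scale, whereas this sketch returns values between 0 and 1.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k ranked items (linear gain)."""
    # rank is 0-based, so the discount for the first item is log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant items at all, so the score is defined as 0
    return dcg_at_k(relevances, k) / ideal_dcg
```

A ranking that places every relevant item first scores 1.0, and pushing a relevant item down the list lowers the score through the logarithmic discount.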
Abstract
Online video web content is richly multimodal: a single video blends vision, speech, ambient audio, and on-screen text. Retrieval systems typically treat these modalities as independent retrieval sources, which can lead to noisy and subpar retrieval. We explore multimodal video content retrieval, where relevance can be scored from one particular modality or jointly across multiple modalities simultaneously. Consequently, an effective retriever must dynamically choose which modality (or set of modalities) best addresses the query. We introduce CLaMR, a multimodal, late-interaction retriever that jointly indexes 4 modalities: video frames, transcribed speech, on-screen text, and metadata. CLaMR jointly encodes all modalities with a unified multimodal backbone for improved contextualization and is trained to enhance dynamic modality selection via two key innovations. First, given the lack…
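The abstract does not spell out the scoring function in this excerpt, so the following is only a sketch of generic ColBERT-style late interaction (MaxSim) applied to token embeddings pooled from several modalities. The function names, the toy two-dimensional embeddings, and the modality keys are all assumptions made for illustration; CLaMR's actual scoring and training may differ.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot-product similarity between two token embeddings."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Late interaction: each query token keeps its best-matching document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def score_joint(query_tokens: List[Vector], doc_by_modality: Dict[str, List[Vector]]) -> float:
    """Pool the tokens of every modality into one document, then apply MaxSim."""
    all_tokens = [t for tokens in doc_by_modality.values() for t in tokens]
    return maxsim_score(query_tokens, all_tokens)


def best_modality_per_query_token(
    query_tokens: List[Vector], doc_by_modality: Dict[str, List[Vector]]
) -> List[str]:
    """Report which modality supplied the winning match for each query token."""
    winners = []
    for q in query_tokens:
        best = max(
            ((dot(q, d), name) for name, tokens in doc_by_modality.items() for d in tokens),
            key=lambda pair: pair[0],
        )
        winners.append(best[1])
    return winners


# Toy example (hypothetical embeddings, at least one token per document).
query = [[1.0, 0.0], [0.0, 1.0]]
video = {
    "frames": [[0.9, 0.1]],
    "speech": [[0.2, 0.8]],
    "ocr": [[0.1, 0.1]],
    "metadata": [[0.5, 0.5]],
}
print(score_joint(query, video))
print(best_modality_per_query_token(query, video))
```

Because the maximum is taken per query token over all pooled tokens, different query tokens can be answered by different modalities, which is one simple way to picture the dynamic modality selection the abstract describes.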
Peer Reviews
Decision: Submitted to ICLR 2026
The joint encoding of multiple modalities and the modality-wise late interaction balance fine-grained token-level matching with computational efficiency, addressing the underuse of late interaction in multimodal video retrieval. The paper evaluates on multiple benchmarks and conducts extensive ablations, which supports the reliability of its conclusions.
The paper focuses on only four modalities and does not explore other critical modalities in video content. Evaluations are primarily based on event-centric and general video datasets. Although the joint encoding backbone aims to align modalities, the paper lacks an analysis of alignment failures in temporally or semantically misaligned video content. The model's performance relies heavily on high-quality ASR transcripts and OCR text, making it vulnerable to low-resource scenarios where these modalities are no
1. Unlike baselines that encode modalities separately, the proposed model uses a unified backbone for cross-modal contextualization. 2. The proposed MULTIVENT 2.0++ fills the gap in modality-specific training data, supporting effective learning of modality selection. 3. The proposed model consistently outperforms baselines on both MULTIVENT 2.0++ and MSR-VTT.
1. Performance drops noticeably when vision is the only modality relied on, showing weaker handling of visual-only signals. Why is this? Why is "vision the least informative," as stated in Line 430? Could this hurt most practical scenarios (given that vision is the most common and readily available modality)? 2. Primary evaluations focus on MULTIVENT 2.0++ and MSRVTT; tests on other multimodal benchmarks (e.g., MSVD, DiDeMo, ActivityNet) are limited, which reduces the evidence for generalizability. 3
1. The paper tackles a well-recognized and challenging problem in multimodal retrieval: how to effectively combine signals from diverse and potentially noisy sources without performance degradation. 2. The application of late-interaction mechanisms from the text domain to a complex multimodal video scenario is a logical and interesting direction. It provides an alternative to more common early-fusion or simple late-fusion (score averaging) techniques. 3. The authors make a notable effort to add
1. The core technical contribution can be viewed as an application of the existing ColBERT architecture to a multimodal setting using a standard VLM backbone. While the engineering is non-trivial, the conceptual novelty is somewhat incremental, as it primarily combines and adapts existing components rather than introducing a fundamentally new retrieval paradigm. 2. The most impressive results (e.g., +25.6 nDCG@10) are reported on the authors' own synthetic dataset, MULTIVENT 2.0++. This raises
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques
