Adapting MLLMs for Nuanced Video Retrieval

Piyush Bagad; Andrew Zisserman

arXiv:2512.13511·cs.CV·May 4, 2026

TL;DR

This paper develops a unified multimodal embedding model for nuanced video retrieval. Trained on text alone, the model captures temporal, negation, and multimodal nuances and achieves state-of-the-art results.

Contribution

It introduces an MLLM-based embedding model, contrastively fine-tuned on text alone, that handles complex video retrieval nuances without requiring video data during training.

Findings

01 Achieves state-of-the-art performance on nuanced video retrieval benchmarks.

02 Text-only training reduces the modality gap between text and video embeddings.

03 Effectively captures temporal, negation, and multimodal nuances in retrieval tasks.

Abstract

Our objective is to build an embedding model that captures the nuanced relationship between a search query and candidate videos. We cover three aspects of nuanced retrieval: (i) temporal, (ii) negation, and (iii) multimodal. For temporal nuance, we consider chiral actions, which require distinguishing between temporally opposite actions such as "opening a door" vs. "closing a door". For negation, we consider queries with negators such as "not" and "none" that allow the user to specify what they do not want. For multimodal nuance, we consider the task of composed retrieval, where the query comprises a video along with a text edit instruction. The goal is to develop a unified embedding model that handles such nuances effectively. To that end, we repurpose a Multimodal Large Language Model (MLLM) trained to generate text into an embedding model. We fine-tune it with a contrastive loss on text alone with…
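To make the idea of a contrastive loss concrete, here is a minimal sketch of a generic InfoNCE-style objective over paired embeddings. It is an illustration only: the abstract is truncated before it describes the loss in detail, so the function names (`l2_normalize`, `info_nce`), the temperature value, and the use of in-batch negatives are all assumptions, not the authors' formulation.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.07):
    """Generic InfoNCE loss (assumed form, not taken from the paper).

    anchors[i] and positives[i] are embeddings of a matching pair; every
    other positives[j] in the batch acts as a negative for anchors[i].
    Returns the mean cross-entropy of picking the matching index.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, a_i in enumerate(a):
        # Cosine similarity of anchor i to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(a_i, p_j)) / temperature for p_j in p]
        # Numerically stable log-sum-exp over the candidates.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(a)


# Toy embeddings standing in for two text pairs (hypothetical values).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = info_nce(anchors, aligned)
loss_swapped = info_nce(anchors, swapped)
print(loss_aligned < loss_swapped)
```

The loss is small when each anchor is closest to its own pair and large when the pairs are mismatched, which is the pressure that would push an embedding model to separate near-identical captions such as "opening a door" and "closing a door".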

Code & Models

Models

bpiyush/TARA · model · 26 dl
