TL;DR
This paper develops a unified multimodal embedding model for nuanced video retrieval. Trained on text alone, the model captures temporal, negation, and multimodal nuances and achieves state-of-the-art results.
Contribution
It introduces an MLLM-based embedding model, fine-tuned with a contrastive loss on text alone, that handles complex video retrieval nuances without requiring video data during training.
Findings
Achieves state-of-the-art performance on nuanced video retrieval benchmarks.
Text-only training reduces the modality gap between text and video embeddings.
Effectively captures temporal, negation, and multimodal nuances in retrieval tasks.
Abstract
Our objective is to build an embedding model that captures the nuanced relationship between a search query and candidate videos. We cover three aspects of nuanced retrieval: (i) temporal, (ii) negation, and (iii) multimodal. For temporal nuance, we consider chiral actions, which require distinguishing between temporally opposite actions such as "opening a door" vs. "closing a door". For negation, we consider queries with negators such as "not" and "none" that allow the user to specify what they do not want. For multimodal nuance, we consider the task of composed retrieval, where the query comprises a video along with a text edit instruction. The goal is to develop a unified embedding model that handles such nuances effectively. To that end, we repurpose a Multimodal Large Language Model (MLLM) trained to generate text into an embedding model. We fine-tune it with a contrastive loss on text alone with…
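The abstract mentions fine-tuning with a contrastive loss but, as excerpted here, does not give its exact form. As a rough illustration only, the sketch below implements a generic in-batch InfoNCE loss in plain Python. The function names, the temperature value of 0.07, and the toy two-dimensional embeddings are assumptions made for this example, not details taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(queries, positives, temperature=0.07):
    """Mean in-batch InfoNCE loss (generic form; the paper's exact loss is not shown here).

    queries[i] is paired with positives[i]; every other row of `positives`
    serves as a negative for query i. The temperature is an assumed value.
    """
    q = [l2_normalize(v) for v in queries]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of query i to every candidate, scaled by temperature.
        logits = [sum(a * b for a, b in zip(qi, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        # Cross-entropy with the matching index as the target class.
        total += log_denominator - logits[i]
    return total / len(q)


# Hypothetical toy embeddings standing in for two opposite captions,
# e.g. "opening a door" and "closing a door".
anchors = [[1.0, 0.0], [0.0, 1.0]]
matched = [[1.0, 0.0], [0.0, 1.0]]   # each anchor paired with its own caption
swapped = [[0.0, 1.0], [1.0, 0.0]]   # each anchor paired with the opposite caption

loss_matched = info_nce(anchors, matched)
loss_swapped = info_nce(anchors, swapped)
```

In this toy setup the loss is close to zero when each anchor is paired with its own caption and large when the pairs are swapped, which is the pressure that pushes temporally opposite or negated descriptions apart in embedding space.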
