VeRVE: Versatile Retrieval for Videos via Unified Embeddings
Shaunak Halbe, Bhagyashree Puranik, Jayakrishnan Unnikrishnan, Kushan Thakkar, Vimal Bhat, Toufiq Parag

TL;DR
VeRVE is a versatile, MLLM-based video retrieval framework that unifies corpus, moment, and multimodal query handling, surpassing existing methods in zero-shot and reranked retrieval tasks.
Contribution
Introduces VeRVE, a unified MLLM-based model trained with LoRA that achieves state-of-the-art zero-shot and reranked video retrieval performance.
Findings
Outperforms other MLLM-based methods on zero-shot video retrieval.
Achieves competitive zero-shot moment retrieval results.
Achieves state-of-the-art results for zero-shot composed video retrieval.
Abstract
Modern video retrieval systems are expected to handle diverse tasks, ranging from corpus-level retrieval and fine-grained moment localization to flexible multimodal querying. Specialized architectures achieve strong retrieval performance by training modality-specific encoders on massive datasets, but they cannot process composed multimodal queries. In contrast, multimodal LLM (MLLM)-based methods support rich multimodal search, but their retrieval performance remains well below that of specialized systems. We present VeRVE, an MLLM-based versatile video retrieval framework that integrates corpus-level and moment-level retrieval capabilities while accommodating composed multimodal queries within a single architecture. We contrastively align visual and textual embeddings generated by a shared MLLM backbone to enable efficient embedding-based candidate search. Our…
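To make the phrase "contrastive alignment of visual and textual embeddings" concrete, the following is a minimal sketch of one common way such alignment is set up: a symmetric InfoNCE loss over a batch of matched text/video pairs, followed by cosine-similarity search over a corpus. The abstract excerpt does not state which loss VeRVE uses, so the loss form, the temperature of 0.07, and all function names here are assumptions for illustration only. Plain Python lists stand in for the embeddings that an MLLM backbone would produce.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(text_embs, video_embs):
    # Entry [i][j] is the cosine similarity between text i and video j.
    t = [l2_normalize(v) for v in text_embs]
    g = [l2_normalize(v) for v in video_embs]
    return [[sum(a * b for a, b in zip(ti, gj)) for gj in g] for ti in t]

def _cross_entropy_diag(logits):
    # Mean cross-entropy where the correct class for row i is column i.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        total += lse - row[i]
    return total / len(logits)

def symmetric_info_nce(text_embs, video_embs, temperature=0.07):
    # Assumed loss: average of text-to-video and video-to-text InfoNCE.
    # Pair i in text_embs is taken to match pair i in video_embs.
    sim = similarity_matrix(text_embs, video_embs)
    logits = [[s / temperature for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))

def top_k(query_emb, corpus_embs, k=1):
    # Embedding-based candidate search: rank corpus items by cosine similarity.
    scores = similarity_matrix([query_emb], corpus_embs)[0]
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]
```

In this sketch, a batch whose text and video embeddings are correctly paired yields a lower loss than the same batch with the videos shuffled, which is the signal that pulls matched pairs together during training. Once the embeddings are aligned, retrieval reduces to a nearest-neighbour lookup such as `top_k`, which in practice would be backed by an approximate nearest-neighbour index rather than a full scan.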
