VeRVE: Versatile Retrieval for Videos via Unified Embeddings

Shaunak Halbe; Bhagyashree Puranik; Jayakrishnan Unnikrishnan; Kushan Thakkar; Vimal Bhat; Toufiq Parag

arXiv:2601.12193·cs.CV·April 20, 2026

TL;DR

VeRVE is a versatile, MLLM-based video retrieval framework that unifies corpus, moment, and multimodal query handling, surpassing existing methods in zero-shot and reranked retrieval tasks.

Contribution

Introduces VeRVE, a unified MLLM-based model trained with LoRA that achieves state-of-the-art zero-shot and reranked video retrieval performance.

Findings

01

Outperforms other MLLM-based methods on zero-shot video retrieval.

02

Achieves competitive zero-shot moment retrieval results.

03

Achieves state-of-the-art results on zero-shot composed video retrieval.

Abstract

Modern video retrieval systems are expected to handle diverse tasks, ranging from corpus-level retrieval and fine-grained moment localization to flexible multimodal querying. Specialized architectures achieve strong retrieval performance by training modality-specific encoders on massive datasets, but they cannot process composed multimodal queries. In contrast, multimodal LLM (MLLM)-based methods support rich multimodal search, but their retrieval performance remains well below that of specialized systems. We present VeRVE, an MLLM-based versatile video retrieval framework that integrates corpus-level and moment-level retrieval capabilities while accommodating composed multimodal queries within a single architecture. We contrastively align visual and textual embeddings generated by a shared MLLM backbone to enable efficient embedding-based candidate search. Our…
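To make the idea of contrastive alignment followed by embedding-based candidate search concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the abstract does not specify the loss, so a symmetric InfoNCE objective is assumed, and the function names, the temperature of 0.07, and the toy vectors are all hypothetical choices made for illustration.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(text_embs, video_embs):
    """Cosine similarity between every text embedding and every video embedding."""
    t = [l2_normalize(e) for e in text_embs]
    v = [l2_normalize(e) for e in video_embs]
    return [[sum(a * b for a, b in zip(ti, vj)) for vj in v] for ti in t]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def symmetric_info_nce(text_embs, video_embs, temperature=0.07):
    """Assumed loss: average of text-to-video and video-to-text InfoNCE.

    Pair i of text_embs and video_embs is treated as the positive pair;
    all other pairs in the batch act as negatives.
    """
    sims = similarity_matrix(text_embs, video_embs)
    logits = [[s / temperature for s in row] for row in sims]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

def top_k(query_emb, corpus_embs, k=2):
    """Candidate search: indices of the k corpus items most similar to the query."""
    sims = similarity_matrix([query_emb], corpus_embs)[0]
    return sorted(range(len(sims)), key=lambda j: sims[j], reverse=True)[:k]

# Toy data (hypothetical): two captions and two video embeddings in 2-D.
text = [[1.0, 0.0], [0.0, 1.0]]
video = [[0.9, 0.1], [0.1, 0.9]]

aligned_loss = symmetric_info_nce(text, video)
misaligned_loss = symmetric_info_nce(text, video[::-1])
best_match = top_k([1.0, 0.0], video, k=1)
```

In this toy setup the loss is lower when matching text and video embeddings sit at the same index than when the videos are swapped, which is the property that makes nearest-neighbor search over the shared embedding space usable for retrieving candidates.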
