VidVec: Unlocking Video MLLM Embeddings for Video-Text Retrieval
Issar Tzachor, Dvir Samuel, Rami Ben-Ari

TL;DR
This paper introduces VidVec, a method that leverages pre-trained multimodal large language models for zero-shot video-text retrieval, outperforming existing approaches without additional training.
Contribution
The study shows that intermediate layers of a pre-trained MLLM already encode substantial task-relevant information, and proposes a lightweight text-based alignment strategy for video-text embedding.
Findings
Achieves state-of-the-art zero-shot video retrieval performance.
Outperforms existing methods by a substantial margin.
No fine-tuning beyond the text-based alignment step is required.
Abstract
Recent studies have adapted generative Multimodal Large Language Models (MLLMs) into embedding extractors for vision tasks, typically through fine-tuning to produce universal representations. However, their performance on video remains inferior to Video Foundation Models (VFMs). In this paper, we focus on leveraging MLLMs for video-text embedding and retrieval. We first conduct a systematic layer-wise analysis, showing that intermediate (pre-trained) MLLM layers already encode substantial task-relevant information. Leveraging this insight, we demonstrate that combining intermediate-layer embeddings with a calibrated MLLM head yields strong zero-shot retrieval performance without any training. Building on these findings, we introduce a lightweight text-based alignment strategy which maps dense video captions to short summaries and enables task-related video-text embedding learning…
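The layer-wise analysis described in the abstract can be pictured as a sweep: pool each layer's hidden states into one embedding per item, rank videos by cosine similarity to each text query, and compare retrieval recall across layers. The following is a minimal sketch of that idea, not the authors' implementation. Mean pooling, the function names, and the synthetic hidden states are all assumptions made for illustration; the calibrated MLLM head and the text-based alignment step are not modeled.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def pool_layer(hidden_states, layer):
    """Mean-pool the token vectors of one layer into a unit-length embedding.

    hidden_states[layer][token] is a list of floats. Mean pooling is an
    assumption here; the source does not state which pooling is used.
    """
    tokens = hidden_states[layer]
    dim = len(tokens[0])
    mean = [sum(tok[d] for tok in tokens) / len(tokens) for d in range(dim)]
    return l2_normalize(mean)


def recall_at_k(text_embs, video_embs, k):
    """Fraction of text queries whose paired video (same index) is in the top k."""
    hits = 0
    for i, text in enumerate(text_embs):
        # Embeddings are unit length, so the dot product is the cosine similarity.
        scores = [sum(a * b for a, b in zip(text, video)) for video in video_embs]
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        hits += i in ranked[:k]
    return hits / len(text_embs)


def sweep_layers(text_states, video_states, k=1):
    """Compute text-to-video recall@k separately for every layer index."""
    results = {}
    for layer in range(len(text_states[0])):
        text_embs = [pool_layer(h, layer) for h in text_states]
        video_embs = [pool_layer(h, layer) for h in video_states]
        results[layer] = recall_at_k(text_embs, video_embs, k)
    return results


def _noisy_layers(latent, noise_per_layer, n_tokens, rng):
    """Build per-layer token vectors as the latent plus Gaussian noise."""
    return [
        [[z + noise * rng.gauss(0, 1) for z in latent] for _ in range(n_tokens)]
        for noise in noise_per_layer
    ]


def make_toy_states(n_items, noise_per_layer, dim=8, n_tokens=3, seed=0):
    """Synthetic stand-in for MLLM hidden states (hypothetical data).

    Each text/video pair shares a random latent vector. The noise levels are
    arbitrary and say nothing about how real MLLM layers behave.
    """
    rng = random.Random(seed)
    text_states, video_states = [], []
    for _ in range(n_items):
        latent = [rng.gauss(0, 1) for _ in range(dim)]
        text_states.append(_noisy_layers(latent, noise_per_layer, n_tokens, rng))
        video_states.append(_noisy_layers(latent, noise_per_layer, n_tokens, rng))
    return text_states, video_states


if __name__ == "__main__":
    texts, videos = make_toy_states(n_items=20, noise_per_layer=[3.0, 0.5, 1.5])
    for layer, recall in sweep_layers(texts, videos, k=1).items():
        print(f"layer {layer}: recall@1 = {recall:.2f}")
```

With a real model, the toy generator would be replaced by per-layer hidden states extracted from the MLLM for each caption and each video, and the sweep would indicate which layer gives the most useful embedding for retrieval.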
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Topic Modeling
