VidVec: Unlocking Video MLLM Embeddings for Video-Text Retrieval

Issar Tzachor; Dvir Samuel; Rami Ben-Ari

arXiv:2602.08099·cs.CV·February 10, 2026

Open Access

TL;DR

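This paper introduces VidVec, a method that turns pre-trained multimodal large language models (MLLMs) into video-text embedding extractors for zero-shot retrieval. It relies on intermediate-layer embeddings and a lightweight text-based alignment step rather than fine-tuning on video, and outperforms existing approaches.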

Contribution

The study shows that intermediate layers of pre-trained MLLMs already encode substantial task-relevant information, and proposes a lightweight text-based alignment strategy for video-text embedding.

Findings

01. Achieves state-of-the-art zero-shot video retrieval performance.

02. Outperforms existing methods by a substantial margin.

03. Requires no fine-tuning beyond the text-based alignment.

Abstract

Recent studies have adapted generative Multimodal Large Language Models (MLLMs) into embedding extractors for vision tasks, typically through fine-tuning to produce universal representations. However, their performance on video remains inferior to Video Foundation Models (VFMs). In this paper, we focus on leveraging MLLMs for video-text embedding and retrieval. We first conduct a systematic layer-wise analysis, showing that intermediate (pre-trained) MLLM layers already encode substantial task-relevant information. Leveraging this insight, we demonstrate that combining intermediate-layer embeddings with a calibrated MLLM head yields strong zero-shot retrieval performance without any training. Building on these findings, we introduce a lightweight text-based alignment strategy which maps dense video captions to short summaries and enables task-related video-text embedding learning…
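
The layer-wise analysis mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes per-layer text and video embeddings have already been extracted and pooled into arrays of shape (layers, items, dim), that text i is paired with video i, and that Recall@K under cosine similarity is the metric of interest. The helper names (`recall_at_k`, `layerwise_recall`, `make_toy_layers`) and the synthetic data are hypothetical and stand in for real MLLM hidden states.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def recall_at_k(text_emb, video_emb, k=1):
    """Text-to-video Recall@K, assuming text i is paired with video i."""
    sims = l2_normalize(text_emb) @ l2_normalize(video_emb).T  # (N, N)
    # Rank of the paired video = number of videos scoring strictly higher.
    paired = np.diag(sims)[:, None]
    ranks = (sims > paired).sum(axis=1)
    return float((ranks < k).mean())


def layerwise_recall(text_layers, video_layers, k=1):
    """Evaluate retrieval separately for every layer's embeddings."""
    return [recall_at_k(t, v, k) for t, v in zip(text_layers, video_layers)]


def make_toy_layers(num_layers=4, n=32, d=16, good_layer=2, seed=0):
    """Synthetic stand-in for pooled hidden states (assumption, not real data).

    Only `good_layer` has video embeddings aligned with the text embeddings;
    every other layer is independent Gaussian noise.
    """
    rng = np.random.default_rng(seed)
    text = rng.normal(size=(num_layers, n, d))
    video = rng.normal(size=(num_layers, n, d))
    video[good_layer] = text[good_layer] + 0.01 * rng.normal(size=(n, d))
    return text, video


text_layers, video_layers = make_toy_layers()
scores = layerwise_recall(text_layers, video_layers, k=1)
best_layer = int(np.argmax(scores))
print("Recall@1 per layer:", scores)
print("Best layer:", best_layer)
```

In a real analysis the toy arrays would be replaced by hidden states taken from each layer of the MLLM, and the per-layer scores would show which depth is most useful for retrieval.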

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Topic Modeling