Vote-in-Context: Turning VLMs into Zero-Shot Rank Fusers

Mohamed Eltahir; Ali Habibullah; Lama Ayash; Tanveer Hussain; Naeemullah Khan

arXiv:2511.01617·cs.CV·November 4, 2025

Open Access

TL;DR

This paper introduces Vote-in-Context (ViC), a training-free, zero-shot framework that improves multi-modal retrieval by recasting list-wise reranking and fusion as a reasoning task for a vision-language model (VLM), achieving state-of-the-art zero-shot retrieval results.

Contribution

The paper presents ViC, a novel zero-shot, training-free approach that serializes content evidence and retriever metadata into a VLM's prompt to improve multi-modal retrieval performance.

Findings

01. ViC significantly improves retrieval precision across benchmarks.

02. ViC achieves gains of up to +40 Recall@1 over previous methods.

03. ViC establishes new state-of-the-art zero-shot retrieval results.

Abstract

In the retrieval domain, candidates' fusion from heterogeneous retrievers is a long-standing challenge, particularly for complex, multi-modal data such as videos. While typical fusion techniques are training-free, they rely solely on rank or score signals, disregarding candidates' representations. This work introduces Vote-in-Context (ViC), a generalized, training-free framework that re-thinks list-wise reranking and fusion as a zero-shot reasoning task for a Vision-Language Model (VLM). The core insight is to serialize both content evidence and retriever metadata directly within the VLM's prompt, allowing the model to adaptively weigh retriever consensus against visual-linguistic content. We demonstrate the generality of this framework by applying it to the challenging domain of cross-modal video retrieval. To this end, we introduce the S-Grid, a compact serialization map that…
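The contrast drawn in the abstract can be illustrated with a minimal Python sketch. The first function is Reciprocal Rank Fusion (RRF), a standard fusion technique that uses only rank signals; it is shown here as a representative baseline, not as a method the paper is confirmed to compare against. The second function is a hypothetical text-only stand-in for the idea of serializing content evidence and retriever metadata into a prompt. The function names, prompt wording, retriever names, and text descriptions are all assumptions made for illustration; the paper's S-Grid serialization and actual prompt are not reproduced.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists using rank positions only (standard RRF)."""
    scores = defaultdict(float)
    for ranking in ranked_lists.values():
        for rank, cand in enumerate(ranking, start=1):
            scores[cand] += 1.0 / (k + rank)
    # Highest fused score first; candidate id breaks ties deterministically.
    return sorted(scores, key=lambda c: (-scores[c], c))


def build_fusion_prompt(query, ranked_lists, descriptions):
    """Hypothetical serializer: one line per candidate with its rank under
    each retriever (metadata) and a text description (content stand-in)."""
    candidates = sorted({c for r in ranked_lists.values() for c in r})
    lines = [f"Query: {query}", "Candidates:"]
    for cand in candidates:
        ranks = []
        for name in sorted(ranked_lists):
            ranking = ranked_lists[name]
            if cand in ranking:
                ranks.append(f"{name}=#{ranking.index(cand) + 1}")
            else:
                ranks.append(f"{name}=not retrieved")
        lines.append(
            f"- {cand} | ranks: {', '.join(ranks)} | content: {descriptions[cand]}"
        )
    lines.append("Return the candidate ids ordered from most to least relevant.")
    return "\n".join(lines)


# Toy data (assumed): two retrievers over four video candidates.
ranked_lists = {
    "retriever_a": ["v1", "v2", "v3"],
    "retriever_b": ["v2", "v4", "v1"],
}
descriptions = {
    "v1": "a dog running on a beach",
    "v2": "a dog catching a frisbee in a park",
    "v3": "a cat sleeping on a sofa",
    "v4": "people playing frisbee",
}

fused = reciprocal_rank_fusion(ranked_lists)
prompt = build_fusion_prompt("a dog catching a frisbee", ranked_lists, descriptions)
print(fused)
print(prompt)
```

RRF never looks at what a candidate contains, so its ordering depends only on where each retriever placed it. The prompt-based variant exposes both the per-retriever ranks and the content to the model, which is the property the abstract describes as weighing retriever consensus against visual-linguistic content. In a real setting the content would be visual evidence passed to a VLM rather than the text descriptions assumed here.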

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning