Vote-in-Context: Turning VLMs into Zero-Shot Rank Fusers
Mohamed Eltahir, Ali Habibullah, Lama Ayash, Tanveer Hussain, Naeemullah Khan

TL;DR
This paper introduces Vote-in-Context (ViC), a training-free, zero-shot framework that enhances multi-modal retrieval by rethinking list-wise reranking and fusion as a reasoning task within vision-language models, achieving state-of-the-art results.
Contribution
The paper presents ViC, a novel zero-shot, training-free approach that serializes content and metadata into prompts for VLMs to improve multi-modal retrieval performance.
Findings
ViC significantly improves retrieval precision across benchmarks.
Achieves up to +40 Recall@1 over previous methods.
Establishes new state-of-the-art zero-shot retrieval results.
Abstract
In the retrieval domain, candidates' fusion from heterogeneous retrievers is a long-standing challenge, particularly for complex, multi-modal data such as videos. While typical fusion techniques are training-free, they rely solely on rank or score signals, disregarding candidates' representations. This work introduces Vote-in-Context (ViC), a generalized, training-free framework that re-thinks list-wise reranking and fusion as a zero-shot reasoning task for a Vision-Language Model (VLM). The core insight is to serialize both content evidence and retriever metadata directly within the VLM's prompt, allowing the model to adaptively weigh retriever consensus against visual-linguistic content. We demonstrate the generality of this framework by applying it to the challenging domain of cross-modal video retrieval. To this end, we introduce the S-Grid, a compact serialization map that…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
