Indexing Multimodal Language Models for Large-scale Image Retrieval
Bahey Tharwat, Giorgos Kordopatis-Zilos, Pavel Suma, Ian Reid, Giorgos Tolias

TL;DR
This paper explores using Multimodal Large Language Models as zero-shot similarity estimators for large-scale image retrieval, demonstrating their robustness and potential as an alternative to task-specific re-rankers.
Contribution
It introduces a novel approach that prompts MLLMs with paired images to perform zero-shot image re-ranking without fine-tuning or specialized architectures.
Findings
MLLMs outperform task-specific re-rankers on benchmarks outside the re-rankers' native domains.
MLLMs show robustness to clutter, occlusion, and small objects.
The approach enables scalable, training-free large-scale image retrieval.
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated strong cross-modal reasoning capabilities, yet their potential for vision-only tasks remains underexplored. We investigate MLLMs as training-free similarity estimators for instance-level image-to-image retrieval. Our approach prompts the model with paired images and converts next-token probabilities into similarity scores, enabling zero-shot re-ranking within large-scale retrieval pipelines. This design avoids specialized architectures and fine-tuning, leveraging the rich visual discrimination learned during multimodal pre-training. We address scalability by combining MLLMs with memory-efficient indexing and top-k candidate re-ranking. Experiments across diverse benchmarks show that MLLMs outperform task-specific re-rankers outside their native domains and exhibit superior robustness to clutter, occlusion, and small objects.…
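Illustrative sketch
The abstract describes two steps: turning next-token probabilities for an image pair into a similarity score, and re-ranking only the top candidates from a first-stage index. The Python below is a minimal sketch of that idea, not the authors' implementation. It assumes the model is asked a yes/no question about the pair and that the score is the "Yes" probability renormalised against "No"; the paper's actual prompt, answer tokens, and normalisation are not given here. The names `yes_probability`, `rerank_top_k`, and `pair_logits` are hypothetical, and `pair_logits` stands in for a real MLLM forward pass.

```python
import math
from typing import Callable, List, Sequence, Tuple


def yes_probability(logit_yes: float, logit_no: float) -> float:
    """Two-way softmax over the next-token logits of 'Yes' and 'No'.

    Assumption: the similarity score is P(Yes) renormalised over {Yes, No}.
    """
    m = max(logit_yes, logit_no)  # subtract the max for numerical stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


def rerank_top_k(
    query: str,
    ranked_candidates: Sequence[str],
    pair_logits: Callable[[str, str], Tuple[float, float]],
    k: int,
) -> List[str]:
    """Re-order the first k candidates by pairwise score; keep the rest as-is.

    `pair_logits(query, candidate)` is a hypothetical stand-in for prompting
    an MLLM with both images and reading the (Yes, No) next-token logits.
    """
    head = list(ranked_candidates[:k])
    tail = list(ranked_candidates[k:])
    scored = [(yes_probability(*pair_logits(query, c)), c) for c in head]
    # list.sort is stable, so tied scores keep their first-stage order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [c for _, c in scored] + tail


if __name__ == "__main__":
    # Toy logits in place of a real model call (made-up values).
    toy = {"a": (0.0, 2.0), "b": (3.0, 0.0), "c": (1.0, 1.0), "d": (9.0, 0.0)}
    print(rerank_top_k("q", ["a", "b", "c", "d"], lambda q, c: toy[c], k=3))
```

Only the first k candidates cost a model call, so the expensive pairwise scoring stays bounded regardless of database size, while the first-stage index handles the full collection.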
