Retrieve Fast, Rerank Smart: Cooperative and Joint Approaches for Improved Cross-Modal Retrieval
Gregor Geigle, Jonas Pfeiffer, Nils Reimers, Ivan Vulić, Iryna Gurevych

TL;DR
This paper introduces a cooperative retrieve-and-rerank framework that enhances cross-modal retrieval by combining efficient bi-encoders for initial retrieval with a cross-encoder for refined ranking, achieving better accuracy and efficiency.
Contribution
It presents a novel fine-tuning approach that transforms pretrained multi-modal models into efficient retrieval systems using shared-weight bi-encoders and cross-encoders.
Findings
Improved retrieval accuracy across multiple benchmarks.
Significant reduction in retrieval latency.
Effective joint fine-tuning of components enhances performance.
Abstract
Current state-of-the-art approaches to cross-modal retrieval process text and visual input jointly, relying on Transformer-based architectures with cross-attention mechanisms that attend over all words and objects in an image. While offering unmatched retrieval performance, such models: 1) are typically pretrained from scratch and thus less scalable, 2) suffer from huge retrieval latency and inefficiency issues, which makes them impractical in realistic applications. To address these crucial gaps towards both improved and efficient cross-modal retrieval, we propose a novel fine-tuning framework that turns any pretrained text-image multi-modal model into an efficient retrieval model. The framework is based on a cooperative retrieve-and-rerank approach which combines: 1) twin networks (i.e., a bi-encoder) to separately encode all items of a corpus, enabling efficient initial retrieval,…
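The two-stage idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy 2-d embeddings, and the lookup-table `toy_cross_score` are all hypothetical stand-ins. A real system would obtain the embeddings from a fine-tuned bi-encoder and the rerank scores from a cross-encoder that processes the query and each candidate jointly.

```python
import numpy as np

def retrieve_top_k(query_emb, corpus_embs, k):
    """Stage 1: cosine-similarity search over pre-computed bi-encoder embeddings."""
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    sims = c @ q                          # one similarity per corpus item
    k = min(k, len(sims))
    top = np.argsort(-sims)[:k]           # indices of the k most similar items
    return top, sims[top]

def rerank(query, candidates, cross_score):
    """Stage 2: rescore only the retrieved candidates with a slower joint scorer."""
    scores = np.array([cross_score(query, int(i)) for i in candidates])
    order = np.argsort(-scores)
    return candidates[order], scores[order]

def retrieve_and_rerank(query, query_emb, corpus_embs, cross_score, k=3):
    candidates, _ = retrieve_top_k(query_emb, corpus_embs, k)
    return rerank(query, candidates, cross_score)

# Toy data (assumed for illustration): four corpus items with 2-d embeddings.
corpus_embs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]])
query_emb = np.array([1.0, 0.0])

calls = []  # records which items the expensive scorer was asked about

def toy_cross_score(query, item_id):
    """Hypothetical stand-in for a cross-encoder: a fixed score per item."""
    calls.append(item_id)
    return {0: 0.2, 1: 0.9, 2: 0.99, 3: 0.5}[item_id]

ranked, scores = retrieve_and_rerank("a query", query_emb, corpus_embs,
                                     toy_cross_score, k=2)
print(ranked, scores, sorted(set(calls)))
```

In this sketch the expensive scorer runs only on the k candidates returned by the first stage, which is where the latency saving comes from. The trade-off is that an item missed by the first stage (item 2 here, despite its high toy rerank score) can never be recovered by reranking.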
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
