When Vision Meets Texts in Listwise Reranking
Hongyi Cai

TL;DR
This paper introduces Rank-Nexus, a lightweight multimodal reranker that effectively integrates visual and textual information for improved listwise reranking in information retrieval, overcoming modality gaps with a novel training strategy.
Contribution
It presents a new multimodal reranking method that combines a progressive cross-modal training strategy with a lightweight model, achieving high performance without relying on large-parameter models.
Findings
Outperforms existing rerankers on TREC and BEIR benchmarks.
Achieves strong results on image reranking benchmarks like INQUIRE and MMDocIR.
Uses only a 2B-parameter pretrained vision-language model.
Abstract
Recent advancements in information retrieval have highlighted the potential of integrating visual and textual information, yet effective reranking for image-text documents remains challenging due to the modality gap and scarcity of aligned datasets. Meanwhile, existing approaches often rely on large models (7B to 32B parameters) with reasoning-based distillation, incurring unnecessary computational overhead while primarily focusing on textual modalities. In this paper, we propose Rank-Nexus, a multimodal image-text document reranker that performs listwise qualitative reranking on retrieved lists incorporating both images and texts. To bridge the modality gap, we introduce a progressive cross-modal training strategy. We first train modalities separately: leveraging abundant text reranking data, we distill knowledge into the text branch. For images, where data is scarce, we construct…
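As a rough illustration of what listwise reranking means in practice, the sketch below reorders a mixed list of text and image candidates from a single ranking produced over the whole list. The output format (`[2] > [3] > [1]`), the function name `apply_listwise_ranking`, and the sample candidates are assumptions made for illustration; the abstract does not state how Rank-Nexus encodes its rankings.

```python
import re


def apply_listwise_ranking(candidates, ranking_text):
    """Reorder candidates using a permutation string such as '[2] > [1] > [3]'.

    The bracketed 1-based identifier format is an assumption for this sketch,
    not something the source specifies.
    """
    order = []
    for match in re.findall(r"\[(\d+)\]", ranking_text):
        idx = int(match) - 1  # identifiers are 1-based
        # Ignore out-of-range identifiers and repeated identifiers.
        if 0 <= idx < len(candidates) and idx not in order:
            order.append(idx)
    # Candidates the ranking omitted keep their original relative order.
    order += [i for i in range(len(candidates)) if i not in order]
    return [candidates[i] for i in order]


# Hypothetical retrieved list mixing text and image documents.
candidates = ["text: passage A", "image: photo_17.jpg", "text: passage B"]
reranked = apply_listwise_ranking(candidates, "[2] > [3] > [1]")
print(reranked)
```

The fallback for omitted or repeated identifiers is a design choice of this sketch: it guarantees the result is always a full permutation of the input list, even when the ranking string is malformed.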
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Generative Adversarial Networks and Image Synthesis
