UniRank: End-to-End Domain-Specific Reranking of Hybrid Text-Image Candidates

Yupei Yang; Lin Yang; Wanxi Deng; Lin Qu; Shikui Tu; Lei Xu

arXiv:2603.29897·cs.IR·April 1, 2026

TL;DR

UniRank is a novel domain-specific multimodal reranking framework that directly scores hybrid text-image candidates, improving retrieval performance without modality conversion or extensive domain data.

Contribution

It introduces a unified scoring interface and an end-to-end domain adaptation pipeline for effective multimodal reranking in domain-specific scenarios.
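As a rough illustration of what a unified scoring interface could look like, the sketch below ranks text and image candidates through a single scoring callable, with no conversion of one modality into the other. All names here (`TextCandidate`, `ImageCandidate`, `rerank`, `toy_score`) are hypothetical and are not taken from the paper; `toy_score` is a token-overlap placeholder standing in for a real VLM-based scorer.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Set, Union


@dataclass(frozen=True)
class TextCandidate:
    doc_id: str
    text: str


@dataclass(frozen=True)
class ImageCandidate:
    doc_id: str
    image_path: str  # hypothetical: location of the raw image


Candidate = Union[TextCandidate, ImageCandidate]
# One signature for both modalities: (query, candidate) -> relevance score.
Scorer = Callable[[str, Candidate], float]


def _tokens(s: str) -> Set[str]:
    """Lowercase alphanumeric tokens of a string."""
    return {t for t in re.split(r"[^a-z0-9]+", s.lower()) if t}


def toy_score(query: str, cand: Candidate) -> float:
    """Placeholder scorer (assumption): token overlap with the query.

    A real system would call a vision-language model here; the image
    branch only inspects the file name so the example stays runnable.
    """
    content = cand.text if isinstance(cand, TextCandidate) else cand.image_path
    return float(len(_tokens(query) & _tokens(content)))


def rerank(query: str, candidates: List[Candidate], score: Scorer) -> List[Candidate]:
    """Sort a hybrid candidate list by descending score."""
    return sorted(candidates, key=lambda c: score(query, c), reverse=True)


candidates = [
    TextCandidate("t1", "graph neural networks"),
    ImageCandidate("i1", "figures/reranking_pipeline.png"),
    TextCandidate("t2", "multimodal reranking pipeline overview"),
]
ranked = rerank("multimodal reranking pipeline", candidates, toy_score)
print([c.doc_id for c in ranked])
```

The point of the design is that `rerank` never branches on modality: only the scorer needs to know how to read each candidate type.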

Findings

01

Outperforms state-of-the-art baselines in scientific literature retrieval and patent search.

02

Improves Recall@1 by 8.9% on scientific literature retrieval and by 7.3% on patent search.

03

Demonstrates effective cross-modal relevance scoring without modality conversion.
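For readers unfamiliar with the Recall@1 metric cited in the findings, here is a minimal sketch of one common definition: the fraction of queries whose top-k reranked results contain at least one relevant item. The paper's exact evaluation protocol is not given on this page, so this definition and the names `recall_at_k`, `rankings`, and `relevant` are assumptions for illustration only.

```python
from typing import Dict, List, Set


def recall_at_k(
    rankings: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int = 1,
) -> float:
    """Fraction of queries with at least one relevant item in the top k.

    rankings: query id -> candidate ids, best first.
    relevant: query id -> set of relevant candidate ids.
    """
    if not rankings:
        return 0.0
    hits = 0
    for qid, ranked in rankings.items():
        if set(ranked[:k]) & relevant.get(qid, set()):
            hits += 1
    return hits / len(rankings)


# Toy data: q1 has its relevant item ranked first, q2 has it ranked second.
rankings = {"q1": ["a", "b"], "q2": ["c", "d"]}
relevant = {"q1": {"a"}, "q2": {"d"}}
print(recall_at_k(rankings, relevant, k=1))
print(recall_at_k(rankings, relevant, k=2))
```

With k = 1 the metric only rewards a reranker for placing a relevant candidate in the very first position, which is why it is sensitive to any bias toward one modality.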

Abstract

Reranking is a critical component in many information retrieval pipelines. Despite remarkable progress in text-only settings, multimodal reranking remains challenging, particularly when the candidate set contains hybrid text and image items. A key difficulty is the modality gap: a text reranker is intrinsically closer to text candidates than to image candidates, leading to biased and suboptimal cross-modal ranking. Vision-language models (VLMs) mitigate this gap through strong cross-modal alignment and have recently been adopted to build multimodal rerankers. However, most VLM-based rerankers encode all candidates as images, and treating text as images introduces substantial computational overhead. Meanwhile, existing open-source multimodal rerankers are typically trained on general-domain data and often underperform in domain-specific scenarios. To address these limitations, we propose…
