TL;DR
ZipRerank is an efficient listwise multimodal reranker for long-document retrieval. It cuts inference latency by shortening the visual input and scoring all candidates in a single forward pass, while matching or surpassing state-of-the-art accuracy.
Contribution
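The paper introduces ZipRerank, a listwise multimodal reranker that targets two computational bottlenecks of VLM-based rerankers: long visual-token sequences and multi-step autoregressive decoding. It shortens the input with a lightweight query-image early interaction mechanism, scores all candidates in a single forward pass, and is trained with a two-stage strategy.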
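The summary does not say how the query-image interaction shortens the input, so the following is only a rough illustration of the general idea of query-conditioned input reduction, not the paper's mechanism. It assumes each visual token and the query are already embedded as plain vectors, and it keeps the tokens most similar to the query. The function names `cosine` and `select_visual_tokens`, and the top-k rule itself, are hypothetical stand-ins.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_visual_tokens(query_vec, visual_tokens, keep):
    """Keep the `keep` visual tokens most similar to the query.

    Hypothetical top-k pruning rule, used here only to illustrate
    query-conditioned input reduction. Kept tokens stay in their
    original order so positional structure is preserved.
    """
    scored = [(cosine(query_vec, tok), i) for i, tok in enumerate(visual_tokens)]
    # Sort by similarity (descending), breaking ties by original position.
    top = sorted(scored, key=lambda s: (-s[0], s[1]))[:keep]
    kept_idx = sorted(i for _, i in top)
    return kept_idx, [visual_tokens[i] for i in kept_idx]


# Toy data: a 2-d query and four visual-token embeddings.
query = [1.0, 0.0]
tokens = [[0.9, 0.1], [0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]
kept_idx, kept = select_visual_tokens(query, tokens, keep=2)
print(kept_idx)
```

With fewer visual tokens per candidate page, the sequence passed to the reranker is shorter, which is where the latency saving in this kind of scheme would come from.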
Findings
Matches or surpasses state-of-the-art accuracy in multimodal reranking.
Reduces LLM inference latency by up to an order of magnitude.
Makes listwise multimodal reranking practical for latency-sensitive real-world systems.
Abstract
Listwise reranking is a key yet computationally expensive component in vision-centric retrieval and multimodal retrieval-augmented generation (M-RAG) over long documents. While recent VLM-based rerankers achieve strong accuracy, their practicality is often limited by long visual-token sequences and multi-step autoregressive decoding. We propose ZipRerank, a highly efficient listwise multimodal reranker that directly addresses both bottlenecks. It reduces input length via a lightweight query-image early interaction mechanism and eliminates autoregressive decoding by scoring all candidates in a single forward pass. To enable effective learning, ZipRerank adopts a two-stage training strategy: (i) listwise pretraining on large-scale text data rendered as images, and (ii) multimodal finetuning with VLM-teacher-distilled soft-ranking supervision. Extensive experiments on the MMDocIR benchmark…
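To make the single-pass scoring and soft-ranking distillation ideas concrete, here is a minimal sketch under stated assumptions. It assumes the student reranker emits one scalar score per candidate in one forward pass, and that the teacher's soft ranking is available as a list of scores. The KL-divergence objective and the names `softmax`, `listwise_kl`, and `rank` are illustrative choices; the abstract does not specify the actual loss.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of candidate scores."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def listwise_kl(teacher_scores, student_scores, temperature=1.0):
    """KL(teacher || student) over the candidate list.

    Assumed stand-in for a soft-ranking distillation loss: both score
    lists are turned into distributions over candidates and compared.
    """
    p = softmax(teacher_scores, temperature)
    q = softmax(student_scores, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def rank(scores):
    """Candidate indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])


# Toy example: three candidates scored at once, with no token-by-token decoding.
teacher = [2.0, 0.5, -1.0]
student = [0.2, 1.5, -0.3]
loss = listwise_kl(teacher, student)
order = rank(student)
print(order, loss)
```

Because every candidate receives its score from the same forward pass, the final ranking is a simple sort of the scores rather than a generated permutation string.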
