DocReRank: Single-Page Hard Negative Query Generation for Training Multi-Modal RAG Rerankers

Navve Wasserman; Oliver Heinimann; Yuval Golbari; Tal Zimbalist; Eli Schwartz; Michal Irani

arXiv:2505.22584·cs.IR·May 29, 2025

Open Access · 1 Model · 3 Datasets

TL;DR

This paper introduces a novel method for training multimodal rerankers by generating hard negative queries per page using an LLM-VLM pipeline, leading to improved retrieval performance.

Contribution

It proposes a new approach to hard negative mining by generating negative queries per page, enhancing diversity and control in training data for rerankers.

Findings

01 Generated negatives are more diverse and harder than retriever-mined negatives.

02 Rerankers trained with generated negatives outperform existing models.

03 Retrieval performance improves significantly.

Abstract

Rerankers play a critical role in multimodal Retrieval-Augmented Generation (RAG) by refining the ranking of an initial set of retrieved documents. Rerankers are typically trained using hard negative mining, whose goal is to select, for each query, pages that rank high but are actually irrelevant. However, this selection process is typically passive and restricted to what the retriever can find in the available corpus, leading to several inherent limitations: limited diversity, negative examples that are often not hard enough, low controllability, and frequent false negatives that harm training. Our paper proposes an alternative approach, Single-Page Hard Negative Query Generation, which reverses the direction: instead of retrieving negative pages per query, we generate hard negative queries per page. Using an automated LLM-VLM pipeline, and given a page and its positive…
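The reversed direction described in the abstract can be illustrated with a minimal sketch. The abstract is truncated here, so the details of the LLM-VLM pipeline are not shown; the code below only illustrates the data shape, pairing one page with its positive query and with generated negative queries. All names (`PageExample`, `build_training_pairs`, `toy_generator`) are hypothetical, and the toy generator is a stand-in, not the authors' pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PageExample:
    page_id: str         # identifier of a single document page (hypothetical field)
    positive_query: str  # a query that this page actually answers


# Hypothetical signature: (page_id, positive_query) -> list of negative queries.
NegativeGenerator = Callable[[str, str], List[str]]


def build_training_pairs(
    examples: List[PageExample],
    generate_negatives: NegativeGenerator,
) -> List[Tuple[str, str, int]]:
    """Return (query, page_id, label) triples; label 1 = relevant, 0 = irrelevant."""
    pairs: List[Tuple[str, str, int]] = []
    for ex in examples:
        # The page is fixed; only the query varies between positive and negatives.
        pairs.append((ex.positive_query, ex.page_id, 1))
        for neg in generate_negatives(ex.page_id, ex.positive_query):
            pairs.append((neg, ex.page_id, 0))
    return pairs


def toy_generator(page_id: str, positive_query: str) -> List[str]:
    # Stand-in for a generation pipeline: a fixed rewrite, for illustration only.
    return [positive_query.replace("2023", "2022")]


pairs = build_training_pairs(
    [PageExample("report_p4", "What was the revenue in 2023?")],
    toy_generator,
)
```

In this framing each negative is a near-miss query for the same page, which contrasts with classic hard negative mining, where the query is fixed and the negatives are other pages retrieved from the corpus.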

Code & Models

Models

DocReRank/DocReRank-Reranker (Hugging Face model)

Taxonomy

Topics: Information Retrieval and Search Behavior · Topic Modeling · Multimodal Machine Learning Applications

Methods: Sparse Evolutionary Training