Hard Negatives, Hard Lessons: Revisiting Training Data Quality for Robust Information Retrieval with LLMs

Nandan Thakur; Crystina Zhang; Xueguang Ma; Jimmy Lin

arXiv:2505.16967 · cs.IR · October 21, 2025


TL;DR

This paper investigates the impact of training data quality on retrieval models, demonstrating that relabeling false negatives with LLMs enhances model performance and emphasizing the importance of data curation.

Contribution

It introduces a cost-effective method using LLMs to identify and relabel false negatives in training data, improving retrieval model effectiveness.

Findings

01 Relabeling false negatives improves retrieval performance by up to 1.8 points in nDCG@10.

02 Pruning 8 of the 15 datasets in the BGE collection shrinks the training set by 2.35 times yet raises nDCG@10 on BEIR by 1.0 point.

03 LLMs reliably identify false negatives, as validated by human annotation.
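
The findings above are reported in nDCG@10. As a reminder of what that metric measures, here is a minimal sketch, assuming linear gain and a log2 rank discount; the function names are illustrative, and the evaluation toolkit behind the reported numbers may differ in details such as tie handling.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k results (linear gain assumed)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: Sequence[float],
              all_relevances: Sequence[float],
              k: int = 10) -> float:
    """nDCG@k: DCG of the system ranking divided by DCG of the ideal ranking.

    ranked_relevances: relevance labels in the order the system returned them.
    all_relevances: labels of every judged passage for the query, in any order.
    """
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant passage exists, so the score is defined as 0
    return dcg_at_k(ranked_relevances, k) / ideal


# Two relevant passages, retrieved at ranks 1 and 3.
score = ndcg_at_k([1, 0, 1], [1, 1, 0])
```

Because the ideal ranking is built from the labels, a relevant passage that is mislabeled as irrelevant distorts both the training signal and, if it also appears in evaluation data, the score itself.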

Abstract

Training robust retrieval and reranker models typically relies on large-scale retrieval datasets; for example, the BGE collection contains 1.6 million query-passage pairs sourced from various data sources. However, we find that certain datasets can negatively impact model effectiveness: pruning 8 out of 15 datasets from the BGE collection reduces the training set size by 2.35$\times$ and, surprisingly, increases nDCG@10 on BEIR by 1.0 point. This motivates a deeper examination of training data quality, with a particular focus on "false negatives", where relevant passages are incorrectly labeled as irrelevant. We utilize LLMs as a simple, cost-effective approach to identify and relabel false negatives in training datasets. Experimental results show that relabeling false negatives as true positives improves both E5 (base) and Qwen2.5-7B retrieval models by 0.7–1.4 points on…
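
The relabeling idea in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: `TrainingExample`, `relabel_false_negatives`, and `toy_judge` are hypothetical names, and the keyword-based judge merely stands in for an LLM relevance call so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TrainingExample:
    """One query with its labeled positive and hard-negative passages."""
    query: str
    positives: List[str]
    negatives: List[str]


def relabel_false_negatives(
    example: TrainingExample,
    judge: Callable[[str, str], bool],
) -> TrainingExample:
    """Move negatives that the judge deems relevant into the positives.

    `judge(query, passage)` is a hypothetical stand-in for an LLM relevance
    judgment; it returns True when the passage is relevant to the query.
    """
    promoted, kept = [], []
    for passage in example.negatives:
        (promoted if judge(example.query, passage) else kept).append(passage)
    return TrainingExample(example.query, example.positives + promoted, kept)


def toy_judge(query: str, passage: str) -> bool:
    # Assumption for illustration only: a real judge would prompt an LLM.
    return "paris" in passage.lower()


example = TrainingExample(
    query="What is the capital of France?",
    positives=["Paris is the capital of France."],
    negatives=[
        "The capital of France is Paris, on the Seine.",  # false negative
        "Bananas are rich in potassium.",                 # true negative
    ],
)
cleaned = relabel_false_negatives(example, toy_judge)
```

Promoting a detected false negative to a positive, as done here, matches the "relabeling false negatives as true positives" wording of the abstract; simply dropping such passages would be a different design choice.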


Taxonomy

Topics: Natural Language Processing Techniques

Methods: Focus · Pruning · Sparse Evolutionary Training