Remining Hard Negatives for Generative Pseudo Labeled Domain Adaptation

Goksenin Yuksel; David Rau; Jaap Kamps

arXiv:2501.14434·cs.IR·January 27, 2025

Open Access

TL;DR

This paper improves domain adaptation for dense retrieval by analyzing and refreshing the hard negatives used during GPL training, which boosts ranking performance on 13/14 BEIR datasets and 9/12 LoTTE datasets.

Contribution

It introduces a novel hard-negative re-mining method during GPL training, enhancing the robustness of dense retrievers in cross-domain settings.

Findings

01 Boosts ranking performance on 13/14 BEIR datasets

02 Improves results on 9/12 LoTTE datasets

03 Analyzes the impact of hard negatives on domain adaptation

Abstract

Dense retrievers have demonstrated significant potential for neural information retrieval; however, they lack robustness to domain shifts, which limits their efficacy in zero-shot settings across diverse domains. A state-of-the-art domain adaptation technique is Generative Pseudo Labeling (GPL). GPL uses synthetic query generation and initially mined hard negatives to distill knowledge from a cross-encoder into dense retrievers in the target domain. In this paper, we analyze the documents retrieved by the domain-adapted model and discover that these are more relevant to the target queries than those of the non-domain-adapted model. We then propose refreshing the hard-negative index during the knowledge distillation phase to mine better hard negatives. Our remining R-GPL approach boosts ranking performance on 13/14 BEIR datasets and 9/12 LoTTE datasets. Our contributions are…
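The refresh idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the refresh schedule (`refresh_every`), the number of negatives (`k`), the dot-product scoring, and the margin-based distillation loss are all assumptions made for illustration. The abstract only states that the hard-negative index is refreshed during knowledge distillation.

```python
import random


def dot(u, v):
    """Dot-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mine_hard_negatives(query_vec, doc_vecs, positive_id, k):
    """Return the ids of the k highest-scoring documents, excluding the positive."""
    candidates = [doc_id for doc_id in doc_vecs if doc_id != positive_id]
    candidates.sort(key=lambda doc_id: dot(query_vec, doc_vecs[doc_id]), reverse=True)
    return candidates[:k]


def margin_mse(student_pos, student_neg, teacher_pos, teacher_neg):
    """Squared difference between student and teacher score margins.

    Assumption: a margin-based distillation loss; the abstract does not name the loss.
    """
    return ((student_pos - student_neg) - (teacher_pos - teacher_neg)) ** 2


def train_with_remining(pairs, corpus, encode_query, encode_doc, train_step,
                        steps, refresh_every, k):
    """Hypothetical training loop that re-mines hard negatives periodically.

    pairs:      list of (query_text, positive_doc_id) tuples (synthetic queries)
    corpus:     dict mapping doc_id -> document text
    train_step: callback that updates the student on one (query, pos, neg) triple
    """
    negatives = {}
    for step in range(steps):
        if step % refresh_every == 0:
            # Re-encode the corpus with the current student and rebuild the
            # hard-negative lists, instead of mining them once before training.
            doc_vecs = {doc_id: encode_doc(text) for doc_id, text in corpus.items()}
            negatives = {
                query: mine_hard_negatives(encode_query(query), doc_vecs, pos_id, k)
                for query, pos_id in pairs
            }
        for query, pos_id in pairs:
            neg_id = random.choice(negatives[query])
            train_step(query, pos_id, neg_id)
    return negatives
```

In this sketch, setting `refresh_every` to a value at least as large as `steps` mines negatives only once, which corresponds to the "initially mined hard negatives" setting that the abstract attributes to GPL.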

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Cancer-related molecular mechanisms research

Methods: Knowledge Distillation