SyNeg: LLM-Driven Synthetic Hard-Negatives for Dense Retrieval

Xiaopeng Li; Xiangyang Li; Hao Zhang; Zhaocheng Du; Pengyue Jia; Yichao Wang; Xiangyu Zhao; Huifeng Guo; Ruiming Tang

arXiv:2412.17250·cs.IR·December 24, 2024

Open Access

TL;DR

This paper introduces SyNeg, a novel framework that uses large language models to generate high-quality synthetic hard negatives, significantly enhancing dense retrieval performance and training stability.

Contribution

We propose a multi-attribute self-reflection prompting strategy and a hybrid sampling method leveraging LLMs to synthesize effective hard negatives for dense retrieval.

Findings

1. Improved retrieval accuracy on five benchmark datasets.
2. Enhanced training stability with synthetic hard negatives.
3. Demonstrated the effectiveness of LLM-generated negatives in dense retrieval.

Abstract

The performance of dense retrieval (DR) is significantly influenced by the quality of negative sampling. Traditional DR methods primarily depend on naive negative sampling techniques or on mining hard negatives through an external retriever and meticulously crafted strategies. However, naive negative sampling often fails to capture the boundaries between positive and negative samples, whereas existing hard negative sampling methods are prone to false negatives, resulting in performance degradation and training instability. Recent advancements in large language models (LLMs) offer a solution to these challenges by generating contextually rich and diverse negative samples. In this work, we present a framework that harnesses LLMs to synthesize high-quality hard negative samples. We first devise a multi-attribute self-reflection prompting strategy to…
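To make the role of negative quality concrete, here is a minimal sketch of an InfoNCE-style contrastive loss, a standard objective in dense retrieval. It is not taken from the paper: the function name `info_nce_loss`, the temperature default, and the toy 2-dimensional embeddings are all assumptions chosen for illustration. It shows that a negative lying close to the positive (a hard negative) yields a larger loss, and therefore a stronger training signal, than an easy one.

```python
import math


def info_nce_loss(query, positive, negatives, temperature=1.0):
    """Contrastive loss for one query: negative log-softmax of the positive score.

    All vectors are plain lists of floats; similarity is the dot product.
    This is an illustrative sketch, not the paper's training objective.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # The positive score goes first, followed by one score per negative.
    scores = [dot(query, positive)] + [dot(query, n) for n in negatives]
    logits = [s / temperature for s in scores]

    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]


# Toy embeddings (hypothetical values, for illustration only).
query = [1.0, 0.0]
positive = [0.9, 0.1]
easy_negative = [-1.0, 0.0]   # far from the query
hard_negative = [0.8, 0.3]    # nearly as close to the query as the positive

easy_loss = info_nce_loss(query, positive, [easy_negative])
hard_loss = info_nce_loss(query, positive, [hard_negative])
print(f"easy negative loss: {easy_loss:.3f}")
print(f"hard negative loss: {hard_loss:.3f}")
```

The same sketch also hints at the false-negative problem the abstract mentions: if a "negative" is actually relevant to the query, the loss pushes the model away from a passage it should retrieve.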

Taxonomy

Topics: Handwritten Text Recognition Techniques