Towards Robust Text Retrieval with Progressive Learning
Tong Wu, Yulei Qin, Enwei Zhang, Zihan Xu, Yuting Gao, Ke Li, Xing Sun

TL;DR
This paper introduces PEG, a progressive learning-based embedding model for robust text retrieval that scales training data, incorporates hard negatives, and dynamically adjusts focus during training, outperforming existing models across multiple domains.
Contribution
The paper proposes PEG, a novel embedding training method with increased negative samples, hard negatives, and a dynamic learning mechanism, improving retrieval robustness and generalization.
Findings
PEG outperforms state-of-the-art embeddings on C-MTEB and DuReader benchmarks.
Training on more than 100 million samples from diverse sources improves domain coverage.
Progressive learning enhances embedding quality and retrieval accuracy.
Abstract
Retrieval augmentation has become an effective way to equip large language models (LLMs) with external, verified knowledge sources from a database, which overcomes the limitations and hallucinations of LLMs in handling up-to-date and domain-specific information. However, existing embedding models for text retrieval usually have three non-negligible limitations. First, the number and diversity of samples in a batch are too restricted to supervise the modeling of textual nuances at scale. Second, the high proportion of noise is detrimental to the semantic correctness and consistency of embeddings. Third, treating easy and difficult samples equally leads to sub-optimal convergence of embeddings and poorer generalization. In this paper, we propose PEG, progressively learned embeddings for robust text retrieval. Specifically, we increase the training in-batch…
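The three ingredients named in the abstract (more in-batch negatives, hard negatives, and unequal treatment of easy and difficult samples) can be illustrated with a small contrastive-loss sketch. This is not the authors' implementation: the abstract is cut off before it gives details, so the InfoNCE form of the loss, the softmax-over-losses weighting, and the names `weighted_info_nce`, `temperature`, and `gamma` are all assumptions chosen for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def softmax(values):
    m = max(values)
    exps = [math.exp(x - m) for x in values]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_info_nce(queries, positives, hard_negatives,
                      temperature=0.05, gamma=1.0):
    """Hypothetical difficulty-weighted contrastive loss (illustration only).

    Each query is scored against every positive in the batch (in-batch
    negatives) and every hard negative. Per-sample InfoNCE losses are then
    combined with weights that grow with the loss, so harder samples
    receive more focus. gamma = 0 gives the plain mean.
    """
    q = [normalize(x) for x in queries]
    p = [normalize(x) for x in positives]
    h = [normalize(x) for x in hard_negatives]
    candidates = p + h  # the positive for query i sits at index i

    losses = []
    for i, qi in enumerate(q):
        logits = [dot(qi, c) / temperature for c in candidates]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_z - logits[i])  # -log softmax at the positive

    # Assumed weighting scheme: softmax over the scaled per-sample losses.
    weights = softmax([gamma * l for l in losses])
    total = sum(w * l for w, l in zip(weights, losses))
    return total, losses, weights


# Toy batch: the second query has a hard negative closer than its positive.
queries = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.1], [0.5, 1.0]]
hard_negatives = [[0.9, 0.3], [0.2, 1.0]]

total, losses, weights = weighted_info_nce(queries, positives, hard_negatives)
print(total, losses, weights)
```

In this toy batch the second query is the difficult one, so it receives the larger weight. In a real training loop the weights would normally be treated as constants (no gradient) and `gamma` could be scheduled over training, but both choices are again assumptions rather than details taken from the paper.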
Taxonomy
Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Advanced Graph Neural Networks
