Pre-training with Large Language Model-based Document Expansion for Dense Passage Retrieval
Guangyuan Ma, Xing Wu, Peng Wang, Zijia Lin, Songlin Hu

TL;DR
This paper explores pre-training methods using Large Language Model-based document expansion to improve dense passage retrieval, demonstrating significant performance gains and strong zero-shot capabilities.
Contribution
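It introduces pre-training strategies that transfer knowledge from LLM-based document expansion (query generation) to dense retrievers, namely contrastive learning and bottlenecked query generation, plus a curriculum learning strategy that reduces reliance on LLM inference. The resulting retrievers perform well even when initialized with no human-labeled data.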
Findings
Significant boost in retrieval performance on large-scale web-search tasks.
Strong zero-shot and out-of-domain retrieval capabilities.
Effective reduction of LLM inference reliance through curriculum learning.
Abstract
In this paper, we systematically study the potential of pre-training with Large Language Model (LLM)-based document expansion for dense passage retrieval. Concretely, we leverage the capabilities of LLMs for document expansion, i.e., query generation, and effectively transfer expanded knowledge to retrievers using pre-training strategies tailored for passage retrieval. These strategies include contrastive learning and bottlenecked query generation. Furthermore, we incorporate a curriculum learning strategy to reduce the reliance on LLM inferences. Experimental results demonstrate that pre-training with LLM-based document expansion significantly boosts the retrieval performance on large-scale web-search tasks. Our work shows strong zero-shot and out-of-domain retrieval abilities, making it more widely applicable for retrieval when initializing with no human-labeled data.
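The abstract names contrastive learning over passages and LLM-generated queries but does not give the exact objective. As a rough illustration only, the sketch below computes a generic in-batch InfoNCE loss, where each passage embedding is paired with the embedding of a query generated for it and every other passage in the batch serves as a negative. The function name `in_batch_contrastive_loss`, the `temperature` parameter, and the toy vectors are assumptions for illustration, not the authors' implementation.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_contrastive_loss(passage_vecs, query_vecs, temperature=1.0):
    """Mean InfoNCE loss over a batch (hypothetical helper).

    query_vecs[i] is assumed to embed a query generated for the passage
    embedded as passage_vecs[i]; all other passages act as negatives.
    """
    if len(passage_vecs) != len(query_vecs) or not passage_vecs:
        raise ValueError("need a non-empty batch of aligned passage/query pairs")
    losses = []
    for i, q in enumerate(query_vecs):
        # Similarity of this query to every passage in the batch.
        logits = [dot(q, p) / temperature for p in passage_vecs]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of the matching passage.
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings (made up): two passages and one generated query for each.
passages = [[1.0, 0.0], [0.0, 1.0]]
queries = [[0.9, 0.1], [0.1, 0.9]]

aligned = in_batch_contrastive_loss(passages, queries)
mismatched = in_batch_contrastive_loss(passages, queries[::-1])
print(aligned < mismatched)
```

In this toy setup the loss is lower when each generated query sits next to its own passage than when the pairs are swapped, which is the signal such an objective would use to pull a passage toward its expanded queries.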
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies
Methods: Contrastive Learning
