Domain-matched Pre-training Tasks for Dense Retrieval
Barlas Oğuz, Kushal Lakhotia, Anchit Gupta, Patrick Lewis, Vladimir Karpukhin, Aleksandra Piktus, Xilun Chen, Sebastian Riedel, Wen-tau Yih, Sonal Gupta, Yashar Mehdad

TL;DR
This paper shows that pre-training large bi-encoder models on domain-matched tasks (65 million synthetically generated questions and 200 million Reddit post-comment pairs) substantially improves dense retrieval, an area where additional pre-training had so far failed to produce convincing results.
Contribution
The paper introduces a pre-training setup tailored to dense retrieval: large bi-encoder models are pre-trained on large datasets matched to the target domain, yielding substantial improvements over supervised baselines.
Findings
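Pre-training on 65 million synthetically generated questions improves retrieval accuracy.
Pre-training on 200 million Reddit post-comment pairs improves dialogue retrieval performance.
The approach shows substantial improvements over supervised baselines on a set of information retrieval and dialogue retrieval benchmarks.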
Abstract
Pre-training on larger datasets with ever increasing model size is now a proven recipe for increased performance across almost all NLP tasks. A notable exception is information retrieval, where additional pre-training has so far failed to produce convincing results. We show that, with the right pre-training setup, this barrier can be overcome. We demonstrate this by pre-training large bi-encoder models on 1) a recently released set of 65 million synthetically generated questions, and 2) 200 million post-comment pairs from a preexisting dataset of Reddit conversations made available by pushshift.io. We evaluate on a set of information retrieval and dialogue retrieval benchmarks, showing substantial improvements over supervised baselines.
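The bi-encoder setup mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the training objective, so the in-batch negative contrastive loss below is an assumption (a common choice for bi-encoders), and the function names `in_batch_contrastive_loss` and `retrieve` are hypothetical. Vectors stand in for the outputs of separate query and passage encoders; each (question, passage) or (post, comment) pair in a batch is treated as a positive, and the other passages in the batch act as negatives.

```python
import math


def dot(u, v):
    """Dot-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_contrastive_loss(query_vecs, passage_vecs):
    """Mean negative log-likelihood of pairing query i with passage i.

    Assumption: every other passage in the batch serves as a negative
    for query i. The vectors are placeholders for encoder outputs.
    """
    assert len(query_vecs) == len(passage_vecs)
    total = 0.0
    for i, q in enumerate(query_vecs):
        scores = [dot(q, p) for p in passage_vecs]
        # Numerically stable log-sum-exp over all passages in the batch.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[i]
    return total / len(query_vecs)


def retrieve(query_vec, passage_vecs, k=1):
    """Return indices of the k passages with the highest dot product."""
    ranked = sorted(
        range(len(passage_vecs)),
        key=lambda j: dot(query_vec, passage_vecs[j]),
        reverse=True,
    )
    return ranked[:k]


# Toy batch: two queries, each aligned with the passage at the same index.
queries = [[5.0, 0.0], [0.0, 5.0]]
passages = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = in_batch_contrastive_loss(queries, passages)
swapped_loss = in_batch_contrastive_loss(queries, passages[::-1])
```

In this toy batch the aligned pairing gives a lower loss than the swapped pairing, which is the signal that pushes matching query and passage embeddings together during pre-training. At inference time only `retrieve` is needed, since passage vectors can be computed once and indexed ahead of time.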
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
