Domain-matched Pre-training Tasks for Dense Retrieval

Barlas Oğuz; Kushal Lakhotia; Anchit Gupta; Patrick Lewis; Vladimir Karpukhin; Aleksandra Piktus; Xilun Chen; Sebastian Riedel; Wen-tau Yih; Sonal Gupta; Yashar Mehdad

arXiv:2107.13602·cs.CL·July 30, 2021·1 cites

TL;DR

This paper shows that pre-training large bi-encoder models on domain-matched data (65 million synthetically generated questions and 200 million Reddit post-comment pairs) substantially improves dense retrieval over supervised baselines, in an area where additional pre-training had so far failed to produce convincing results.

Contribution

It introduces a pre-training setup for dense retrieval in which large bi-encoder models are pre-trained on large datasets matched to the target domain, yielding substantial gains over supervised baselines.

Findings

01

Pre-training on synthetic questions improves retrieval accuracy.

02

Using Reddit post-comment pairs enhances dialogue retrieval performance.

03

The approach outperforms supervised baselines on multiple benchmarks.

Abstract

Pre-training on larger datasets with ever-increasing model size is now a proven recipe for increased performance across almost all NLP tasks. A notable exception is information retrieval, where additional pre-training has so far failed to produce convincing results. We show that, with the right pre-training setup, this barrier can be overcome. We demonstrate this by pre-training large bi-encoder models on 1) a recently released set of 65 million synthetically generated questions, and 2) 200 million post-comment pairs from a preexisting dataset of Reddit conversations made available by pushshift.io. We evaluate on a set of information retrieval and dialogue retrieval benchmarks, showing substantial improvements over supervised baselines.
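
The abstract does not spell out the training objective, so the following is only a rough illustration of how a bi-encoder is commonly trained for retrieval, not the authors' implementation. It assumes a dot-product similarity and an in-batch-negatives cross-entropy loss, where each query's paired passage is the positive and the other passages in the batch serve as negatives. The function names (`in_batch_negative_loss`, `rank_passages`) are hypothetical, and the toy vectors stand in for the outputs of a query encoder and a passage encoder.

```python
import math
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot-product similarity between two embeddings (an assumed choice)."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_negative_loss(queries: List[Vector], passages: List[Vector]) -> float:
    """Mean cross-entropy over a batch of (query, passage) pairs.

    passages[i] is treated as the positive for queries[i]; every other
    passage in the batch is treated as a negative.
    """
    if not queries or len(queries) != len(passages):
        raise ValueError("queries and passages must be non-empty and equal in length")
    total = 0.0
    for i, q in enumerate(queries):
        scores = [dot(q, p) for p in passages]
        # Numerically stable log-sum-exp over all passages in the batch.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        # Negative log-probability assigned to the positive passage.
        total += log_z - scores[i]
    return total / len(queries)


def rank_passages(query: Vector, passages: List[Vector]) -> List[int]:
    """Return passage indices sorted from highest to lowest similarity."""
    scores = [dot(query, p) for p in passages]
    return sorted(range(len(passages)), key=lambda j: scores[j], reverse=True)


# Toy embeddings standing in for encoder outputs (hypothetical values).
toy_queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_passages = [[2.0, 0.0], [0.0, 2.0]]   # each query matches its own passage
swapped_passages = [[0.0, 2.0], [2.0, 0.0]]   # positives point the wrong way

loss_aligned = in_batch_negative_loss(toy_queries, aligned_passages)
loss_swapped = in_batch_negative_loss(toy_queries, swapped_passages)
top_ranked = rank_passages(toy_queries[0], aligned_passages)
```

In this sketch the loss is lower when each query embedding is closer to its paired passage than to the other passages in the batch, which is the property a retrieval pre-training task of this kind would be trying to induce. The same scoring function is reused at retrieval time to rank candidate passages.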

Code & Models

Repositories

facebookresearch/dpr-scale
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications