Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval

Luyu Gao; Jamie Callan

arXiv:2108.05540 · cs.IR · August 13, 2021 · 54 cites

TL;DR

This paper introduces coCondenser, an unsupervised pre-training method that enhances dense passage retrieval by improving robustness to data noise and reducing the need for large batches and extensive data engineering.

Contribution

It proposes coCondenser, a novel unsupervised pre-training approach that improves dense retriever training stability and performance without heavy engineering or large batch requirements.

Findings

01 coCondenser achieves comparable results to state-of-the-art systems
02 It reduces reliance on data augmentation and filtering
03 It enables effective training with small batches

Abstract

Recent research demonstrates the effectiveness of using fine-tuned language models (LM) for dense retrieval. However, dense retrievers are hard to train, typically requiring heavily engineered fine-tuning pipelines to realize their full potential. In this paper, we identify and address two underlying problems of dense retrievers: i) fragility to training data noise and ii) requiring large batches to robustly learn the embedding space. We use the recently proposed Condenser pre-training architecture, which learns to condense information into the dense vector through LM pre-training. On top of it, we propose coCondenser, which adds an unsupervised corpus-level contrastive loss to warm up the passage embedding space. Retrieval experiments on MS-MARCO, Natural Questions, and Trivia QA datasets show that coCondenser removes the need for heavy data engineering such as augmentation, synthesis,…
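
The abstract does not spell out the corpus-level contrastive loss, so the following is only a minimal sketch of one plausible form, not the authors' implementation. It assumes an InfoNCE-style objective in which two text spans sampled from the same document act as a positive pair and every other span in the batch acts as a negative. The function name `span_contrastive_loss`, the paired ordering of the input list, and the use of a plain dot product as the similarity are all illustrative assumptions; the paper and the luyug/Condenser repository are the reference for the exact formulation.

```python
import math


def span_contrastive_loss(embeddings):
    """Average InfoNCE-style loss over a batch of span embeddings.

    Assumed layout (hypothetical, for illustration): positions 2*i and
    2*i + 1 hold the two spans sampled from document i.
    """
    n = len(embeddings)
    if n < 4 or n % 2 != 0:
        raise ValueError("need an even number of spans from at least two documents")

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    total = 0.0
    for i in range(n):
        pos = i ^ 1  # index of the partner span from the same document
        # Similarity of span i to every other span in the batch.
        scores = [dot(embeddings[i], embeddings[k]) for k in range(n) if k != i]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(scores)
        log_denom = m + math.log(sum(math.exp(s - m) for s in scores))
        # -log( exp(positive score) / sum of exp(all other scores) )
        total += log_denom - dot(embeddings[i], embeddings[pos])
    return total / n


if __name__ == "__main__":
    # Two documents, two spans each; spans of a document point the same way.
    batch = [[5.0, 0.0], [5.0, 0.0], [0.0, 5.0], [0.0, 5.0]]
    print(span_contrastive_loss(batch))
```

Under these assumptions the loss is small when spans from the same document are embedded close together and far from spans of other documents, which is one way to read "warming up" the passage embedding space before supervised fine-tuning.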

Code & Models

Repositories

luyug/Condenser
PyTorch · Official

Models

sentence-transformers/msmarco-bert-co-condensor
model · 234 dl · ♡ 4

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications