SimLM: Pre-training with Representation Bottleneck for Dense Passage Retrieval

Liang Wang; Nan Yang; Xiaolong Huang; Binxing Jiao; Linjun Yang; Daxin Jiang; Rangan Majumder; Furu Wei

arXiv:2207.02578 · cs.IR · May 15, 2023 · 6 cites

TL;DR

SimLM is a simple pre-training method for dense passage retrieval that combines a representation bottleneck with a replaced language modeling objective. Pre-training needs only an unlabeled corpus, and the resulting retrievers show substantial improvements over strong baselines.

Contribution

The paper proposes a self-supervised pre-training approach in which a bottleneck architecture learns to compress passage information into a dense vector, improving dense passage retrieval.

Findings

01 Outperforms strong baselines on large-scale datasets

02 Surpasses multi-vector approaches like ColBERTv2 in effectiveness

03 Requires only unlabeled data, broadening applicability

Abstract

In this paper, we propose SimLM (Similarity matching with Language Model pre-training), a simple yet effective pre-training method for dense passage retrieval. It employs a simple bottleneck architecture that learns to compress the passage information into a dense vector through self-supervised pre-training. We use a replaced language modeling objective, which is inspired by ELECTRA, to improve the sample efficiency and reduce the mismatch of the input distribution between pre-training and fine-tuning. SimLM only requires access to unlabeled corpus, and is more broadly applicable when there are no labeled data or queries. We conduct experiments on several large-scale passage retrieval datasets, and show substantial improvements over strong baselines under various settings. Remarkably, SimLM even outperforms multi-vector approaches such as ColBERTv2 which incurs significantly more…
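The replaced language modeling objective mentioned in the abstract can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `corrupt_with_replacements`, the `replace_prob` parameter, and the uniform random sampler that stands in for a learned generator are all hypothetical. It shows only the ELECTRA-style idea that the model sees an input with some tokens swapped for plausible replacements (rather than `[MASK]` symbols) and is trained to recover the original tokens.

```python
import random


def corrupt_with_replacements(tokens, vocab, replace_prob=0.3, seed=0):
    """Build a corrupted input for a replaced-language-modeling style objective.

    Assumption: a uniform sampler over `vocab` stands in for the learned
    generator network that an ELECTRA-style setup would use.

    Returns (corrupted_tokens, replaced_flags), where replaced_flags[i] is True
    when position i now holds a token different from the original.
    """
    rng = random.Random(seed)
    corrupted, replaced = [], []
    for tok in tokens:
        if rng.random() < replace_prob:
            new_tok = rng.choice(vocab)  # hypothetical generator sample
            corrupted.append(new_tok)
            replaced.append(new_tok != tok)
        else:
            corrupted.append(tok)
            replaced.append(False)
    return corrupted, replaced


if __name__ == "__main__":
    passage = ["dense", "retrieval", "maps", "text", "to", "vectors"]
    vocab = ["sparse", "images", "from", "tokens"]
    corrupted, flags = corrupt_with_replacements(passage, vocab, replace_prob=0.3)
    # The training targets are the original tokens; the corrupted sequence is the input.
    print("input  :", corrupted)
    print("targets:", passage)
    print("changed:", flags)
```

Because the corrupted input contains no artificial mask symbol, it looks closer to the ordinary text seen during fine-tuning, which is the input-distribution mismatch the abstract refers to.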

Code & Models

Repositories

microsoft/unilm (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Attention Dropout · Dense Connections · Adam · Dropout · Linear Warmup With Linear Decay · Weight Decay