SimLM: Pre-training with Representation Bottleneck for Dense Passage Retrieval
Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, Furu Wei

TL;DR
SimLM introduces a simple pre-training method with a representation bottleneck and a replaced language modeling objective, significantly improving dense passage retrieval performance without requiring labeled data.
Contribution
It proposes a novel pre-training approach that enhances dense passage retrieval by compressing information into dense vectors using self-supervised learning.
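As a rough illustration of what retrieval over single dense vectors looks like, the sketch below ranks passages by inner product with a query vector. The toy vectors, the function names, and the choice of inner product as the similarity are assumptions made for illustration; they are not taken from the paper.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_passages(query_vec, passage_vecs):
    """Return passage indices sorted by descending similarity to the query.

    Each passage is represented by exactly one vector, which is the
    single-vector setting that a representation bottleneck targets.
    """
    scores = [dot(query_vec, p) for p in passage_vecs]
    return sorted(range(len(passage_vecs)), key=lambda i: scores[i], reverse=True)


# Hypothetical 3-dimensional embeddings; a real encoder would produce these.
passages = [
    [0.1, 0.9, 0.0],
    [0.8, 0.1, 0.1],
    [0.3, 0.3, 0.4],
]
query = [1.0, 0.0, 0.0]
ranking = rank_passages(query, passages)
```

Because every passage is stored as one vector, the index grows with the number of passages rather than with the number of tokens.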
Findings
Outperforms strong baselines on large-scale datasets
Surpasses multi-vector approaches like ColBERTv2 in effectiveness
Requires only unlabeled data, broadening applicability
Abstract
In this paper, we propose SimLM (Similarity matching with Language Model pre-training), a simple yet effective pre-training method for dense passage retrieval. It employs a simple bottleneck architecture that learns to compress the passage information into a dense vector through self-supervised pre-training. We use a replaced language modeling objective, which is inspired by ELECTRA, to improve the sample efficiency and reduce the mismatch of the input distribution between pre-training and fine-tuning. SimLM only requires access to an unlabeled corpus, and is more broadly applicable when there are no labeled data or queries. We conduct experiments on several large-scale passage retrieval datasets, and show substantial improvements over strong baselines under various settings. Remarkably, SimLM even outperforms multi-vector approaches such as ColBERTv2, which incur significantly more…
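The input-corruption step behind a replaced language modeling objective can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: in ELECTRA a small generator model samples the replacement tokens, whereas here a uniform random draw from a toy vocabulary stands in for that generator. The `replace_prob` value, the `corrupt` helper, and the choice to use the original tokens as prediction targets are likewise illustrative assumptions.

```python
import random


def corrupt(tokens, vocab, replace_prob=0.3, rng=None):
    """Replace a random subset of tokens with sampled tokens.

    Returns (corrupted, replaced), where replaced[i] is True if position i
    was resampled. The uniform draw from `vocab` is a stand-in for a
    generator model and is an assumption of this sketch.
    """
    if rng is None:
        rng = random.Random(0)
    corrupted, replaced = [], []
    for tok in tokens:
        if rng.random() < replace_prob:
            # The sampled token may coincide with the original one.
            corrupted.append(rng.choice(vocab))
            replaced.append(True)
        else:
            corrupted.append(tok)
            replaced.append(False)
    return corrupted, replaced


# Hypothetical toy data.
vocab = ["dense", "passage", "retrieval", "vector", "model", "query"]
tokens = ["dense", "passage", "retrieval", "with", "a", "bottleneck"]

corrupted, replaced = corrupt(tokens, vocab, rng=random.Random(42))
# Assumed training targets: the original token at every position.
targets = list(tokens)
```

Unlike masked language modeling, the corrupted input contains only ordinary tokens and no special mask symbol, which is one way to read the abstract's point about reducing the input mismatch between pre-training and fine-tuning.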
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Attention Dropout · Dense Connections · Adam · Dropout · Linear Warmup With Linear Decay · Weight Decay
