Retrieval Oriented Masking Pre-training Language Model for Dense Passage Retrieval

Dingkun Long; Yanzhao Zhang; Guangwei Xu; Pengjun Xie

arXiv:2210.15133 · cs.CL · October 28, 2022 · 1 cites

TL;DR

This paper introduces a retrieval-oriented masking strategy for pre-training language models, emphasizing important tokens to improve dense passage retrieval performance without altering the original model architecture.

Contribution

It proposes a novel masking method that prioritizes important tokens during pre-training, enhancing the model's ability to capture key information for retrieval tasks.

Findings

01. ROM improves retrieval benchmark performance
02. Prioritized masking captures essential term importance
03. Enhanced language models facilitate better passage retrieval

Abstract

Pre-trained language models (PTMs) have been shown to yield powerful text representations for the dense passage retrieval task. Masked Language Modeling (MLM) is a major sub-task of the pre-training process. However, we found that the conventional random masking strategy tends to select a large number of tokens that have limited effect on the passage retrieval task (e.g., stop-words and punctuation). Noting that term importance weights can provide valuable information for passage retrieval, we propose an alternative retrieval oriented masking (ROM) strategy, in which more important tokens have a higher probability of being masked out, so that pre-training captures this straightforward yet essential information. Notably, the proposed token masking method does not change the architecture or learning objective of the original PTM. Our…
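
To make the idea in the abstract concrete, here is a minimal Python sketch of importance-weighted mask sampling. It is an illustration under stated assumptions, not the authors' implementation: the function name `rom_style_mask`, the `term_weight` dictionary, the `floor` value, and the 15% mask rate (the rate conventionally used for BERT-style MLM) are all hypothetical choices. The excerpt above does not say how ROM computes term weights or how it combines them with random masking.

```python
import random


def rom_style_mask(tokens, term_weight, mask_rate=0.15, floor=0.05,
                   seed=0, mask_token="[MASK]"):
    """Mask tokens with probability tied to a per-term importance weight.

    Hypothetical sketch: `term_weight` maps a token to a non-negative
    importance score; tokens missing from it (e.g. stop-words) fall back
    to `floor`, so they can still be masked, just rarely.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(mask_rate * len(tokens)))

    # One sampling weight per position, never below the floor.
    weights = [max(term_weight.get(tok, 0.0), floor) for tok in tokens]

    # Weighted sampling without replacement (Efraimidis-Spirakis):
    # draw key = u ** (1 / w) per position and keep the largest keys.
    keys = [(rng.random() ** (1.0 / w), i) for i, w in enumerate(weights)]
    chosen = sorted(i for _, i in sorted(keys, reverse=True)[:n_mask])

    masked = [mask_token if i in chosen else tok
              for i, tok in enumerate(tokens)]
    return masked, chosen


if __name__ == "__main__":
    tokens = ["the", "capital", "of", "france", "is", "paris", "."]
    # Made-up weights for illustration only.
    weights = {"capital": 1.0, "france": 2.0, "paris": 2.0}
    print(rom_style_mask(tokens, weights, seed=1))
```

Only the choice of positions changes here. The masked sequence can be fed to an unmodified MLM objective, which matches the abstract's claim that the architecture and learning objective stay the same.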

Code & Models

Repositories

alibaba-nlp/multi-cpr
pytorch

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications