Retrieval Oriented Masking Pre-training Language Model for Dense Passage Retrieval
Dingkun Long, Yanzhao Zhang, Guangwei Xu, Pengjun Xie

TL;DR
This paper introduces a retrieval-oriented masking strategy for pre-training language models, emphasizing important tokens to improve dense passage retrieval performance without altering the original model architecture.
Contribution
It proposes a novel masking method that prioritizes important tokens during pre-training, enhancing the model's ability to capture key information for retrieval tasks.
Findings
ROM improves retrieval benchmark performance
Prioritized masking captures essential term importance
Enhanced language models facilitate better passage retrieval
Abstract
Pre-trained language models (PTMs) have been shown to yield powerful text representations for the dense passage retrieval task. Masked Language Modeling (MLM) is a major sub-task of the pre-training process. However, we found that the conventional random masking strategy tends to select a large number of tokens that have limited effect on the passage retrieval task (e.g., stop-words and punctuation). Noticing that term importance weights can provide valuable information for passage retrieval, we propose an alternative retrieval-oriented masking (ROM) strategy in which more important tokens have a higher probability of being masked out, so that this straightforward yet essential information is captured during language model pre-training. Notably, the proposed token masking method does not change the architecture or learning objective of the original PTM. Our…
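The masking idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `rom_mask`, the hand-assigned weights, the fixed mask budget, and the sampling scheme (weighted sampling without replacement via random keys) are all assumptions made for illustration. The abstract does not say how term importance weights are computed, so the sketch simply takes them as input.

```python
import random


def rom_mask(tokens, weights, mask_rate=0.15, mask_token="[MASK]", seed=None):
    """Mask tokens with probability increasing in their importance weight.

    Hypothetical sketch: `weights` are assumed to be non-negative term
    importance scores supplied by the caller, one per token.
    """
    if len(tokens) != len(weights):
        raise ValueError("tokens and weights must have the same length")

    rng = random.Random(seed)
    # Assumption: a fixed budget of masked positions, as in standard MLM.
    n_mask = max(1, round(mask_rate * len(tokens)))

    # Weighted sampling without replacement: each token with positive
    # weight w gets the key u ** (1 / w) with u uniform in [0, 1), and
    # the positions with the largest keys are selected. Tokens with
    # weight 0 get no key and are therefore never masked.
    keys = []
    for i, w in enumerate(weights):
        if w > 0:
            keys.append((rng.random() ** (1.0 / w), i))
    keys.sort(reverse=True)
    chosen = {i for _, i in keys[:n_mask]}

    masked = [mask_token if i in chosen else t for i, t in enumerate(tokens)]
    return masked, sorted(chosen)


if __name__ == "__main__":
    tokens = ["the", "capital", "of", "france", "is", "paris", "."]
    # Hand-assigned illustrative weights: stop-words and punctuation get 0.
    weights = [0.0, 2.0, 0.0, 3.0, 0.0, 3.0, 0.0]
    masked, chosen = rom_mask(tokens, weights, mask_rate=0.3, seed=0)
    print(masked, chosen)
```

In this sketch, stop-words and punctuation are given weight 0 and so are never selected, whereas uniform random masking would pick them as often as any content word. The masked sequence can then be fed to an unchanged MLM objective, consistent with the abstract's statement that the architecture and learning objective stay the same.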
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
