On the Influence of Masking Policies in Intermediate Pre-training
Qinyuan Ye, Belinda Z. Li, Sinong Wang, Benjamin Bolte, Hao Ma, Wen-tau Yih, Xiang Ren, Madian Khabsa

TL;DR
This paper investigates how different masking policies during intermediate pre-training affect NLP model performance, introducing automated methods to optimize these policies and demonstrating their transferability across tasks.
Contribution
The paper provides a large-scale empirical analysis of masking policies for intermediate pre-training, proposes automated approaches (direct supervision and meta-learning) for discovering masking policies, and examines how well the resulting policies transfer across tasks.
Findings
Learned masking policies outperform heuristic ones on TriviaQA.
Intermediate pre-training effectiveness depends on corpus and output format.
Masking policies can transfer positively across related tasks.
Abstract
Current NLP models are predominantly trained through a two-stage "pre-train then fine-tune" pipeline. Prior work has shown that inserting an intermediate pre-training stage, using heuristic masking policies for masked language modeling (MLM), can significantly improve final performance. However, it is still unclear (1) in what cases such intermediate pre-training is helpful, (2) whether hand-crafted heuristic objectives are optimal for a given task, and (3) whether a masking policy designed for one task is generalizable beyond that task. In this paper, we perform a large-scale empirical study to investigate the effect of various masking policies in intermediate pre-training with nine selected tasks across three categories. Crucially, we introduce methods to automate the discovery of optimal masking policies via direct supervision or meta-learning. We conclude that the success of…
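To make the term "masking policy" concrete, here is a minimal Python sketch. It is an illustration only, not the authors' implementation: a policy is modeled as a function that decides, per token, whether to mask it. The names `random_policy`, `capitalized_policy`, and `apply_policy`, the `<mask>` string, and the capitalization heuristic (a crude stand-in for entity-style heuristics) are all assumptions made for this example.

```python
import random
from typing import Callable, List, Tuple

MASK = "<mask>"  # hypothetical mask symbol; real tokenizers define their own

# A masking policy maps a token sequence to one mask/keep decision per token.
MaskingPolicy = Callable[[List[str]], List[bool]]


def random_policy(rate: float = 0.15, seed: int = 0) -> MaskingPolicy:
    """Baseline: mask each token independently with probability `rate`."""
    rng = random.Random(seed)

    def policy(tokens: List[str]) -> List[bool]:
        return [rng.random() < rate for _ in tokens]

    return policy


def capitalized_policy(tokens: List[str]) -> List[bool]:
    """Toy heuristic: mask capitalized tokens, skipping the first token."""
    return [i > 0 and tok[:1].isupper() for i, tok in enumerate(tokens)]


def apply_policy(
    tokens: List[str], policy: MaskingPolicy
) -> Tuple[List[str], List[str]]:
    """Return the masked input and the list of tokens to be predicted."""
    decisions = policy(tokens)
    masked = [MASK if d else t for t, d in zip(tokens, decisions)]
    targets = [t for t, d in zip(tokens, decisions) if d]
    return masked, targets


tokens = "The capital of France is Paris".split()
masked, targets = apply_policy(tokens, capitalized_policy)
print(masked)   # ['The', 'capital', 'of', '<mask>', 'is', '<mask>']
print(targets)  # ['France', 'Paris']
```

Under this framing, a learned policy of the kind the abstract describes would presumably fill the same role as `capitalized_policy`, with its mask decisions produced by a trained model rather than a fixed rule.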
