Using Selective Masking as a Bridge between Pre-training and Fine-tuning
Tanish Lad, Himanshu Maheshwari, Shreyas Kottukkal, Radhika Mamidi

TL;DR
This paper introduces a task-specific masking stage applied to a pre-trained BERT model before fine-tuning. Words are masked according to their importance to the downstream task, with the aim of improving downstream performance.
Contribution
It proposes a selective masking method that adapts a pre-trained BERT model to a specific downstream task before standard supervised fine-tuning.
Findings
Selective masking outperforms random masking in downstream tasks
Task-specific word importance improves model fine-tuning
Method enhances BERT's adaptability to various NLP tasks
Abstract
Pre-training a language model and then fine-tuning it for downstream tasks has demonstrated state-of-the-art results for various NLP tasks. Pre-training is usually independent of the downstream task, and previous works have shown that this pre-training alone might not be sufficient to capture the task-specific nuances. We propose a way to tailor a pre-trained BERT model for the downstream task via task-specific masking before the standard supervised fine-tuning. For this, a word list is first collected specific to the task. For example, if the task is sentiment classification, we collect a small sample of words representing both positive and negative sentiments. Next, a word's importance for the task, called the word's task score, is measured using the word list. Each word is then assigned a probability of masking based on its task score. We experiment with different masking functions…
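The pipeline described in the abstract (task score, then masking probability, then masking) can be illustrated with a minimal sketch. It is not the authors' implementation. The task scores are taken as given inputs because the abstract does not specify how they are computed, and the linear rescaling in `masking_probabilities` is only one hypothetical masking function. The names `masking_probabilities` and `selective_mask` and the parameters `p_min`, `p_max`, and `default_p` are assumptions introduced here for illustration.

```python
import random


def masking_probabilities(task_scores, p_min=0.05, p_max=0.5):
    """Linearly rescale task scores into masking probabilities.

    Assumption: a higher task score means the word matters more for the
    task and should be masked more often. The paper experiments with
    several masking functions; this linear one is only an illustration.
    """
    lo, hi = min(task_scores.values()), max(task_scores.values())
    span = hi - lo
    probs = {}
    for word, score in task_scores.items():
        # If every score is identical, fall back to the midpoint.
        frac = (score - lo) / span if span > 0 else 0.5
        probs[word] = p_min + frac * (p_max - p_min)
    return probs


def selective_mask(tokens, probs, default_p=0.15, mask_token="[MASK]", rng=None):
    """Mask each token independently with its word-specific probability.

    Tokens missing from `probs` use `default_p` (hypothetical fallback).
    Returns the masked sequence and the labels to predict, with None at
    positions that were left unmasked.
    """
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in tokens:
        p = probs.get(tok.lower(), default_p)
        if rng.random() < p:
            masked.append(mask_token)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


if __name__ == "__main__":
    # Made-up task scores for a sentiment task, for demonstration only.
    scores = {"great": 0.9, "terrible": 0.8, "movie": 0.3, "the": 0.1}
    probs = masking_probabilities(scores)
    tokens = "The movie was great".split()
    print(selective_mask(tokens, probs, rng=random.Random(0)))
```

In this sketch, sentiment-bearing words such as "great" receive a masking probability near `p_max`, while function words such as "the" stay near `p_min`, so the masked-language-modeling objective spends more of its predictions on task-relevant words.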
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Weight Decay · Dense Connections · Residual Connection · Layer Normalization · WordPiece · Adam · Linear Warmup With Linear Decay
