Using Selective Masking as a Bridge between Pre-training and Fine-tuning

Tanish Lad; Himanshu Maheshwari; Shreyas Kottukkal; Radhika Mamidi

arXiv:2211.13815 · cs.CL · November 28, 2022 · 1 citation

Open Access

TL;DR

This paper introduces a task-specific masking step that is applied to a pre-trained BERT model before supervised fine-tuning. Each word's masking probability is set by its importance to the downstream task, with the aim of improving downstream performance over random masking.

Contribution

It proposes a selective masking method that adapts an already pre-trained language model to a specific task before fine-tuning, using a small task-specific word list to score word importance.

Findings

1. Selective masking outperforms random masking in downstream tasks
2. Task-specific word importance improves model fine-tuning
3. Method enhances BERT's adaptability to various NLP tasks

Abstract

Pre-training a language model and then fine-tuning it for downstream tasks has demonstrated state-of-the-art results for various NLP tasks. Pre-training is usually independent of the downstream task, and previous works have shown that this pre-training alone might not be sufficient to capture the task-specific nuances. We propose a way to tailor a pre-trained BERT model for the downstream task via task-specific masking before the standard supervised fine-tuning. For this, a word list is first collected specific to the task. For example, if the task is sentiment classification, we collect a small sample of words representing both positive and negative sentiments. Next, a word's importance for the task, called the word's task score, is measured using the word list. Each word is then assigned a probability of masking based on its task score. We experiment with different masking functions…
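
The pipeline described in the abstract (word list, task score, per-word masking probability) can be illustrated with a minimal Python sketch. The abstract is truncated before it names the masking functions, so everything specific here is an assumption for illustration only: the binary `task_score`, the linear `mask_probability`, and the `base` and `peak` values are hypothetical and are not taken from the paper.

```python
import random


def task_score(word, task_words):
    """Hypothetical task score: 1.0 if the word is in the task word list, else 0.0.

    The paper measures importance using the word list; this binary
    membership test is only a stand-in for that measurement.
    """
    return 1.0 if word.lower() in task_words else 0.0


def mask_probability(score, base=0.15, peak=0.5):
    """Assumed linear masking function: maps a score in [0, 1] to [base, peak]."""
    return base + (peak - base) * score


def selective_mask(tokens, task_words, rng, prob_fn=mask_probability,
                   mask_token="[MASK]"):
    """Mask each token independently with a probability derived from its task score."""
    masked = []
    for tok in tokens:
        p = prob_fn(task_score(tok, task_words))
        masked.append(mask_token if rng.random() < p else tok)
    return masked


if __name__ == "__main__":
    # Small illustrative sentiment word list (hypothetical).
    sentiment_words = {"great", "terrible", "boring", "wonderful"}
    sentence = "the acting was great but the plot was boring".split()
    print(selective_mask(sentence, sentiment_words, random.Random(0)))
```

In this sketch, words in the list are masked with probability `peak` and all other words with probability `base`; passing a different `prob_fn` is one way to try alternative masking functions.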

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Weight Decay · Dense Connections · Residual Connection · Layer Normalization · WordPiece · Adam · Linear Warmup With Linear Decay