Learning to Win Lottery Tickets in BERT Transfer via Task-agnostic Mask Training

Yuanxin Liu; Fandong Meng; Zheng Lin; Peng Fu; Yanan Cao; Weiping Wang; Jie Zhou

arXiv:2204.11218 · cs.CL · May 31, 2022

TL;DR

This paper introduces a task-agnostic mask training method to identify BERT subnetworks that better preserve pre-training performance and transferability, outperforming magnitude pruning in downstream tasks and data-scarce scenarios.

Contribution

It proposes a novel mask training approach that directly optimizes subnetworks for pre-training objectives, enhancing transferability and efficiency over traditional pruning methods.

Findings

01

Mask training improves downstream performance of BERT subnetworks.

02

The method is more efficient and effective in data-scarce fine-tuning scenarios.

03

Subnetworks found via mask training outperform those found by magnitude pruning.
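For readers unfamiliar with the baseline named in finding 03, magnitude pruning can be sketched as follows. This is a minimal illustration on a flat list of weights, not the paper's code: the function name `magnitude_mask`, the toy weight values, and the choice to prune a fixed fraction in a single pass are all assumptions made for the example.

```python
def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that drops the smallest-magnitude weights.

    `sparsity` is the fraction of weights to remove (hypothetical parameter).
    """
    n_prune = int(len(weights) * sparsity)
    # Indices sorted by ascending absolute value; the first n_prune are dropped.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1] * len(weights)
    for i in order[:n_prune]:
        mask[i] = 0
    return mask


# Toy weights, chosen only for illustration.
weights = [0.9, -0.05, 0.3, -0.7, 0.01, 0.2]
mask = magnitude_mask(weights, sparsity=0.5)
pruned = [w * m for w, m in zip(weights, mask)]
```

The mask here depends only on the weight values, with no reference to any training objective, which is the contrast with mask training that the findings draw.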

Abstract

Recent studies on the lottery ticket hypothesis (LTH) show that pre-trained language models (PLMs) like BERT contain matching subnetworks that have similar transfer learning performance as the original PLM. These subnetworks are found using magnitude-based pruning. In this paper, we find that the BERT subnetworks have even more potential than these studies have shown. Firstly, we discover that the success of magnitude pruning can be attributed to the preserved pre-training performance, which correlates with the downstream transferability. Inspired by this, we propose to directly optimize the subnetwork structure towards the pre-training objectives, which can better preserve the pre-training performance. Specifically, we train binary masks over model weights on the pre-training tasks, with the aim of preserving the universal transferability of the subnetwork, which is agnostic to any…
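The idea of training binary masks over frozen weights can be illustrated with a small sketch. Everything specific here is an assumption rather than the paper's implementation: the real-valued `scores`, the fixed `threshold`, the straight-through gradient (treating the binarization step as the identity in the backward pass), and the toy squared-error objective that stands in for the pre-training loss.

```python
def binarize(scores, threshold):
    # Hard 0/1 mask from real-valued scores (assumed parameterization).
    return [1.0 if s >= threshold else 0.0 for s in scores]


def forward(weights, mask, x):
    # Linear model whose weights are gated elementwise by the mask.
    return sum(w * m * xi for w, m, xi in zip(weights, mask, x))


def train_mask(weights, data, steps=20, lr=0.2, threshold=0.5):
    """Learn a binary mask; `weights` are never updated."""
    scores = [1.0] * len(weights)  # start with every weight kept
    for _ in range(steps):
        mask = binarize(scores, threshold)
        grads = [0.0] * len(weights)
        for x, y in data:
            err = forward(weights, mask, x) - y
            for j in range(len(weights)):
                # d(mean squared error)/d(mask_j); the straight-through
                # estimator passes this gradient unchanged to scores[j].
                grads[j] += 2.0 * err * weights[j] * x[j] / len(data)
        scores = [s - lr * g for s, g in zip(scores, grads)]
    return binarize(scores, threshold)


# Toy task: the target depends only on features 0 and 2.
weights = [1.0, 1.0, 1.0, 1.0]  # frozen stand-in for pre-trained weights
data = [
    ([1.0, 0.0, 0.0, 0.0], 1.0),
    ([0.0, 1.0, 0.0, 0.0], 0.0),
    ([0.0, 0.0, 1.0, 0.0], 1.0),
    ([0.0, 0.0, 0.0, 1.0], 0.0),
]
learned = train_mask(weights, data)
```

In this toy setting all four weights have the same magnitude, so a magnitude criterion cannot tell them apart, while the learned mask keeps only the two weights the objective needs.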

Code & Models

Repositories

llyx97/TAMT
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Attention Is All You Need · Pruning · Linear Layer · Adam · Multi-Head Attention · Residual Connection · Layer Normalization · Dense Connections · Attention Dropout