Gradient Sparsification For Masked Fine-Tuning of Transformers

James O' Neill; Sourav Dutta

arXiv:2307.10098·cs.CL·July 20, 2023

Open Access

TL;DR

This paper introduces GradDrop, a stochastic gradient masking method that improves fine-tuning of pretrained language models by regularizing training and enhancing performance, especially on under-resourced languages.

Contribution

The paper proposes GradDrop, a novel gradient sparsification technique that outperforms standard fine-tuning and gradual unfreezing methods in multilingual language model adaptation.

Findings

01 GradDrop outperforms standard fine-tuning on the XGLUE benchmark.

02 GradDrop improves performance on under-resourced languages.

03 GradDrop acts as a regularizer, improving generalization.

Abstract

Fine-tuning pretrained self-supervised language models is widely adopted for transfer learning to downstream tasks. Fine-tuning can be achieved by freezing gradients of the pretrained network and only updating gradients of a newly added classification layer, or by performing gradient updates on all parameters. Gradual unfreezing strikes a balance between the two by gradually unfreezing gradients of whole layers during training. This has been an effective strategy for trading off storage and training speed against generalization performance. However, it is not clear whether gradually unfreezing layers throughout training is optimal, compared to sparse variants of gradual unfreezing, which may improve fine-tuning performance. In this paper, we propose to stochastically mask gradients to regularize pretrained language models for improving overall fine-tuned performance. We introduce…
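
The idea of stochastically masking gradients can be illustrated with a minimal sketch. The abstract is truncated here, so the details are assumptions rather than the authors' method: the sketch zeroes each gradient entry independently with a fixed probability and applies no rescaling. The names `mask_gradients`, `sgd_step`, and `drop_rate` are hypothetical and chosen for illustration only.

```python
import random


def mask_gradients(grads, drop_rate, rng):
    """Zero each gradient entry independently with probability drop_rate.

    Assumption: element-wise Bernoulli masking with no rescaling.
    The variants proposed in the paper may differ from this.
    """
    if not 0.0 <= drop_rate <= 1.0:
        raise ValueError("drop_rate must be in [0, 1]")
    # rng.random() lies in [0, 1), so drop_rate=0 keeps every entry
    # and drop_rate=1 zeroes every entry.
    return [g if rng.random() >= drop_rate else 0.0 for g in grads]


def sgd_step(params, grads, lr):
    """Plain SGD update; a parameter with a masked gradient is unchanged."""
    return [p - lr * g for p, g in zip(params, grads)]


rng = random.Random(0)
params = [0.5, -1.0, 2.0, 0.25]
grads = [0.1, -0.2, 0.3, -0.4]  # hypothetical gradients from one backward pass

masked = mask_gradients(grads, drop_rate=0.5, rng=rng)
new_params = sgd_step(params, masked, lr=0.1)
```

Under these assumptions, the two baselines in the abstract correspond to the extremes: a drop rate of 0 updates every pretrained parameter, while a drop rate of 1 on the pretrained weights leaves them frozen, as when only a new classification layer is trained.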

Taxonomy

Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Speech Recognition and Synthesis

Methods: Gradient Sign Dropout · Gradient Sparsification · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings