Gradient Sparsification For Masked Fine-Tuning of Transformers
James O'Neill, Sourav Dutta

TL;DR
This paper introduces GradDrop, a stochastic gradient masking method that improves fine-tuning of pretrained language models by regularizing training and enhancing performance, especially on under-resourced languages.
Contribution
The paper proposes GradDrop, a novel gradient sparsification technique that outperforms standard fine-tuning and gradual unfreezing methods in multilingual language model adaptation.
Findings
GradDrop outperforms standard fine-tuning on the XGLUE benchmark.
GradDrop enhances performance on under-resourced languages.
GradDrop acts as a regularizer, improving generalization.
Abstract
Fine-tuning pretrained self-supervised language models is widely adopted for transfer learning to downstream tasks. Fine-tuning can be achieved by freezing gradients of the pretrained network and only updating gradients of a newly added classification layer, or by performing gradient updates on all parameters. Gradual unfreezing makes a trade-off between the two by gradually unfreezing gradients of whole layers during training. This has been an effective strategy for trading off storage and training speed against generalization performance. However, it is not clear whether gradually unfreezing layers throughout training is optimal, compared to sparse variants of gradual unfreezing which may improve fine-tuning performance. In this paper, we propose to stochastically mask gradients to regularize pretrained language models for improving overall fine-tuned performance. We introduce…
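The contrast the abstract draws between gradual unfreezing and stochastic gradient masking can be illustrated with a small sketch. This is not the authors' GradDrop implementation, since the abstract is truncated before the method is described. It assumes a toy model stored as per-layer lists of floats, plain SGD, a layer-level Bernoulli mask, and a fixed unfreezing schedule. All function names and parameters (`gradual_unfreeze_mask`, `stochastic_layer_mask`, `masked_sgd_step`, `steps_per_layer`, `keep_prob`) are hypothetical.

```python
import random


def gradual_unfreeze_mask(num_layers, step, steps_per_layer):
    """Deterministic layer mask: unfreeze one more layer, top first,
    every `steps_per_layer` steps (an assumed schedule)."""
    n_unfrozen = min(num_layers, 1 + step // steps_per_layer)
    # Layers are indexed from bottom (0) to top (num_layers - 1).
    return [1.0 if i >= num_layers - n_unfrozen else 0.0
            for i in range(num_layers)]


def stochastic_layer_mask(num_layers, keep_prob, rng):
    """Random layer mask: each layer's gradient is kept independently
    with probability `keep_prob` (an assumed Bernoulli scheme)."""
    return [1.0 if rng.random() < keep_prob else 0.0
            for _ in range(num_layers)]


def masked_sgd_step(params, grads, mask, lr):
    """Plain SGD update applied only to layers whose mask entry is 1."""
    return [
        [p - lr * m * g for p, g in zip(layer_p, layer_g)]
        for layer_p, layer_g, m in zip(params, grads, mask)
    ]


# Toy three-layer "model" with two parameters per layer.
rng = random.Random(0)
params = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
grads = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]

# Gradual unfreezing at step 0: only the top layer is updated.
mask = gradual_unfreeze_mask(num_layers=3, step=0, steps_per_layer=10)
updated = masked_sgd_step(params, grads, mask, lr=0.1)

# Stochastic masking: a fresh random subset of layers is updated each step.
random_mask = stochastic_layer_mask(num_layers=3, keep_prob=0.5, rng=rng)
randomly_updated = masked_sgd_step(params, grads, random_mask, lr=0.1)
```

With the deterministic schedule, lower layers stay frozen until their turn arrives. With the random mask, every layer has some chance of being updated at every step, which is the sparse alternative to gradual unfreezing that the abstract raises. Whether the paper masks at the layer level, the parameter level, or both is not stated in the text above.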
Taxonomy
Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Speech Recognition and Synthesis
Methods: Gradient Sign Dropout · Gradient Sparsification · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
