Revisiting Few-sample BERT Fine-tuning
Tianyi Zhang, Felix Wu, Arzoo Katiyar, Kilian Q. Weinberger, Yoav Artzi

TL;DR
This paper investigates the causes of instability in few-sample BERT fine-tuning, identifies the key factors behind it, proposes alternative practices that improve stability, and re-evaluates existing methods in this context.
Contribution
It systematically analyzes the causes of instability in few-sample BERT fine-tuning and offers practical remedies that make the process more robust.
Findings
Biased gradient estimation in the commonly used optimizer is a key source of instability
Alternative practices significantly improve fine-tuning stability
Effectiveness of recent methods diminishes with improved procedures
Abstract
This paper is a study of fine-tuning of BERT contextual representations, with focus on commonly observed instabilities in few-sample scenarios. We identify several factors that cause this instability: the common use of a non-standard optimization method with biased gradient estimation; the limited applicability of significant parts of the BERT network for down-stream tasks; and the prevalent practice of using a pre-determined, and small number of training iterations. We empirically test the impact of these factors, and identify alternative practices that resolve the commonly observed instability of the process. In light of these observations, we re-visit recently proposed methods to improve few-sample fine-tuning with BERT and re-evaluate their effectiveness. Generally, we observe the impact of these methods diminishes significantly with our modified process.
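The "biased gradient estimation" factor can be illustrated with a small sketch. It assumes the non-standard optimizer is an Adam variant that skips Adam's bias-correction step, which is how the paper's full text describes it. The function name `adam_update_sizes`, the constant toy gradients, and the hyperparameter values are hypothetical choices made for illustration, not settings taken from the paper.

```python
import math


def adam_update_sizes(grads, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-6,
                      bias_correction=True):
    """Return the Adam update applied to a scalar parameter at each step.

    Illustrative only: a scalar re-implementation of the Adam update rule,
    with a switch that disables bias correction (an assumption about what
    the non-standard optimizer omits).
    """
    m = 0.0  # running average of gradients (first moment)
    v = 0.0  # running average of squared gradients (second moment)
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        if bias_correction:
            # Standard Adam: rescale the zero-initialized moment estimates.
            m_hat = m / (1 - beta1 ** t)
            v_hat = v / (1 - beta2 ** t)
        else:
            # Variant without correction: use the raw, biased estimates.
            m_hat, v_hat = m, v
        updates.append(lr * m_hat / (math.sqrt(v_hat) + eps))
    return updates


# Toy input: the same gradient at every step.
grads = [1.0] * 5
corrected = adam_update_sizes(grads, bias_correction=True)
uncorrected = adam_update_sizes(grads, bias_correction=False)

for step, (c, u) in enumerate(zip(corrected, uncorrected), start=1):
    print(f"step {step}: corrected={c:.6f}  uncorrected={u:.6f}")
```

In this toy setting the uncorrected variant takes a noticeably larger first step than standard Adam, because the second-moment estimate starts further below its true value than the first-moment estimate does. With only a small number of training iterations, those early steps make up a large share of training.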
Taxonomy
Topics: Advanced Neural Network Applications · Software Testing and Debugging Techniques · Machine Learning in Materials Science
Methods: Linear Layer · Weight Decay · Softmax · Adam · Multi-Head Attention · Dropout · Attention Dropout · Linear Warmup With Linear Decay · Dense Connections
