Revisiting Few-sample BERT Fine-tuning

Tianyi Zhang; Felix Wu; Arzoo Katiyar; Kilian Q. Weinberger; Yoav Artzi

arXiv:2006.05987 · cs.CL · March 12, 2021 · 55 cites

TL;DR

This paper investigates the causes of instability in few-sample BERT fine-tuning, identifies the key factors behind it, proposes alternative practices that improve stability, and re-evaluates existing methods in this context.

Contribution

The paper systematically analyzes the causes of instability in few-sample BERT fine-tuning and offers practical remedies that make the fine-tuning process more robust.

Findings

01

Identified biased gradient estimation as a key instability factor

02

Alternative practices significantly improve fine-tuning stability

03

Effectiveness of recent methods diminishes with improved procedures

Abstract

This paper is a study of fine-tuning of BERT contextual representations, with focus on commonly observed instabilities in few-sample scenarios. We identify several factors that cause this instability: the common use of a non-standard optimization method with biased gradient estimation; the limited applicability of significant parts of the BERT network for down-stream tasks; and the prevalent practice of using a pre-determined, and small number of training iterations. We empirically test the impact of these factors, and identify alternative practices that resolve the commonly observed instability of the process. In light of these observations, we re-visit recently proposed methods to improve few-sample fine-tuning with BERT and re-evaluate their effectiveness. Generally, we observe the impact of these methods diminishes significantly with our modified process.
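
The first factor in the abstract, an optimizer with biased gradient estimation, can be illustrated with a small sketch. The abstract does not spell out the optimizer details, so this assumes the bias in question comes from running Adam without its bias-correction terms. The function name `adam_updates` and all hyperparameter values are hypothetical choices for illustration, not taken from the paper or its repository.

```python
import math


def adam_updates(grads, lr=1.0, beta1=0.9, beta2=0.999, eps=1e-6,
                 bias_correction=True):
    """Return the Adam update for a scalar parameter at each step.

    With bias_correction=False the moment estimates stay biased toward
    their zero initialization (an assumed reading of the abstract's
    "biased gradient estimation").
    """
    m = 0.0  # running estimate of the first moment (mean of gradients)
    v = 0.0  # running estimate of the second moment (mean of squared gradients)
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        if bias_correction:
            # Standard Adam: rescale both moments to undo the zero init.
            m_hat = m / (1 - beta1 ** t)
            v_hat = v / (1 - beta2 ** t)
        else:
            # Uncorrected variant: use the raw, biased moments.
            m_hat, v_hat = m, v
        updates.append(lr * m_hat / (math.sqrt(v_hat) + eps))
    return updates


# Constant gradient of 1.0 for a handful of early steps.
corrected = adam_updates([1.0] * 5, bias_correction=True)
uncorrected = adam_updates([1.0] * 5, bias_correction=False)
print(corrected[0], uncorrected[0])
```

Under these assumed settings the corrected first update is about `lr`, while the uncorrected one is about 3.16 times larger, because the second-moment estimate starts out far smaller than the first-moment estimate. With only a small number of training iterations, such early oversized steps make up a large share of the whole run, which is one plausible way the optimizer choice and the short schedule interact.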

Code & Models

Repositories

asappresearch/revisit-bert-finetuning · PyTorch · Official

Videos

Revisiting Few-sample BERT Fine-tuning · SlidesLive

Taxonomy

Topics: Advanced Neural Network Applications · Software Testing and Debugging Techniques · Machine Learning in Materials Science

Methods: Linear Layer · Weight Decay · Softmax · Adam · Multi-Head Attention · Dropout · Attention Dropout · Linear Warmup With Linear Decay · Dense Connections