TL;DR
This paper investigates why intermediate-task fine-tuning (STILTs) helps pretrained language models. It finds that a simple task can benefit diverse target tasks without requiring complex reasoning, which prompts a re-evaluation of the mechanisms behind STILTs.
Contribution
The paper shows that the benefits of intermediate fine-tuning do not depend solely on complex reasoning skills, which challenges previous assumptions about why STILTs works.
Findings
Simple real-fake discrimination tasks can improve diverse target tasks
The effectiveness of an intermediate task can be orthogonal to its complexity or reasoning requirements
The results call for rethinking the role of intermediate fine-tuning in NLP models
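The real-fake discrimination task mentioned in the findings can be pictured as a binary classification dataset: real texts get one label and machine-generated texts get the other. The sketch below is only an illustration under stated assumptions. `build_real_fake_dataset` and `toy_generator` are hypothetical names, and `toy_generator` is a trivial stand-in for sampling text from a language model such as GPT-2; the summary does not describe how the paper actually generates its fake examples.

```python
import random
from typing import Callable, List, Tuple


def build_real_fake_dataset(
    real_texts: List[str],
    generate_fake: Callable[[str], str],
    seed: int = 0,
) -> List[Tuple[str, int]]:
    """Pair each real text with a synthetic one; label real=1, fake=0."""
    examples: List[Tuple[str, int]] = []
    for text in real_texts:
        examples.append((text, 1))                 # genuine text
        examples.append((generate_fake(text), 0))  # synthetic counterpart
    # Shuffle with a fixed seed so the example order is reproducible.
    random.Random(seed).shuffle(examples)
    return examples


def toy_generator(text: str) -> str:
    """Hypothetical stand-in for a language-model sampler (assumption):
    keeps the first half of the words and reverses the second half."""
    words = text.split()
    half = len(words) // 2
    return " ".join(words[:half] + words[half:][::-1])


dataset = build_real_fake_dataset(
    ["the cat sat on the mat", "fine-tuning improves many target tasks"],
    toy_generator,
)
```

In a realistic setting, `generate_fake` would wrap an actual text generator, and the resulting `(text, label)` pairs would serve as the intermediate task.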
Abstract
Supplementary Training on Intermediate Labeled-data Tasks (STILTs) is a widely applied technique that first fine-tunes a pretrained language model on an intermediate task and then on the target task of interest. Although STILTs can further improve the performance of pretrained language models, it is still unclear why and when it works. Previous research shows that intermediate tasks involving complex inference, such as commonsense reasoning, work especially well for RoBERTa. In this paper, we discover that the improvement from an intermediate task could be orthogonal to whether it involves reasoning or other complex skills: a simple real-fake discrimination task synthesized by GPT2 can benefit diverse target tasks. We conduct extensive experiments to study the impact of different factors on STILTs. These findings suggest rethinking the role of intermediate fine-tuning in the…
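The two-stage schedule described in the abstract (fine-tune on an intermediate task, then on the target task) can be sketched with a toy model. Everything here is an assumption made for illustration: `ToyModel` is a tiny perceptron standing in for a pretrained language model, its `encoder` weights play the role of the shared parameters carried across stages, and `head_bias` plays the role of a task-specific head that is reset before the target task. It is not the paper's setup or hyperparameters.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], int]  # (features, binary label)


@dataclass
class ToyModel:
    encoder: List[float]    # shared weights, kept across both stages
    head_bias: float = 0.0  # task-specific parameter, reset per task
    history: List[str] = field(default_factory=list)


def predict(model: ToyModel, x: Sequence[float]) -> int:
    score = sum(w * xi for w, xi in zip(model.encoder, x)) + model.head_bias
    return 1 if score > 0 else 0


def fine_tune(model: ToyModel, data: List[Example], task_name: str,
              epochs: int = 5, lr: float = 0.1) -> None:
    """Perceptron-style updates standing in for gradient fine-tuning."""
    for _ in range(epochs):
        for x, y in data:
            err = y - predict(model, x)
            model.encoder = [w + lr * err * xi
                             for w, xi in zip(model.encoder, x)]
            model.head_bias += lr * err
    model.history.append(task_name)


def stilts(model: ToyModel, intermediate_data: List[Example],
           target_data: List[Example]) -> ToyModel:
    # Stage 1: supplementary training on the intermediate task.
    fine_tune(model, intermediate_data, "intermediate")
    # Stage 2: fresh task-specific head, encoder weights carried over.
    model.head_bias = 0.0
    fine_tune(model, target_data, "target")
    return model


intermediate = [([1.0, 0.0], 1), ([0.0, 1.0], 0)]
target = [([1.0, 1.0], 1), ([-1.0, -1.0], 0)]
model = stilts(ToyModel(encoder=[0.0, 0.0]), intermediate, target)
```

The point of the sketch is the ordering and the parameter reuse: the encoder enters the target stage already shaped by the intermediate task, which is the effect whose origin the paper investigates.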
Taxonomy
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Layer Normalization · Linear Warmup With Linear Decay · Dropout · Softmax · Weight Decay · Adam
