Rethinking Why Intermediate-Task Fine-Tuning Works

Ting-Yun Chang; Chi-Jen Lu

arXiv:2108.11696·cs.CL·September 2, 2021

TL;DR

This paper investigates why intermediate-task fine-tuning (STILTs) helps pretrained language models. It finds that simple tasks can be as beneficial as complex reasoning tasks, which prompts a re-evaluation of the technique's underlying mechanisms.

Contribution

The paper demonstrates that the benefits of intermediate fine-tuning do not come solely from complex reasoning skills, challenging previous assumptions and offering new insights into STILTs.

Findings

01

Simple real-fake discrimination tasks can improve diverse target tasks

02

Intermediate tasks' effectiveness is orthogonal to their complexity or reasoning requirements

03

The results call for re-evaluating the role of intermediate fine-tuning in NLP models

Abstract

Supplementary Training on Intermediate Labeled-data Tasks (STILTs) is a widely applied technique that first fine-tunes a pretrained language model on an intermediate task before fine-tuning it on the target task of interest. Although STILTs can further improve the performance of pretrained language models, it is still unclear why and when it works. Previous research shows that intermediate tasks involving complex inference, such as commonsense reasoning, work especially well for RoBERTa. In this paper, we discover that the improvement from an intermediate task could be orthogonal to whether it contains reasoning or other complex skills: a simple real-fake discrimination task synthesized by GPT2 can benefit diverse target tasks. We conduct extensive experiments to study the impact of different factors on STILTs. These findings suggest rethinking the role of intermediate fine-tuning in the…
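
To make the real-fake discrimination idea concrete, here is a minimal sketch of how such an intermediate dataset could be assembled. It is not the authors' pipeline: the abstract does not say how the GPT2 text is produced, so the prompt-and-continue scheme, the function names `build_real_fake_dataset` and `toy_generator`, and the `prompt_words` parameter are all assumptions made for illustration. A real setup would pass in a function that samples from GPT2 in place of the toy stand-in.

```python
import random
from typing import Callable, List, Tuple


def build_real_fake_dataset(
    real_texts: List[str],
    generate: Callable[[str], str],
    prompt_words: int = 5,
    seed: int = 0,
) -> List[Tuple[str, int]]:
    """Return shuffled (text, label) pairs: label 1 = real, label 0 = generated.

    Assumption: each fake example is produced by prompting the generator
    with the first `prompt_words` words of a real text.
    """
    rng = random.Random(seed)
    examples: List[Tuple[str, int]] = []
    for text in real_texts:
        prompt = " ".join(text.split()[:prompt_words])
        examples.append((text, 1))              # human-written text
        examples.append((generate(prompt), 0))  # machine-written text
    rng.shuffle(examples)
    return examples


def toy_generator(prompt: str) -> str:
    # Hypothetical stand-in for sampling a continuation from GPT2.
    return prompt + " and then something entirely machine written ."


if __name__ == "__main__":
    real = [
        "The committee approved the budget after a long debate on Tuesday .",
        "Heavy rain is expected across the region through the weekend .",
    ]
    for text, label in build_real_fake_dataset(real, toy_generator):
        print(label, text)
```

Under STILTs, a binary classifier head would first be fine-tuned on pairs like these, and the resulting encoder would then be fine-tuned again on the target task.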

Code & Models

Repositories

terarachang/Rethinking_STILT
PyTorch · Official

Taxonomy

Methods

Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Layer Normalization · Linear Warmup With Linear Decay · Dropout · Softmax · Weight Decay · Adam