Intermediate-Task Transfer Learning with Pretrained Models for Natural Language Understanding: When and Why Does It Work?

Yada Pruksachatkun; Jason Phang; Haokun Liu; Phu Mon Htut; Xiaoyi Zhang; Richard Yuanzhe Pang; Clara Vania; Katharina Kann; Samuel R. Bowman

arXiv:2005.00628 · cs.CL · May 12, 2020 · 52 cites

TL;DR

This study investigates when and why intermediate-task training improves the performance of pretrained language models on natural language understanding tasks. It finds that intermediate tasks requiring high-level inference and reasoning tend to help most.

Contribution

The paper provides a large-scale analysis of intermediate-task transfer with RoBERTa, covering 110 intermediate-target task combinations and 25 probing tasks meant to reveal the skills that drive transfer.

Findings

01. High-level inference tasks improve transfer performance.
02. Target task success correlates with coreference resolution abilities.
03. Forgetting pretraining knowledge may limit transfer improvements.

Abstract

While pretrained models such as BERT have shown large gains across natural language understanding tasks, their performance can be improved by further training the model on a data-rich intermediate task, before fine-tuning it on a target task. However, it is still poorly understood when and why intermediate-task training is beneficial for a given target task. To investigate this, we perform a large-scale study on the pretrained RoBERTa model with 110 intermediate-target task combinations. We further evaluate all trained models with 25 probing tasks meant to reveal the specific skills that drive transfer. We observe that intermediate tasks requiring high-level inference and reasoning abilities tend to work best. We also observe that target task performance is strongly correlated with higher-level abilities such as coreference resolution. However, we fail to observe more granular…
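
The correlation analysis mentioned in the abstract can be illustrated with a minimal sketch. It assumes each intermediate-trained model yields one probing-task score and one target-task score, and it uses the Pearson coefficient as one possible correlation measure. The function name `pearson` and all score values below are hypothetical placeholders, not the paper's code or results.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Hypothetical placeholder scores, one entry per intermediate-trained model.
# These are NOT results from the paper.
probe_scores = [0.60, 0.65, 0.70, 0.75]   # e.g. accuracy on one probing task
target_scores = [0.70, 0.72, 0.78, 0.80]  # e.g. accuracy on one target task

r = pearson(probe_scores, target_scores)
print(f"correlation between probing and target scores: {r:.3f}")
```

A coefficient near 1 would indicate that models scoring higher on the probing task also tend to score higher on the target task. It does not by itself show that the probed skill causes the improvement.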

Code & Models

Models

canwenxu/BERT-of-Theseus-MNLI
model · 5 dl · ♡ 1

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning

Methods: Linear Layer · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · RoBERTa · Dense Connections · Adam · WordPiece