The (In)Effectiveness of Intermediate Task Training For Domain Adaptation and Cross-Lingual Transfer Learning
Sovesh Mohapatra, Somesh Mohapatra

TL;DR
This paper investigates when intermediate task training helps transfer learning in NLP. It finds that direct fine-tuning often outperforms intermediate training, except on more generalized tasks, and offers guidance for NLP practitioners.
Contribution
It analyzes when intermediate task training helps or hinders transfer learning across three NLP tasks (text classification, sentiment analysis, and sentence similarity) and three models (BERT, RoBERTa, and XLNet).
Findings
Fine-tuning without intermediate training often yields better performance.
Intermediate training may benefit more generalized tasks.
Results vary depending on task specificity and model used.
Abstract
Transfer learning from large language models (LLMs) has emerged as a powerful technique for knowledge-based fine-tuning on a range of tasks and for adapting models to different domains and even languages. However, it remains an open question if and when transfer learning will work, i.e., whether it leads to positive or negative transfer. In this paper, we analyze knowledge transfer across three natural language processing (NLP) tasks (text classification, sentiment analysis, and sentence similarity) using three LLMs (BERT, RoBERTa, and XLNet). We compare their performance when fine-tuned on target datasets for domain and cross-lingual adaptation tasks, with and without intermediate task training on a larger dataset. Our experiments showed that fine-tuning without intermediate task training can lead to better performance for most tasks, while more generalized tasks might…
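The comparison described in the abstract (fine-tuning on a target dataset with and without a prior intermediate training stage) can be illustrated with a toy sketch. This is not the authors' code: a small logistic regression stands in for BERT, RoBERTa, or XLNet, the datasets are synthetic, and every name here (`make_dataset`, `train`, `accuracy`, `compare`) is a hypothetical helper introduced only for illustration.

```python
import math
import random


def make_dataset(n, weights, seed):
    """Synthetic binary data (an assumption): label is 1 when a linear score is positive."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in weights]
        score = sum(w * xi for w, xi in zip(weights, x))
        data.append((x, 1 if score > 0 else 0))
    return data


def train(data, init=None, epochs=20, lr=0.1):
    """Logistic regression trained by SGD, optionally continuing from `init` weights."""
    w = list(init) if init is not None else [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, y in data:
            z = sum(wi * xi for wi, xi in zip(w, x))
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            p = 1.0 / (1.0 + math.exp(-z))
            w = [wi + lr * (y - p) * xi for wi, xi in zip(w, x)]
    return w


def accuracy(w, data):
    """Fraction of examples whose predicted label matches the true label."""
    correct = 0
    for x, y in data:
        score = sum(wi * xi for wi, xi in zip(w, x))
        if (score > 0) == (y == 1):
            correct += 1
    return correct / len(data)


def compare(intermediate, target_train, target_test):
    """Evaluate both regimes on the same target test set."""
    # Regime 1: fine-tune directly on the target data.
    direct = train(target_train)
    # Regime 2: train on the larger intermediate data first, then on the target data.
    staged = train(target_train, init=train(intermediate))
    return {
        "direct": accuracy(direct, target_test),
        "intermediate_then_target": accuracy(staged, target_test),
    }


if __name__ == "__main__":
    # A related but different intermediate task, and a small target task.
    intermediate = make_dataset(500, [1.0, -1.5, 0.5, 0.0], seed=1)
    target_train = make_dataset(40, [1.0, -2.0, 0.5, 0.8], seed=2)
    target_test = make_dataset(200, [1.0, -2.0, 0.5, 0.8], seed=3)
    print(compare(intermediate, target_train, target_test))
```

Which regime scores higher in this toy depends on how closely the intermediate task resembles the target task, so the output says nothing about the paper's actual results; the sketch only shows the shape of the with/without comparison.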
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Weight Decay · Attention Dropout · WordPiece · Dropout · Layer Normalization · Softmax · BERT
