Breaking through the learning plateaus of in-context learning in Transformer

Jingwen Fu; Tao Yang; Yuwang Wang; Yan Lu; Nanning Zheng

arXiv:2309.06054 · cs.LG · June 7, 2024 · 1 citation

TL;DR

This paper investigates the causes of learning plateaus in Transformer models' in-context learning and proposes strategies to overcome these barriers, enhancing the models' learning efficiency and capabilities.

Contribution

It introduces a conceptual framework that separates the model's internal representation into a "weights component" and a "context component", and develops strategies to accelerate the learning of in-context learning in Transformers.

Findings

01

The persistence of learning plateaus correlates with impaired functionality of the weights component.

02

Proposed strategies effectively break learning plateaus in synthetic tasks.

03

Strategies also improve in-context learning in NLP applications.

Abstract

In-context learning, i.e., learning from context examples, is an impressive ability of Transformers. Training Transformers to possess this in-context learning skill is computationally intensive due to the occurrence of learning plateaus, which are periods within the training process where there is minimal or no enhancement in the model's in-context learning capability. To study the mechanism behind the learning plateaus, we conceptually separate a component within the model's internal representation that is exclusively affected by the model's weights. We call this the "weights component", and the remainder is identified as the "context component". By conducting meticulous and controlled experiments on synthetic tasks, we note that the persistence of learning plateaus correlates with compromised functionality of the weights component. Recognizing the impaired performance of the weights…

Taxonomy

Topics: Machine Learning and Data Classification · Domain Adaptation and Few-Shot Learning · Neural Networks and Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Adam · Byte Pair Encoding · Softmax · Dropout · Label Smoothing · Absolute Position Encodings