Breaking through the learning plateaus of in-context learning in Transformer
Jingwen Fu, Tao Yang, Yuwang Wang, Yan Lu, Nanning Zheng

TL;DR
This paper investigates the causes of learning plateaus in Transformer models' in-context learning and proposes strategies to overcome these barriers, enhancing the models' learning efficiency and capabilities.
Contribution
It introduces a conceptual framework that separates the model's internal representation into a "weights component" and a "context component", and develops strategies to accelerate in-context learning in Transformers.
Findings
The persistence of learning plateaus correlates with impaired functionality of the weights component.
Proposed strategies effectively break learning plateaus in synthetic tasks.
Strategies also improve in-context learning in NLP applications.
Abstract
In-context learning, i.e., learning from context examples, is an impressive ability of Transformer. Training Transformers to possess this in-context learning skill is computationally intensive due to the occurrence of learning plateaus, which are periods within the training process where there is minimal or no enhancement in the model's in-context learning capability. To study the mechanism behind the learning plateaus, we conceptually separate a component within the model's internal representation that is exclusively affected by the model's weights. We call this the "weights component", and the remainder is identified as the "context component". By conducting meticulous and controlled experiments on synthetic tasks, we note that the persistence of learning plateaus correlates with compromised functionality of the weights component. Recognizing the impaired performance of the weights…
Taxonomy
Topics: Machine Learning and Data Classification · Domain Adaptation and Few-Shot Learning · Neural Networks and Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Adam · Byte Pair Encoding · Softmax · Dropout · Label Smoothing · Absolute Position Encodings
