What Happens During the Loss Plateau? Understanding Abrupt Learning in Transformers
Pulkit Gopalani, Wei Hu

TL;DR
This paper investigates the abrupt learning phenomenon in Transformers. It shows that during the loss plateau, models develop partial solutions, exhibit repetition bias in their outputs, and undergo internal representation collapse, and it identifies the slow learning of optimal attention maps as a key bottleneck.
Contribution
The study uncovers mechanisms behind abrupt learning in Transformers, primarily in shallow models, highlighting the role of attention dynamics and internal representation collapse during performance plateaus.
Findings
Repetition bias and representation collapse occur during plateaus.
Slow learning of optimal attention maps is a key bottleneck that delays the eventual rapid convergence.
Directly intervening on attention significantly alters plateau duration and the severity of repetition bias.
Abstract
Training Transformers on algorithmic tasks frequently demonstrates an intriguing abrupt learning phenomenon: an extended performance plateau followed by a sudden, sharp improvement. This work investigates the underlying mechanisms for such dynamics, primarily in shallow Transformers. We reveal that during the plateau, the model often develops an interpretable partial solution while simultaneously exhibiting a strong repetition bias in its outputs. This output degeneracy is accompanied by internal representation collapse, where hidden states across different tokens become nearly parallel. We further identify the slow learning of optimal attention maps as a key bottleneck. Hidden progress in attention configuration during the plateau precedes the eventual rapid convergence, and directly intervening on attention significantly alters plateau duration and the severity of repetition bias…
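The two plateau symptoms named in the abstract, repetition bias and representation collapse, can be made concrete with simple diagnostics. The following is a minimal sketch, not the authors' implementation: the function names `mean_pairwise_cosine` and `repetition_frequency`, the choice of adjacent-token repeats as the repetition measure, and the toy inputs are all assumptions made for illustration. Hidden states that are "nearly parallel" should give a mean pairwise cosine similarity close to 1.

```python
import math
from itertools import combinations


def mean_pairwise_cosine(hidden_states):
    """Average cosine similarity over all pairs of token hidden states.

    hidden_states: list of equal-length, non-zero vectors, one per token
    position. Values near 1.0 mean the vectors are nearly parallel
    (an assumed proxy for representation collapse).
    """
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    pairs = list(combinations(hidden_states, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def repetition_frequency(tokens):
    """Fraction of positions whose token equals the preceding token.

    An assumed, simple proxy for repetition bias in an output sequence
    of at least two tokens.
    """
    repeats = sum(1 for prev, cur in zip(tokens, tokens[1:]) if prev == cur)
    return repeats / (len(tokens) - 1)


# Hypothetical toy data: nearly parallel vectors versus spread-out ones.
collapsed = [[1.0, 2.0], [2.0, 4.1], [0.5, 1.0]]
diverse = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

print(mean_pairwise_cosine(collapsed))
print(mean_pairwise_cosine(diverse))
print(repetition_frequency([3, 3, 3, 1, 1]))
```

Tracking both quantities across training checkpoints would, under these assumptions, show them staying high during the plateau and dropping around the point where the loss falls sharply.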
Taxonomy
Topics: Advancements in Semiconductor Devices and Circuit Design · Ferroelectric and Negative Capacitance Devices · Advanced Memory and Neural Computing
Methods: Pythia
