Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining
Jinchang Zhu, Jindong Li, Yuwen Hao, Chengyu Zou, Rong Fu, and Menglin Yang

TL;DR
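This paper identifies a failure mode in GPT pretraining, premature upper-layer attention specialization, in which upper layers commit to sharp attention patterns before lower-layer features stabilize. It proposes a simple learning-rate intervention on upper-layer Q/K projections that improves final perplexity and downstream accuracy.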
Contribution
It reveals the impact of upper-layer attention timing on pretraining and introduces a targeted learning-rate intervention to mitigate premature specialization.
Findings
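Temporarily slowing only upper-layer Q/K projections during early training improves final perplexity and downstream accuracy.
Multiplicative gated FFNs suppress the upstream residual writes that drive premature attention specialization, so LLaMA-style blocks barely need the intervention.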
Interventions reduce residual-energy growth, enhancing training stability.
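
The first finding can be illustrated with a minimal sketch of a per-parameter learning-rate rule. This is not the authors' schedule: the function names, the definition of "upper" as the top half of the stack, the 0.1 slowdown factor, and the 1000-step window are all assumptions chosen for illustration. The only property taken from the summary is that Q/K projections in upper layers are slowed temporarily while every other parameter keeps its normal learning rate.

```python
def qk_lr_multiplier(layer_idx, n_layers, step,
                     upper_fraction=0.5, slow_factor=0.1, slow_steps=1000):
    """Learning-rate multiplier for one layer's Q/K projections.

    All defaults are hypothetical; the source does not state the
    actual layer cutoff, slowdown factor, or duration.
    """
    if not 0 <= layer_idx < n_layers:
        raise ValueError("layer_idx out of range")
    # Index of the first layer treated as "upper" (assumption: top fraction of the stack).
    first_upper = int(n_layers * (1.0 - upper_fraction))
    is_upper = layer_idx >= first_upper
    # Slow upper-layer Q/K only during the early window, then restore full speed.
    if is_upper and step < slow_steps:
        return slow_factor
    return 1.0


def param_lr(base_lr, param_kind, layer_idx, n_layers, step, **schedule):
    """Effective learning rate for one parameter tensor.

    Only Q/K projections are affected; V, output, FFN, and embedding
    parameters always receive base_lr.
    """
    if param_kind in ("q_proj", "k_proj"):
        return base_lr * qk_lr_multiplier(layer_idx, n_layers, step, **schedule)
    return base_lr
```

In a real training loop, a rule like this would typically be applied through optimizer parameter groups, with the Q/K weights of each upper layer placed in their own group so their learning rate can be rescaled independently.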
Abstract
A causal-decoder block is hierarchical: lower layers build the residual basis that upper layers attend over. We identify a failure mode in GPT pretraining: upper layers commit to sharp attention patterns before lower-layer features stabilize. We call this premature upper-layer attention specialization. Temporarily slowing only upper-layer Q/K projections during early training improves final perplexity and downstream accuracy without altering other parameters; it prevents upper attention from collapsing onto an immature residual basis. In LLaMA-style blocks, the same intervention is nearly unnecessary. Through ablations, we isolate multiplicative gated FFNs (not RMSNorm or bias removal) as the component that suppresses the upstream residual writes driving the failure. A pathwise analysis unifies both findings: the learning-rate intervention reduces a step-size factor, while gated FFNs…
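
To make the architectural contrast concrete, the following sketch compares a standard two-matrix FFN with a multiplicative gated FFN of the kind used in LLaMA-style blocks. It only shows the structural difference: in the gated form, the value written to the residual stream is an elementwise product of a gate branch and an up branch, so a gate near zero scales the write toward zero. The helper names and the toy weights are assumptions, and the sketch does not reproduce the paper's pathwise analysis.

```python
import math


def gelu(z):
    # Exact GELU, the activation commonly used in GPT-style FFNs.
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def silu(z):
    # SiLU (swish), the gate activation in SwiGLU-style FFNs.
    return z / (1.0 + math.exp(-z))


def matvec(W, x):
    # Plain matrix-vector product; W is a list of rows.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def standard_ffn(x, W_in, W_out):
    # GPT-style FFN: W_out @ gelu(W_in @ x).
    hidden = [gelu(z) for z in matvec(W_in, x)]
    return matvec(W_out, hidden)


def gated_ffn(x, W_gate, W_up, W_down):
    # Gated FFN: W_down @ (silu(W_gate @ x) * (W_up @ x)).
    gate = [silu(z) for z in matvec(W_gate, x)]
    up = matvec(W_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(W_down, hidden)
```

With a zero gate matrix, `gated_ffn` writes exactly zero regardless of the up branch, whereas `standard_ffn` has no such multiplicative control over its output.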
