Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining

Jinchang Zhu, Jindong Li, Yuwen Hao, Chengyu Zou, Rong Fu, and Menglin Yang

arXiv:2605.10504·cs.CL·May 12, 2026

TL;DR

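This paper identifies a failure mode in GPT pretraining, premature upper-layer attention specialization, in which upper layers commit to sharp attention patterns before lower-layer features stabilize. It shows that temporarily slowing upper-layer Q/K projections during early training improves final perplexity and downstream accuracy.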

Contribution

The paper shows that the timing of upper-layer attention specialization affects pretraining outcomes, and it introduces a targeted learning-rate intervention on upper-layer Q/K projections to mitigate premature specialization.

Findings

01

Temporarily slowing upper-layer Q/K projections during early training improves final perplexity and downstream accuracy.

02

Multiplicative gated FFNs suppress the upstream residual writes that drive premature attention specialization.

03

Interventions reduce residual-energy growth, enhancing training stability.
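
To make the second finding concrete, here is a minimal sketch of the structural difference between a plain FFN and a multiplicative gated FFN. It assumes the SwiGLU-style form commonly used in LLaMA-style blocks; the function names (`plain_ffn`, `gated_ffn`), the toy weights, and the use of SiLU in both branches are illustrative assumptions, not details taken from the paper.

```python
import math

def matvec(W, x):
    # Multiply a matrix (list of rows) by a vector (list of floats).
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def silu(z):
    # SiLU / swish activation: z * sigmoid(z).
    return z / (1.0 + math.exp(-z))

def plain_ffn(x, W_up, W_down):
    # Ungated FFN: down(act(up(x))). The write to the residual stream
    # is scaled only by the activation of a single projection.
    hidden = [silu(h) for h in matvec(W_up, x)]
    return matvec(W_down, hidden)

def gated_ffn(x, W_gate, W_up, W_down):
    # Gated FFN (SwiGLU-style, assumed form): down(act(gate(x)) * up(x)).
    # The elementwise product lets the gate branch shrink the write.
    gate = [silu(g) for g in matvec(W_gate, x)]
    up = matvec(W_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(W_down, hidden)

# Toy 2-dimensional example with hypothetical weights.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
x = [1.0, 2.0]

plain_out = plain_ffn(x, identity, identity)
gated_out = gated_ffn(x, zeros, identity, identity)
```

In this toy setting, a gate branch that outputs zero removes the FFN's contribution to the residual stream entirely, while the plain FFN with the same up and down projections still writes a nonzero vector. This only illustrates the multiplicative mechanism; it does not reproduce the paper's ablations.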

Abstract

A causal-decoder block is hierarchical: lower layers build the residual basis that upper layers attend over. We identify a failure mode in GPT pretraining: upper layers commit to sharp attention patterns before lower-layer features stabilize. We call this premature upper-layer attention specialization. Temporarily slowing only upper-layer Q/K projections during early training improves final perplexity and downstream accuracy without altering other parameters; it prevents upper attention from collapsing onto an immature residual basis. In LLaMA-style blocks, the same intervention is nearly unnecessary. Through ablations, we isolate multiplicative gated FFNs (not RMSNorm or bias removal) as the component that suppresses the upstream residual writes driving the failure. A pathwise analysis unifies both findings: the learning-rate intervention reduces a step-size factor, while gated FFNs…
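
The learning-rate intervention described in the abstract can be sketched as a per-parameter multiplier. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `qk_lr_multiplier`, the parameter-name suffixes `q_proj.weight` and `k_proj.weight`, and the default values for `slow_steps`, `slow_factor`, and `upper_fraction` are all hypothetical placeholders.

```python
def qk_lr_multiplier(param_name, layer_idx, num_layers, step,
                     slow_steps=1000, slow_factor=0.1, upper_fraction=0.5):
    """Return a learning-rate multiplier for one parameter at one step.

    All defaults are hypothetical placeholders, not values from the paper.
    """
    # Only query and key projections are slowed; values, outputs,
    # FFNs, and embeddings keep the base learning rate.
    is_qk = param_name.endswith(("q_proj.weight", "k_proj.weight"))

    # Treat the top `upper_fraction` of layers as "upper" (assumed split).
    first_upper = num_layers - int(num_layers * upper_fraction)
    is_upper = layer_idx >= first_upper

    # The slowdown is temporary: it applies only during early training.
    in_slow_phase = step < slow_steps

    if is_qk and is_upper and in_slow_phase:
        return slow_factor
    return 1.0
```

In a framework with per-parameter-group learning rates, the returned value would multiply the base learning rate for the matching group at each step, leaving every other parameter on its normal schedule.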
