Attention Saturation and Gradient Suppression at Inflection Layers: Diagnosing and Mitigating Bottlenecks in Transformer Adaptation
Wang Zixian

TL;DR
This paper investigates the causes of gradient suppression in transformer layers during fine-tuning, introduces diagnostic metrics to identify inflection layers, and proposes a targeted fine-tuning strategy using LoRA adapters to improve adaptation efficiency.
Contribution
The paper formalizes how output saturation suppresses gradients at inflection layers, introduces layer-wise diagnostic metrics for identifying those layers, and proposes a diagnose-first, inject-light fine-tuning method that inserts LoRA adapters only at the identified layers.
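As a rough illustration of what injecting LoRA only at selected layers could look like, here is a minimal NumPy sketch of the generic LoRA update (frozen weight plus a scaled low-rank product). The names `LoRALinear` and `inject_at`, the rank, and the `alpha` scaling are assumptions made for this example, not the author's code or settings.

```python
import numpy as np

class LoRALinear:
    """Frozen linear map plus a trainable low-rank update: W + (alpha / r) * B @ A."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                                   # frozen pre-trained weight
        self.A = rng.normal(scale=0.01, size=(rank, in_dim))   # trainable down-projection
        self.B = np.zeros((out_dim, rank))                     # trainable, zero-initialised
        self.scale = alpha / rank

    def __call__(self, x):
        # Base path plus low-rank path; with B == 0 this equals the base layer.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def num_trainable(self):
        return self.A.size + self.B.size


def inject_at(weights, layer_ids, rank=4):
    """Hypothetical helper: build adapters only for the chosen layer indices."""
    return {i: LoRALinear(weights[i], rank=rank) for i in layer_ids}
```

Because `B` starts at zero, an adapted layer initially reproduces the pre-trained output, so adaptation begins from the original model and only the selected layers gain trainable parameters.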
Findings
Over-trained models benefit from inflection-layer LoRA injection.
Under-trained models experience performance degradation with inflection-layer injection.
Unblocking inflection layers enhances high-level and low-level feature adaptation.
Abstract
Pre-trained Transformers often exhibit over-confidence in source patterns and difficulty in forming new target-domain patterns during fine-tuning. We formalize the mechanism of output saturation leading to gradient suppression through standard cross-entropy and softmax analysis, showing that gradient suppression at inflection layers confines adaptation to high-level recombination of existing features while preventing low-level reconstruction. We introduce a set of layer-wise diagnostic metrics -- attention entropy (saturation proxy), activation gradient norm, parameter gradient norm, and Delta-CKA under a shared PCA basis -- to identify inflection layers characterized by both low attention entropy and steep gradient decay. Building on these findings, we propose a diagnose-first, inject-light fine-tuning strategy: selectively inserting LoRA adapters at inflection layers to restore…
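The link between saturation and gradient suppression can be illustrated with the standard identity that the gradient of cross-entropy with respect to the logits is softmax(logits) minus the one-hot target. The NumPy sketch below is an illustration under stated assumptions, not the paper's implementation: the function names, the toy logits, and the scaling factors are made up for the example, and attention entropy is taken to be the mean Shannon entropy of each attention row.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_entropy(attn, eps=1e-12):
    """Mean Shannon entropy (in nats) over attention rows; attn has shape (..., keys)."""
    return float(-(attn * np.log(attn + eps)).sum(axis=-1).mean())

def ce_logit_grad_norm(logits, target):
    """L2 norm of d(cross-entropy)/d(logits) = softmax(logits) - one_hot(target)."""
    grad = softmax(logits)
    grad[target] -= 1.0
    return float(np.linalg.norm(grad))

# Toy setup: scaling the logits up makes the distribution more peaked (more saturated).
base_logits = np.array([2.0, 1.0, 0.5, 0.0])
scales = [1.0, 4.0, 16.0]
grad_norms = [ce_logit_grad_norm(s * base_logits, target=0) for s in scales]

# Toy attention scores for 4 queries over 8 keys, sharpened by the same scales.
scores = np.random.default_rng(0).normal(size=(4, 8))
entropies = [attention_entropy(softmax(s * scores)) for s in scales]
```

In this toy setting both `grad_norms` and `entropies` shrink as the scale grows: a more confident output leaves less gradient signal, and sharper attention rows have lower entropy, which is why low entropy can serve as a saturation proxy.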
Taxonomy
Topics: Thin-Film Transistor Technologies · Advanced Memory and Neural Computing · Neural Networks and Reservoir Computing
