Attention Saturation and Gradient Suppression at Inflection Layers: Diagnosing and Mitigating Bottlenecks in Transformer Adaptation

Wang Zixian

arXiv:2511.00797·cs.LG·November 4, 2025

TL;DR

This paper investigates the causes of gradient suppression in transformer layers during fine-tuning, introduces diagnostic metrics to identify inflection layers, and proposes a targeted fine-tuning strategy using LoRA adapters to improve adaptation efficiency.

Contribution

It formalizes the mechanism of gradient suppression at inflection layers, introduces diagnostic metrics for layer identification, and proposes a diagnose-first, inject-light fine-tuning method with LoRA adapters.
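As a rough illustration of the first diagnostic metric named in the abstract, attention entropy used as a saturation proxy, the sketch below computes the mean Shannon entropy of attention rows with NumPy. The function name `attention_entropy`, the `(heads, queries, keys)` layout, and the toy inputs are assumptions made for this example; the paper's exact definition and averaging scheme are not given on this page.

```python
import numpy as np

def attention_entropy(attn, eps=1e-12):
    """Mean Shannon entropy (in nats) over attention rows.

    attn: array of shape (..., queries, keys). Each row along the last
    axis is assumed to be a probability distribution (softmax output).
    eps guards against log(0) for fully saturated rows.
    """
    attn = np.asarray(attn, dtype=np.float64)
    row_entropy = -np.sum(attn * np.log(attn + eps), axis=-1)
    return float(row_entropy.mean())

# Toy inputs (hypothetical): one head, four queries, four keys.
uniform = np.full((1, 4, 4), 0.25)                               # diffuse attention
peaked = np.tile(np.array([0.97, 0.01, 0.01, 0.01]), (1, 4, 1))  # near-saturated attention

print("uniform:", attention_entropy(uniform))
print("peaked: ", attention_entropy(peaked))
```

Uniform attention over four keys reaches the maximum entropy of log 4 nats, while the peaked rows score far lower. A layer whose average sits near zero would be a candidate for the "low attention entropy" half of the inflection-layer criterion.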

Findings

01

Over-trained models benefit from inflection-layer LoRA injection.

02

Under-trained models experience performance degradation with inflection-layer injection.

03

Unblocking inflection layers enhances high-level and low-level feature adaptation.

Abstract

Pre-trained Transformers often exhibit over-confidence in source patterns and difficulty in forming new target-domain patterns during fine-tuning. We formalize the mechanism of output saturation leading to gradient suppression through standard cross-entropy and softmax analysis, showing that gradient suppression at inflection layers confines adaptation to high-level recombination of existing features while preventing low-level reconstruction. We introduce a set of layer-wise diagnostic metrics -- attention entropy (saturation proxy), activation gradient norm, parameter gradient norm, and Delta-CKA under a shared PCA basis -- to identify inflection layers characterized by both low attention entropy and steep gradient decay. Building on these findings, we propose a diagnose-first, inject-light fine-tuning strategy: selectively inserting LoRA adapters at inflection layers to restore…
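The link between output saturation and gradient suppression follows from a standard identity: for softmax followed by cross-entropy, the gradient with respect to the logits is the predicted distribution minus the one-hot target. The sketch below checks numerically that this gradient shrinks as the model becomes more confident in the target class. The helper names and the five-class toy setup are assumptions for illustration, and the paper's own formalization may differ in detail.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def ce_grad_wrt_logits(z, target):
    """Gradient of cross-entropy(softmax(z), target) w.r.t. z, i.e. p - y."""
    p = softmax(z)
    y = np.zeros_like(p)
    y[target] = 1.0
    return p - y

def grad_norm_for_margin(margin, num_classes=5):
    """Gradient norm when the target logit leads all others by `margin`."""
    z = np.zeros(num_classes)
    z[0] = margin
    return float(np.linalg.norm(ce_grad_wrt_logits(z, target=0)))

# Larger margins mean a more saturated (over-confident) output.
margins = (0.0, 2.0, 5.0, 10.0)
norms = [grad_norm_for_margin(m) for m in margins]
for m, n in zip(margins, norms):
    print(f"margin={m:5.1f}  grad_norm={n:.6f}")
```

The printed norms fall steadily as the margin grows, which is the suppression effect in miniature: once the output is saturated, little error signal remains to propagate back into earlier layers.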

Taxonomy

Topics: Thin-Film Transistor Technologies · Advanced Memory and Neural Computing · Neural Networks and Reservoir Computing