Geometric and Dynamic Scaling in Deep Transformers
Haoran Su, Chenyu You

TL;DR
This paper identifies geometric issues as the cause of deep Transformer collapse and proposes a new architecture, MGT, with manifold constraints and dynamic feature erasure to enable ultra-deep training.
Contribution
It introduces a geometric framework and the Manifold-Geometric Transformer (MGT) that prevent rank collapse by constraining residual updates and allowing feature erasure.
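The summary above does not say how MGT constrains residual updates or erases features, so the following is only an illustrative sketch under stated assumptions, not the authors' method. It uses the unit sphere as a stand-in manifold, projects the raw update onto the tangent space, and applies a scalar gate to down-weight the old feature before retracting to the sphere. The names `tangent_project`, `constrained_residual_step`, and `erase_gate` are hypothetical.

```python
import numpy as np


def tangent_project(x, u):
    """Remove the component of each update u that points along x.

    Assumes every row of x has unit norm, so the result lies in the
    tangent space of the unit sphere at x.
    """
    return u - np.sum(u * x, axis=-1, keepdims=True) * x


def constrained_residual_step(x, u, erase_gate, eps=1e-12):
    """One hypothetical constrained residual step on the unit sphere.

    x          : (n, d) array of unit-norm features
    u          : (n, d) array of raw residual updates
    erase_gate : (n, 1) array in [0, 1]; 1 discards the old feature,
                 0 keeps it in full (an assumed gating scheme)
    """
    kept = (1.0 - erase_gate) * x        # down-weight outdated content
    moved = kept + tangent_project(x, u)  # step only along tangent directions
    norm = np.linalg.norm(moved, axis=-1, keepdims=True)
    return moved / (norm + eps)           # retract back onto the sphere


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 8))
    x /= np.linalg.norm(x, axis=-1, keepdims=True)
    u = rng.normal(size=(4, 8))
    gate = np.full((4, 1), 0.3)
    y = constrained_residual_step(x, u, gate)
    print(np.linalg.norm(y, axis=-1))  # norms of the updated features
```

By contrast, a standard residual step `x + u` has no projection, gate, or retraction, which is the behavior the paper argues leads to drift and monotonic accumulation.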
Findings
Manifold constraints prevent uncontrolled drift.
Dynamic erasure enables stable deep representations.
The proposed framework predicts that avoiding collapse requires geometric validity.
Abstract
Despite their empirical success, pushing Transformer architectures to extreme depth often leads to a paradoxical failure: representations become increasingly redundant, lose rank, and ultimately collapse. Existing explanations largely attribute this phenomenon to optimization instability or vanishing gradients, yet such accounts fail to explain why collapse persists even under modern normalization and initialization schemes. In this paper, we argue that the collapse of deep Transformers is fundamentally a geometric problem. Standard residual updates implicitly assume that feature accumulation is always beneficial, but offer no mechanism to constrain update directions or to erase outdated information. As depth increases, this leads to systematic drift off the semantic manifold and monotonic feature accumulation, causing representational degeneracy. We propose a unified geometric…
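One common way to quantify the rank loss described in the abstract is the entropy-based effective rank of a layer's feature matrix. The sketch below is a generic diagnostic, not a metric confirmed by the paper; the function name `effective_rank` and the `eps` threshold are assumptions made for illustration.

```python
import numpy as np


def effective_rank(features, eps=1e-12):
    """Entropy-based effective rank of a (tokens, dim) feature matrix.

    Normalizes the singular values into a distribution and returns
    exp(Shannon entropy). The value ranges from about 1 (all energy in
    one direction) up to min(tokens, dim) (energy spread evenly).
    """
    s = np.linalg.svd(features, compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]  # drop numerically zero directions
    return float(np.exp(-np.sum(p * np.log(p))))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    full = rng.normal(size=(64, 32))                 # generic features
    collapsed = np.outer(rng.normal(size=64), rng.normal(size=32))  # rank 1
    print(effective_rank(full), effective_rank(collapsed))
```

Evaluating this quantity on the hidden states of each layer gives a depth profile; a steady decline toward 1 would be consistent with the collapse the abstract describes.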
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Graph Neural Networks · Domain Adaptation and Few-Shot Learning
