Geometric and Dynamic Scaling in Deep Transformers

Haoran Su; Chenyu You

arXiv:2601.01014·cs.LG·January 16, 2026

Open Access

TL;DR

This paper argues that representation collapse in very deep Transformers is a geometric problem and proposes a new architecture, the Manifold-Geometric Transformer (MGT), which combines manifold constraints with dynamic feature erasure to enable ultra-deep training.

Contribution

It introduces a geometric framework and the Manifold-Geometric Transformer (MGT) that prevent rank collapse by constraining residual updates and allowing feature erasure.

Findings

01

Manifold constraints prevent uncontrolled drift.

02

Dynamic erasure enables stable deep representations.

03

The proposed framework predicts that avoiding collapse requires geometric validity.
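
To make the first two findings concrete, here is a toy sketch of what a constrained residual update with erasure could look like. It is not the paper's formulation: the unit sphere stands in for the semantic manifold, and the names `tangent_project`, `constrained_step`, and `gate` are hypothetical choices made for illustration only.

```python
import numpy as np

def tangent_project(x, u):
    """Remove from u its component along x (rows of x assumed unit-norm).

    On the unit sphere, this keeps only the part of the update that is
    tangent to the sphere at x. The sphere is an assumed stand-in manifold.
    """
    return u - (u * x).sum(axis=-1, keepdims=True) * x

def constrained_step(x, u, gate):
    """Toy update: partially erase x, add a tangent update, renormalize.

    gate is a hypothetical erasure strength in [0, 1), either a scalar or an
    array that broadcasts against x (for example one value per feature).
    """
    y = (1.0 - gate) * x + tangent_project(x, u)
    # Retraction back onto the unit sphere so the state cannot drift off it.
    return y / np.linalg.norm(y, axis=-1, keepdims=True)

# Usage: 5 tokens with 8 features each, random state and random update.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
x /= np.linalg.norm(x, axis=-1, keepdims=True)
u = rng.normal(size=(5, 8))
x_next = constrained_step(x, u, gate=0.1)
```

By contrast, a plain residual update `x + u` has neither the projection nor the gate, so nothing limits the direction of the update or removes old features.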

Abstract

Despite their empirical success, pushing Transformer architectures to extreme depth often leads to a paradoxical failure: representations become increasingly redundant, lose rank, and ultimately collapse. Existing explanations largely attribute this phenomenon to optimization instability or vanishing gradients, yet such accounts fail to explain why collapse persists even under modern normalization and initialization schemes. In this paper, we argue that the collapse of deep Transformers is fundamentally a geometric problem. Standard residual updates implicitly assume that feature accumulation is always beneficial, but offer no mechanism to constrain update directions or to erase outdated information. As depth increases, this leads to systematic drift off the semantic manifold and monotonic feature accumulation, causing representational degeneracy. We propose a unified geometric…
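
The loss of rank described in the abstract can be monitored with a simple diagnostic. The sketch below computes an entropy-based effective rank from the singular values of a layer's token-by-feature matrix. This measure is a common general-purpose choice and is an assumption here; the abstract does not say which rank metric the authors use, and the function name `effective_rank` is illustrative.

```python
import numpy as np

def effective_rank(X, eps=1e-12):
    """Entropy-based effective rank of a (tokens x features) matrix.

    Singular values are normalized into a probability distribution; the
    exponential of its Shannon entropy is returned. The value is close to
    the number of non-zero singular values when they are equal, and close
    to 1 when a single direction dominates.
    """
    s = np.linalg.svd(X, compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]  # drop numerically zero entries before taking logs
    return float(np.exp(-(p * np.log(p)).sum()))

# Usage: full-rank versus fully collapsed representations.
full = np.eye(4)             # four equally weighted directions
collapsed = np.ones((4, 4))  # every token shares the same feature vector
r_full = effective_rank(full)
r_collapsed = effective_rank(collapsed)
```

Tracking this value layer by layer is one way to see whether representations become redundant as depth increases.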

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Graph Neural Networks · Domain Adaptation and Few-Shot Learning