Sparse Growing Transformer: Training-Time Sparse Depth Allocation via Progressive Attention Looping

Yao Chen; Yilong Chen; Yinqi Yang; Junyuan Shang; Zhenyu Zhang; Zefeng Zhang; Shuaiyi Nie; Shuohuan Wang; Yu Sun; Hua Wu; HaiFeng Wang; Tingwen Liu

arXiv:2603.23998 · cs.CL · April 17, 2026

TL;DR

The paper introduces the Sparse Growing Transformer (SGT), a method that allocates additional depth progressively during training by selectively looping attention heads, improving both efficiency and performance over static block-level looping baselines.

Contribution

SGT is a novel training-time sparse depth allocation framework that progressively extends recurrence in Transformers, reducing redundancy and enhancing training efficiency.

Findings

01

SGT outperforms static block-level looping baselines across multiple scales.

02

SGT reduces the additional training FLOPs overhead from 16-20% to 1-3%.

03

SGT leverages high-entropy attention heads, which play a crucial role in semantic integration.

Abstract

Existing approaches to increasing the effective depth of Transformers predominantly rely on parameter reuse, extending computation through recursive execution. Under this paradigm, the network structure remains static along the training timeline, and additional computational depth is uniformly assigned to entire blocks at the parameter level. This rigidity across training time and parameter space leads to substantial computational redundancy during training. In contrast, we argue that depth allocation during training should not be a static preset, but rather a progressively growing structural process. Our systematic analysis reveals a deep-to-shallow maturation trajectory across layers, where high-entropy attention heads play a crucial role in semantic integration. Motivated by this observation, we introduce the Sparse Growing Transformer (SGT). SGT is a training-time sparse depth…
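The abstract refers to high-entropy attention heads without saying, in the portion shown here, how entropy is measured or how heads are chosen. As a rough illustration only, the sketch below computes the mean Shannon entropy of each head's attention distribution and picks the highest-entropy heads. The function names `head_entropy` and `top_entropy_heads`, the averaging over query positions, and the top-k selection rule are all assumptions made for this example, not the paper's procedure.

```python
import numpy as np

def head_entropy(attn):
    """Mean Shannon entropy (in nats) of each head's attention rows.

    attn: array of shape (heads, queries, keys); each row sums to 1.
    Averaging over queries is an assumption of this sketch.
    """
    eps = 1e-12  # avoids log(0) for exactly-zero attention weights
    row_entropy = -(attn * np.log(attn + eps)).sum(axis=-1)  # (heads, queries)
    return row_entropy.mean(axis=-1)  # (heads,)

def top_entropy_heads(attn, k):
    """Indices of the k highest-entropy heads, highest first.

    Hypothetical selection rule, used here only for illustration.
    """
    return np.argsort(-head_entropy(attn))[:k]

# Toy input: head 0 attends uniformly, head 1 attends to a single key.
n = 4
uniform = np.full((n, n), 1.0 / n)   # maximally diffuse attention
peaked = np.eye(n)                   # one-hot attention
attn = np.stack([uniform, peaked])   # shape (2, 4, 4)

entropies = head_entropy(attn)
selected = top_entropy_heads(attn, k=1)
print(entropies, selected)
```

In this toy case the uniform head has entropy close to log(4) and the one-hot head has entropy close to zero, so the uniform head is the one selected. A head chosen this way would be a candidate for extra looping under the assumed rule.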
