Do Depth-Grown Models Overcome the Curse of Depth? An In-Depth Analysis
Ferdinand Kapl, Emmanouil Angelis, Tobias Höppe, Kaitlin Maile, Johannes von Oswald, Nino Scherrer, Stefan Bauer

TL;DR
This paper investigates how gradually increasing the depth of Transformer models during training enhances their reasoning capabilities by improving depth utilization, restructuring residual streams, and forming modular computational blocks, thus overcoming the traditional curse of depth.
Contribution
It provides a mechanistic understanding of depth growth benefits, linking it to better depth utilization and circuit formation, and proposes a lightweight modification to further improve reasoning performance.
Findings
Depth-wise analysis shows improved utilization of later layers in grown models.
Gradual depth growth alters residual stream structure and creates permutable blocks.
Modified MIDAS achieves better downstream reasoning benchmarks.
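As a rough illustration of what a depth-wise utilization measurement can look like, the sketch below computes, for each layer, the norm of that layer's update relative to the norm of the residual stream it is added to. This is an assumed, simplified metric: the summary does not specify the paper's exact diagnostics, and the names `l2_norm` and `relative_contributions` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def l2_norm(v: Vector) -> float:
    """Euclidean norm of a vector given as a plain sequence of floats."""
    return math.sqrt(sum(c * c for c in v))


def relative_contributions(
    residuals: Sequence[Vector], updates: Sequence[Vector]
) -> List[float]:
    """Per-layer ratio ||update_l|| / ||residual_l||.

    residuals[l] is the residual stream entering layer l and updates[l]
    is what layer l adds to it. A ratio near zero suggests the layer
    barely changes the stream (an assumed proxy for low utilization).
    """
    if len(residuals) != len(updates):
        raise ValueError("need one update per residual state")
    ratios = []
    for r, u in zip(residuals, updates):
        denom = l2_norm(r)
        ratios.append(l2_norm(u) / denom if denom > 0 else float("inf"))
    return ratios


# Toy data (made up): the second layer adds nothing to the stream.
toy_residuals = [[3.0, 4.0], [3.3, 4.4]]
toy_updates = [[0.3, 0.4], [0.0, 0.0]]
toy_ratios = relative_contributions(toy_residuals, toy_updates)
print(toy_ratios)
```

In a real model the vectors would be hidden states collected per layer and averaged over tokens; the toy values here only show the shape of the computation.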
Abstract
Gradually growing the depth of Transformers during training can not only reduce training cost but also lead to improved reasoning performance, as shown by MIDAS (Saunshi et al., 2024). Thus far, however, a mechanistic understanding of these gains has been missing. In this work, we establish a connection to recent work showing that layers in the second half of non-grown, pre-layernorm Transformers contribute much less to the final output distribution than those in the first half - also known as the Curse of Depth (Sun et al., 2025; Csordás et al., 2025). Using depth-wise analyses, we demonstrate that growth via gradual middle stacking yields more effective utilization of model depth, alters the residual stream structure, and facilitates the formation of permutable computational blocks. In addition, we propose a lightweight modification of MIDAS that yields further improvements in…
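The growth operation named in the abstract, gradual middle stacking, can be pictured with a toy sketch. It assumes a model is a list of blocks and that one growth step duplicates the block at the middle index, with the copy initialized from the original. The choice of middle index for an even block count, the function names `grow_middle` and `grow_to`, and the example sizes are illustrative assumptions, not the actual MIDAS schedule.

```python
from typing import List


def grow_middle(blocks: List[int]) -> List[int]:
    """Return a new block list with the middle block duplicated in place.

    Each integer is a provenance id: a copy carries the id of the block
    whose weights it was initialized from.
    """
    if not blocks:
        raise ValueError("need at least one block")
    mid = len(blocks) // 2  # assumption: upper middle when the count is even
    return blocks[: mid + 1] + [blocks[mid]] + blocks[mid + 1 :]


def grow_to(blocks: List[int], target: int) -> List[List[int]]:
    """Apply grow_middle until `target` blocks exist; return every stage."""
    current = list(blocks)
    stages = [current]
    while len(current) < target:
        current = grow_middle(current)
        stages.append(current)
    return stages


# Hypothetical example: grow a 3-block model to 5 blocks.
stages = grow_to([0, 1, 2], target=5)
for stage in stages:
    print(len(stage), stage)
```

Note that the first and last blocks are never duplicated in this sketch; only the middle of the stack is repeated, which is the property the name "middle stacking" refers to.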
Peer Reviews
Decision·Submitted to ICLR 2026
- Mechanistic insight into internal computation rather than only surface-level gains.
- Multiple convergent forms of evidence (early-exit, swap/reverse, contribution metrics).
- Clear link to Curse of Depth and demonstration of how growth reactivates late layers.
- Well-scoped and well-written with a strong explanatory narrative.
- Practical relevance for growth-based strategies in reasoning-oriented LMs.
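The swap, reverse, and skip interventions mentioned in the reviews can be sketched as manipulations of the order in which layers are applied to a residual stream. The code below is a minimal toy under stated assumptions: layers are scalar functions, the stream is a single float, and all function names are hypothetical. It does not reproduce the paper's protocol.

```python
from typing import Callable, List, Sequence

Layer = Callable[[float], float]


def forward(layers: Sequence[Layer], order: Sequence[int], x: float) -> float:
    """Residual forward pass: x <- x + f_l(x), visiting layers in `order`."""
    for idx in order:
        x = x + layers[idx](x)
    return x


def swap_blocks(order: Sequence[int], a: int, b: int, size: int) -> List[int]:
    """Swap two non-overlapping blocks of `size` layers starting at a and b."""
    if not (0 <= a and a + size <= b and b + size <= len(order)):
        raise ValueError("blocks must be in range, ordered, and non-overlapping")
    out = list(order)
    out[a : a + size] = order[b : b + size]
    out[b : b + size] = order[a : a + size]
    return out


def reverse_block(order: Sequence[int], start: int, size: int) -> List[int]:
    """Reverse the layer order inside one block."""
    out = list(order)
    out[start : start + size] = reversed(order[start : start + size])
    return out


def skip_block(order: Sequence[int], start: int, size: int) -> List[int]:
    """Drop one block entirely, so the stream passes through unchanged."""
    return list(order[:start]) + list(order[start + size :])


# Toy 8-layer stack with made-up per-layer scales.
toy_layers = [lambda x, s=s: s * x for s in (0.5, 0.4, 0.3, 0.2, 0.1, 0.1, 0.05, 0.05)]
base_order = list(range(8))
baseline = forward(toy_layers, base_order, 1.0)
skipped = forward(toy_layers, skip_block(base_order, 2, 2), 1.0)
print(baseline, skipped)
```

Comparing the output under each modified order against `baseline` gives a simple robustness measure; in a language model the comparison would be on logits or loss rather than a scalar.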
### 1. Limited architectural generality
All experiments are on SmolLM (360M / 1.7B), a single pre-LN, short-context family. Since the contribution is mechanistic in nature, it remains unclear whether the observed “resurrected depth utilization” is a *property of staged growth itself*, or a *property of this architecture family*.
### 2. Causal link between permutation robustness and “depth utilization” remains implicit
For example, section 4.2 shows that grown models are more robust to block-le
- Mechanistic depth. Converging diagnostics (tuned‑lens, early‑exit, swap/reverse/skip) make a coherent case that growth increases effective depth usage.
- Actionable variant. LIDAS is a lightweight change that preserves or improves reasoning without harming NLL (i.e., token-level negative log-likelihood/perplexity on held-out text), indicating no regression in general language modeling quality.
- Reproducibility. Setups and intervention protocols are described clearly; the narrative is easy to
- Compute accounting / fairness. Main comparisons fix steps, not FLOPs. Since growth changes training compute, a FLOPs‑matched baseline (e.g., truncating baseline steps to ≈77%) is needed to support “efficiency–performance” claims. Error bars (multi‑seed) are also missing on the headline numbers.
- Cross‑method context. The paper positions growth as a remedy for pre‑LN “curse of depth,” yet omits direct comparisons to LayerNorm scaling baselines. A small 2×2 factorial (Pre‑LN vs Mix‑LN) × (no gr
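The FLOPs-matching concern raised above can be made concrete with a small calculation. The sketch assumes per-step training cost scales linearly with the number of layers (ignoring embeddings and the output head) and uses a hypothetical four-stage schedule; it is not the paper's schedule and does not reproduce the ≈77% figure quoted in the review. The function names are illustrative.

```python
from typing import Sequence


def relative_compute(
    stage_depths: Sequence[int], stage_step_fractions: Sequence[float]
) -> float:
    """Compute of a growth schedule relative to full-depth training.

    Assumes cost per step is proportional to depth, and that the last
    stage trains at the final depth.
    """
    if len(stage_depths) != len(stage_step_fractions):
        raise ValueError("need one step fraction per stage")
    if abs(sum(stage_step_fractions) - 1.0) > 1e-9:
        raise ValueError("step fractions must sum to 1")
    final_depth = stage_depths[-1]
    return sum(
        frac * depth / final_depth
        for depth, frac in zip(stage_depths, stage_step_fractions)
    )


def matched_baseline_steps(total_steps: int, rel_compute: float) -> int:
    """Steps a full-depth baseline may take for the same approximate FLOPs."""
    return round(total_steps * rel_compute)


# Hypothetical schedule: 8 -> 16 -> 24 -> 32 layers, equal steps per stage.
rel = relative_compute([8, 16, 24, 32], [0.25, 0.25, 0.25, 0.25])
steps = matched_baseline_steps(100_000, rel)
print(rel, steps)
```

Under these assumptions the grown run uses a fraction `rel` of the baseline's compute, so a fair comparison would stop the fixed-depth baseline after `steps` updates rather than after the full step budget.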
- Extensive experiments are conducted on the impact of deeper layers, comparing standard fixed-depth models with recent depth-grown models.
- In addition, LIDAS, a variant of the MIDAS method, is proposed and attains superior performance on reasoning-intensive tasks.
- The presentation is easy to follow; the hypothesis, evidence, results, and interpretation are presented clearly.
This study experimentally collects observations on fixed-depth and depth-grown models. While I appreciate them, one of the major weaknesses of this work is that the connection between these observations is unclear, and the practical takeaway from them is limited. Fixed-depth models do not take full advantage of their depth, and deeper layers can be dropped at a small cost in performance. This is already known, and the experiments collect related observations from layer-wise ana
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Reinforcement Learning in Robotics · Explainable Artificial Intelligence (XAI)
