Do Depth-Grown Models Overcome the Curse of Depth? An In-Depth Analysis

Ferdinand Kapl; Emmanouil Angelis; Tobias Höppe; Kaitlin Maile; Johannes von Oswald; Nino Scherrer; Stefan Bauer

arXiv:2512.08819·cs.CL·December 10, 2025

Open Access 3 Reviews

TL;DR

This paper investigates how gradually increasing the depth of Transformer models during training enhances their reasoning capabilities by improving depth utilization, restructuring residual streams, and forming modular computational blocks, thus overcoming the traditional curse of depth.

Contribution

It provides a mechanistic understanding of depth growth benefits, linking it to better depth utilization and circuit formation, and proposes a lightweight modification to further improve reasoning performance.

Findings

01

Depth-wise analysis shows improved utilization of later layers in grown models.

02

Gradual depth growth alters residual stream structure and creates permutable blocks.

03

A modified variant of MIDAS achieves better results on downstream reasoning benchmarks.
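The permutable-block finding can be illustrated with a toy layer-order intervention. The following is a minimal sketch, not the paper's protocol: the helper names (`split_blocks`, `swap_blocks`, `reverse_block`, `run_residual`) are hypothetical, and the "layers" are plain functions applied as residual updates rather than Transformer layers. It shows only how swapping or reversing blocks of consecutive layers changes the execution order, and that the output changes when the layers do not commute.

```python
def split_blocks(order, block_size):
    """Split a list of layer indices into consecutive blocks of equal size."""
    if block_size <= 0 or len(order) % block_size != 0:
        raise ValueError("block_size must evenly divide the number of layers")
    return [order[i:i + block_size] for i in range(0, len(order), block_size)]


def swap_blocks(order, block_size, a, b):
    """Return a new layer order with the a-th and b-th blocks exchanged."""
    blocks = split_blocks(order, block_size)
    blocks[a], blocks[b] = blocks[b], blocks[a]
    return [idx for block in blocks for idx in block]


def reverse_block(order, block_size, a):
    """Return a new layer order with the layers inside the a-th block reversed."""
    blocks = split_blocks(order, block_size)
    blocks[a] = blocks[a][::-1]
    return [idx for block in blocks for idx in block]


def run_residual(updates, order, x):
    """Apply residual updates x <- x + f(x) in the given layer order."""
    for idx in order:
        x = x + updates[idx](x)
    return x


# Toy two-layer stack (an assumption for illustration): the layers do not commute.
updates = [lambda x: 1, lambda x: x]
baseline = run_residual(updates, [0, 1], 1)
swapped = run_residual(updates, swap_blocks([0, 1], 1, 0, 1), 1)
print(baseline, swapped)  # prints "4 3"
```

In a real model, the quantity of interest would be how much a downstream metric degrades under the permuted order compared with the original one; a small gap would indicate that the swapped blocks are close to interchangeable.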

Abstract

Gradually growing the depth of Transformers during training can not only reduce training cost but also lead to improved reasoning performance, as shown by MIDAS (Saunshi et al., 2024). Thus far, however, a mechanistic understanding of these gains has been missing. In this work, we establish a connection to recent work showing that layers in the second half of non-grown, pre-layernorm Transformers contribute much less to the final output distribution than those in the first half, a phenomenon known as the Curse of Depth (Sun et al., 2025; Csordás et al., 2025). Using depth-wise analyses, we demonstrate that growth via gradual middle stacking yields more effective utilization of model depth, alters the residual stream structure, and facilitates the formation of permutable computational blocks. In addition, we propose a lightweight modification of MIDAS that yields further improvements in…
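The idea of growing depth by stacking in the middle can be sketched on a plain list of layers. This is an illustrative sketch under stated assumptions, not the MIDAS implementation: the function name `grow_middle`, the choice of which block is duplicated, and where the copy is inserted are all assumptions, since the abstract does not specify the schedule or the exact copy rule.

```python
from copy import deepcopy


def grow_middle(layers, block_size):
    """Return a deeper stack by duplicating the middle block of `layers`.

    Assumption: the middle `block_size` layers are copied and the copy is
    inserted directly after the original block. The first and last layers
    keep their positions at the ends of the stack.
    """
    if block_size <= 0 or block_size > len(layers):
        raise ValueError("block_size must be between 1 and len(layers)")
    start = (len(layers) - block_size) // 2
    end = start + block_size
    # Deep-copy so the new layers can be trained independently of the originals.
    middle_copy = [deepcopy(layer) for layer in layers[start:end]]
    return layers[:end] + middle_copy + layers[end:]


# Strings stand in for layer modules in this toy example.
stack = ["L0", "L1", "L2", "L3"]
grown = grow_middle(stack, block_size=2)
print(grown)  # ['L0', 'L1', 'L2', 'L1', 'L2', 'L3']
```

Calling `grow_middle` repeatedly at chosen training steps would yield a staged growth schedule; the stage lengths are a separate design choice that this sketch leaves open.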

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- Mechanistic insight into internal computation rather than only surface-level gains.
- Multiple convergent forms of evidence (early-exit, swap/reverse, contribution metrics).
- Clear link to Curse of Depth and demonstration of how growth reactivates late layers.
- Well-scoped and well-written with a strong explanatory narrative.
- Practical relevance for growth-based strategies in reasoning-oriented LMs.

Weaknesses

### 1. Limited architectural generality

All experiments are on SmolLM (360M / 1.7B), a single pre-LN, short-context family. Since the contribution is mechanistic in nature, it remains unclear whether the observed “resurrected depth utilization” is a *property of staged growth itself* or a *property of this architecture family*.

### 2. Causal link between permutation robustness and “depth utilization” remains implicit

For example, section 4.2 shows that grown models are more robust to block-le

Reviewer 02 · Rating 4 · Confidence 2

Strengths

- Mechanistic depth. Converging diagnostics (tuned-lens, early-exit, swap/reverse/skip) make a coherent case that growth increases effective depth usage.
- Actionable variant. LIDAS is a lightweight change that preserves or improves reasoning without harming NLL (i.e., token-level negative log-likelihood/perplexity on held-out text), indicating no regression in general language modeling quality.
- Reproducibility. Setups and intervention protocols are described clearly; the narrative is easy to

Weaknesses

- Compute accounting / fairness. Main comparisons fix steps, not FLOPs. Since growth changes training compute, a FLOPs-matched baseline (e.g., truncating baseline steps to ≈77%) is needed to support “efficiency–performance” claims. Error bars (multi-seed) are also missing on the headline numbers.
- Cross-method context. The paper positions growth as a remedy for pre-LN “curse of depth,” yet omits direct comparisons to LayerNorm scaling baselines. A small 2×2 factorial (Pre-LN vs Mix-LN) × (no gr
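The compute-matching point can be made concrete with a rough calculation. The sketch below rests on loudly stated assumptions: per-step cost is taken to be proportional to the number of layers (ignoring embeddings, the output head, and any other depth-independent cost), and the growth schedule is hypothetical, not the one used in the paper, so the result is not expected to reproduce the ≈77% figure quoted by the reviewer.

```python
def relative_training_compute(stages, full_depth):
    """Fraction of fixed-depth training compute used by a growth schedule.

    `stages` is a list of (depth, steps) pairs. Assumption: the cost of one
    training step scales linearly with the number of layers.
    """
    total_steps = sum(steps for _, steps in stages)
    if total_steps == 0 or full_depth <= 0:
        raise ValueError("schedule must contain steps and full_depth must be positive")
    grown_cost = sum(depth * steps for depth, steps in stages)
    fixed_cost = full_depth * total_steps
    return grown_cost / fixed_cost


# Hypothetical schedule: four equal-length stages growing from 8 to 32 layers.
schedule = [(8, 1000), (16, 1000), (24, 1000), (32, 1000)]
ratio = relative_training_compute(schedule, full_depth=32)
print(ratio)  # 0.625
```

Under these assumptions, a compute-matched fixed-depth baseline would train for `ratio` times the step count of the grown model, which is the kind of truncated baseline the reviewer asks for.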

Reviewer 03 · Rating 2 · Confidence 3

Strengths

- Extensive experiments are conducted on the impact of deeper layers, comparing standard depth-fixed models and recent depth-growing models.
- In addition, LIDAS, a variant of the MIDAS method, is proposed and attains superior performance on reasoning-intensive tasks.
- The presentation is easy to follow; the hypothesis, evidence, results, and interpretation are presented clearly.

Weaknesses

This study experimentally collects observations on depth-non-growing and depth-growing models. While I appreciate them, one of the major weaknesses of this work is that the connection between these observations is unclear, and the practical takeaway from them is limited. Depth-fixed models do not fully take advantage of the depth, and deeper layers can be dropped with a subtle cost in performance. This has been known already, and the experiments collect related observations from layer-wise ana

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Reinforcement Learning in Robotics · Explainable Artificial Intelligence (XAI)