When Does Sparsity Mitigate the Curse of Depth in LLMs

Dilxat Muhtar, Xinyuan Song, Sebastian Pokutta, Max Zimmer, Nico Pelleriti, Thomas Hofmann, and Shiwei Liu

arXiv:2603.15389·cs.CL·March 17, 2026

TL;DR

This paper shows that sparsity, both implicit and explicit, helps mitigate the curse of depth in large language models by regulating variance propagation, leading to better layer utilization and improved downstream task performance.

Contribution

The study reveals sparsity as a key mechanism for enhancing depth utilization in LLMs, supported by experiments and practical guidelines, and introduces a new understanding of sparsity's role beyond efficiency.

Findings

1. Sparsity reduces output variance and enhances layer differentiation.

2. Sparsity improves downstream task accuracy by 4.6%.

3. Implicit and explicit sparsity both contribute to better depth scaling.

Abstract

Recent work has demonstrated the curse of depth in large language models (LLMs), where later layers contribute less to learning and representation than earlier layers. Such under-utilization is linked to the accumulated growth of variance in Pre-Layer Normalization, which can push deep blocks toward near-identity behavior. In this paper, we demonstrate that sparsity, beyond enabling efficiency, acts as a regulator of variance propagation and thereby improves depth utilization. Our investigation covers two sources of sparsity: (i) implicit sparsity, which emerges from training and data conditions, including weight sparsity induced by weight decay and attention sparsity induced by long-context inputs; and (ii) explicit sparsity, which is enforced by architectural design, including key/value-sharing sparsity in Grouped-Query Attention and expert-activation sparsity in Mixture-of-Experts.…
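The variance argument in the abstract can be illustrated with a toy simulation. The sketch below is not the paper's method or experimental setup: it assumes a residual stream updated as x ← x + W·LN(x) with freshly sampled Gaussian weights per layer, and it models sparsity as a random mask that zeroes a fraction of the weights without rescaling. The function names (`layer_norm`, `residual_variance`) and all parameter values are hypothetical and chosen only for illustration.

```python
import math
import random


def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def residual_variance(depth, width, sparsity, seed=0):
    """Track residual-stream variance through a toy Pre-LN stack.

    Assumption (not from the paper): each block is a random linear map whose
    entries are N(0, 1/width), and each entry is zeroed independently with
    probability `sparsity`, with no rescaling of the surviving weights.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    variances = []
    for layer in range(depth):
        h = layer_norm(x)  # Pre-LN: normalize the block input only
        out = []
        for row in range(width):
            acc = 0.0
            for j in range(width):
                if rng.random() >= sparsity:  # keep this weight
                    acc += rng.gauss(0.0, 1.0 / math.sqrt(width)) * h[j]
            out.append(acc)
        x = [a + b for a, b in zip(x, out)]  # residual add, no normalization
        mean = sum(x) / width
        variances.append(sum((v - mean) ** 2 for v in x) / width)
    return variances


if __name__ == "__main__":
    dense = residual_variance(depth=24, width=64, sparsity=0.0)
    sparse = residual_variance(depth=24, width=64, sparsity=0.75)
    print("final variance, dense :", round(dense[-1], 2))
    print("final variance, sparse:", round(sparse[-1], 2))
```

In this toy setting each dense block adds roughly unit variance to the stream, so the block's output becomes small relative to the accumulated residual as depth grows, which is one way to picture the near-identity behavior described above. The masked variant adds less variance per block, so the stream grows more slowly. Real models differ in many ways (trained weights, attention, learned normalization gains), so this should be read only as intuition for the mechanism.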

Taxonomy

Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques