When Does Sparsity Mitigate the Curse of Depth in LLMs
Dilxat Muhtar, Xinyuan Song, Sebastian Pokutta, Max Zimmer, Nico Pelleriti, Thomas Hofmann, and Shiwei Liu

TL;DR
This paper shows that sparsity, both implicit and explicit, helps mitigate the curse of depth in large language models by regulating variance propagation, leading to better layer utilization and improved downstream task performance.
Contribution
The study reveals sparsity as a key mechanism for enhancing depth utilization in LLMs, supported by experiments and practical guidelines, and introduces a new understanding of sparsity's role beyond efficiency.
Findings
Sparsity reduces output variance and enhances layer differentiation.
Sparsity improves downstream task accuracy by 4.6%.
Implicit and explicit sparsity both contribute to better depth scaling.
Abstract
Recent work has demonstrated the curse of depth in large language models (LLMs), where later layers contribute less to learning and representation than earlier layers. Such under-utilization is linked to the accumulated growth of variance in Pre-Layer Normalization, which can push deep blocks toward near-identity behavior. In this paper, we demonstrate that sparsity, beyond enabling efficiency, acts as a regulator of variance propagation and thereby improves depth utilization. Our investigation covers two sources of sparsity: (i) implicit sparsity, which emerges from training and data conditions, including weight sparsity induced by weight decay and attention sparsity induced by long-context inputs; and (ii) explicit sparsity, which is enforced by architectural design, including key/value-sharing sparsity in Grouped-Query Attention and expert-activation sparsity in Mixture-of-Experts.…
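The variance accumulation described in the abstract can be illustrated with a toy simulation. The sketch below is not the paper's experimental setup: it assumes a stack of random linear blocks with a Pre-LN residual update, x <- x + LN(x) W, and models weight sparsity by zeroing a random fraction of each W without rescaling. The names `layer_norm`, `residual_variance`, and `keep_prob` are hypothetical and chosen only for this example.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned gain or bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def residual_variance(depth, dim=256, batch=512, keep_prob=1.0, seed=0):
    """Track residual-stream variance through a toy Pre-LN stack.

    Each block applies x <- x + LN(x) @ W with a fresh random W.
    keep_prob < 1 zeroes a random subset of W's entries (assumption:
    no rescaling of the surviving weights), a crude stand-in for
    weight sparsity.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((batch, dim))
    variances = [x.var()]
    for _ in range(depth):
        # Entries have variance 1/dim, so a dense block adds roughly unit variance.
        w = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        mask = rng.random((dim, dim)) < keep_prob
        x = x + layer_norm(x) @ (w * mask)
        variances.append(x.var())
    return np.array(variances)


dense = residual_variance(depth=32, keep_prob=1.0)
sparse = residual_variance(depth=32, keep_prob=0.5)

print(f"dense  variance: input {dense[0]:.2f}, after 32 blocks {dense[-1]:.2f}")
print(f"sparse variance: input {sparse[0]:.2f}, after 32 blocks {sparse[-1]:.2f}")
```

In this toy setting the residual-stream variance grows with depth, and the masked stack ends with lower variance than the dense one. It does not model weight decay, long-context attention, Grouped-Query Attention, or Mixture-of-Experts, so it should be read only as intuition for why sparsity can act as a variance regulator.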
Taxonomy
Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques
