The Depth Delusion: Why Transformers Should Be Wider, Not Deeper

Md Muhtasim Munif Fahim; Md Rezaul Karim

arXiv:2601.20994 · cs.LG · January 30, 2026

Open Access

TL;DR

This paper challenges the assumption that deeper transformers are always better. It argues that width should grow faster than depth as compute increases, and that beyond a critical depth, adding layers can increase loss even though the parameter count rises. The authors call this effect the Depth Delusion.

Contribution

The authors introduce architecture-conditioned scaling laws that reveal optimal depth and width relationships, and identify a critical depth phenomenon in transformer models.

Findings

01. Width should grow 2.8x faster than depth for optimal performance.

02. Beyond a critical depth, adding layers increases loss despite more parameters.

03. Empirical validation across 30 architectures shows deeper isn't always better.
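The 2.8x figure in the first finding is a ratio of scaling exponents, not a ratio of absolute sizes. The sketch below, which uses only the exponents quoted in the abstract (D* ~ C^0.12 and W* ~ C^0.34), shows what that means for a tenfold increase in compute. The proportionality constants are not given on this page, so the code reports relative growth only; the function and variable names are illustrative and do not come from the paper.

```python
# Exponents as quoted in the abstract: D* ~ C^0.12, W* ~ C^0.34.
DEPTH_EXPONENT = 0.12
WIDTH_EXPONENT = 0.34


def growth_factor(compute_multiplier: float, exponent: float) -> float:
    """Relative growth of a quantity that scales as C**exponent
    when compute C is multiplied by compute_multiplier."""
    return compute_multiplier ** exponent


# Ratio of exponents: the source of the "2.8x faster" statement.
exponent_ratio = WIDTH_EXPONENT / DEPTH_EXPONENT

# Relative change in optimal depth and width for 10x more compute.
depth_growth = growth_factor(10.0, DEPTH_EXPONENT)
width_growth = growth_factor(10.0, WIDTH_EXPONENT)

print(f"exponent ratio (width / depth): {exponent_ratio:.2f}")
print(f"10x compute -> optimal depth grows by a factor of {depth_growth:.2f}")
print(f"10x compute -> optimal width grows by a factor of {width_growth:.2f}")
```

Under these exponents, ten times more compute raises the optimal depth by roughly a third while the optimal width slightly more than doubles.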

Abstract

Neural scaling laws describe how language model loss decreases with parameters and data, but treat architecture as interchangeable--a billion parameters could arise from a shallow-wide model (10 layers & 8,192 hidden dimension) or a deep-narrow one (80 layers & 2,048 hidden dimension). We propose architecture-conditioned scaling laws decomposing this dependence, finding that optimal depth scales as D* ~ C^0.12 while optimal width scales as W* ~ C^0.34, meaning width should grow 2.8x faster than depth. We discover a critical depth phenomenon: beyond D_crit ~ W^0.44 (sublinear in W), adding layers increases loss despite adding parameters--the Depth Delusion. Empirically, we validate these findings across 30 transformer architectures spanning 17M to 7B parameters, each trained on representative high-compute samples, achieving R^2 = 0.922. Our central finding: at 7B scale, a 64-layer model…
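The critical-depth relation D_crit ~ W^0.44 can be explored the same way. The sketch below is a back-of-the-envelope calculation from the exponents in the abstract, not a reproduction of the authors' fit: it computes how much the critical depth moves when width doubles, and the exponent that D_crit would follow in compute if width tracked W* ~ C^0.34. Treating both power laws as holding at the same time is an assumption made here for illustration, and the names are hypothetical.

```python
# Exponents as quoted in the abstract.
CRIT_EXPONENT = 0.44   # D_crit ~ W^0.44
WIDTH_EXPONENT = 0.34  # W* ~ C^0.34
DEPTH_EXPONENT = 0.12  # D* ~ C^0.12


def critical_depth_ratio(width_multiplier: float) -> float:
    """Relative change in D_crit when width is multiplied by width_multiplier."""
    return width_multiplier ** CRIT_EXPONENT


# Doubling the width raises the critical depth by well under 2x (sublinear).
doubling_ratio = critical_depth_ratio(2.0)

# Assumption: width follows the compute-optimal law, so
# D_crit ~ (C^0.34)^0.44 = C^(0.34 * 0.44).
implied_crit_exponent = WIDTH_EXPONENT * CRIT_EXPONENT

print(f"2x width -> D_crit grows by a factor of {doubling_ratio:.3f}")
print(f"implied D_crit exponent in compute: {implied_crit_exponent:.4f}")
print(f"optimal depth exponent in compute:  {DEPTH_EXPONENT:.4f}")
```

If both relations hold, the implied exponent for D_crit (about 0.15) is larger than the exponent for optimal depth (0.12). The truncated abstract does not state how the authors interpret that gap, so this comparison should be read as arithmetic on the quoted numbers only.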

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications · Ferroelectric and Negative Capacitance Devices