The Depth Delusion: Why Transformers Should Be Wider, Not Deeper
Md Muhtasim Munif Fahim, Md Rezaul Karim

TL;DR
This paper challenges the assumption that deeper transformers are always better. It argues that width should scale faster than depth as compute grows, and that beyond a critical depth adding layers can increase loss, a phenomenon the authors call the Depth Delusion.
Contribution
The authors introduce architecture-conditioned scaling laws that reveal optimal depth and width relationships, and identify a critical depth phenomenon in transformer models.
Findings
Optimal width grows about 2.8x faster than optimal depth as compute increases (W* ~ C^0.34 versus D* ~ C^0.12).
Beyond a critical depth (D_crit ~ W^0.44), adding layers increases loss despite adding parameters.
Experiments on 30 transformer architectures, from 17M to 7B parameters, support these scaling laws (R^2 = 0.922).
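The 2.8x figure can be read as the ratio of the two compute exponents reported in the abstract (0.34 for width, 0.12 for depth). The following is a minimal Python sketch of that arithmetic, not the authors' code: the constant and function names are illustrative, and the proportionality constants of the scaling laws are not given in this summary, so only relative growth is computed.

```python
# Exponents as reported in the abstract; all names here are illustrative.
DEPTH_EXPONENT = 0.12  # D* ~ C^0.12
WIDTH_EXPONENT = 0.34  # W* ~ C^0.34


def growth_factor(compute_multiplier: float, exponent: float) -> float:
    """Relative growth of a quantity scaling as C^exponent when C is multiplied."""
    return compute_multiplier ** exponent


# Ratio of exponents: how much faster width grows than depth in log space.
exponent_ratio = WIDTH_EXPONENT / DEPTH_EXPONENT

if __name__ == "__main__":
    print(f"width/depth exponent ratio: {exponent_ratio:.2f}")
    for multiplier in (10, 100, 1000):
        depth_growth = growth_factor(multiplier, DEPTH_EXPONENT)
        width_growth = growth_factor(multiplier, WIDTH_EXPONENT)
        print(
            f"{multiplier}x compute -> depth x{depth_growth:.2f}, "
            f"width x{width_growth:.2f}"
        )
```

Under these exponents, a 10x increase in compute corresponds to roughly 1.3x more layers but roughly 2.2x more hidden width.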
Abstract
Neural scaling laws describe how language model loss decreases with parameters and data, but treat architecture as interchangeable--a billion parameters could arise from a shallow-wide model (10 layers & 8,192 hidden dimension) or a deep-narrow one (80 layers & 2,048 hidden dimension). We propose architecture-conditioned scaling laws decomposing this dependence, finding that optimal depth scales as D* ~ C^0.12 while optimal width scales as W* ~ C^0.34, meaning width should grow 2.8x faster than depth. We discover a critical depth phenomenon: beyond D_crit ~ W^0.44 (sublinear in W), adding layers increases loss despite adding parameters--the Depth Delusion. Empirically, we validate these findings across 30 transformer architectures spanning 17M to 7B parameters, each trained on representative high-compute samples, achieving R^2 = 0.922. Our central finding: at 7B scale, a 64-layer model…
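To illustrate what "sublinear in W" means for the critical depth, here is a small Python sketch based only on the reported exponent 0.44. It is an assumption-laden illustration rather than the paper's method: the prefactor in D_crit ~ W^0.44 is not stated in this excerpt, so the hypothetical helper `critical_depth_ratio` returns only the relative change in D_crit when width is scaled.

```python
# Exponent as reported in the abstract: D_crit ~ W^0.44.
CRITICAL_DEPTH_EXPONENT = 0.44


def critical_depth_ratio(width_multiplier: float) -> float:
    """Factor by which D_crit changes when width is multiplied.

    Only the ratio is computed because the prefactor of the scaling law
    is not given in this excerpt.
    """
    return width_multiplier ** CRITICAL_DEPTH_EXPONENT


if __name__ == "__main__":
    for multiplier in (2, 4, 8):
        ratio = critical_depth_ratio(multiplier)
        print(f"{multiplier}x width -> critical depth x{ratio:.2f}")
```

Doubling the width raises the critical depth by only about 1.36x, so the number of layers that can be added before loss increases grows more slowly than the width itself.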
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications · Ferroelectric and Negative Capacitance Devices
