Inverse Depth Scaling From Most Layers Being Similar
Yizhou Liu, Sara Kangaslahti, Ziming Liu, Jeff Gore

TL;DR
This paper examines how depth affects loss in large language models. It finds that loss is inversely proportional to depth, which suggests that most layers are functionally similar and reduce error by ensemble averaging; more efficient architectures may need to encourage compositional use of depth.
Contribution
It quantifies how depth affects loss through analysis of LLMs and toy residual networks, attributes the inverse scaling to functionally similar layers, and argues that architectural innovations are needed to use depth more efficiently.
Findings
Loss scales inversely with depth in LLMs
Layers tend to be functionally similar, reducing error via ensemble effects
Current residual architectures may limit the potential of depth
Abstract
Neural scaling laws relate loss to model size in large language models (LLMs), yet depth and width may contribute to performance differently, requiring more detailed studies. Here, we quantify how depth affects loss via analysis of LLMs and toy residual networks. We find that loss is inversely proportional to depth in LLMs, probably due to functionally similar layers reducing error through ensemble averaging rather than compositional learning or discretizing smooth dynamics. This regime is inefficient yet robust and may arise from the architectural bias of residual networks and target functions incompatible with smooth dynamics. The findings suggest that improving LLM efficiency may require architectural innovations to encourage compositional use of depth.
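The ensemble-averaging intuition in the abstract can be illustrated with a small simulation. This is a minimal sketch and not the authors' model: it assumes each "layer" produces an independent noisy estimate of the same target, and that the network output is the plain average of those estimates. Under that assumption the mean squared error of the average is the per-layer noise variance divided by the number of layers, so the product of depth and error should stay roughly constant. The function name `ensemble_mse` and all of its parameters are hypothetical and chosen only for illustration.

```python
import random
import statistics


def ensemble_mse(depth, trials=4000, noise_std=1.0, target=0.0, seed=0):
    """Mean squared error when `depth` independent noisy estimates are averaged.

    Assumption (not from the paper): every layer estimates the same target
    with independent Gaussian noise of standard deviation `noise_std`.
    """
    rng = random.Random(seed)
    squared_errors = []
    for _ in range(trials):
        # Each hypothetical layer outputs the target plus independent noise.
        estimates = [target + rng.gauss(0.0, noise_std) for _ in range(depth)]
        # Functionally similar layers are modelled as a simple average.
        prediction = sum(estimates) / depth
        squared_errors.append((prediction - target) ** 2)
    return statistics.fmean(squared_errors)


if __name__ == "__main__":
    for depth in (1, 2, 4, 8, 16, 32):
        mse = ensemble_mse(depth)
        # If error scales as 1/depth, depth * mse stays near noise_std ** 2.
        print(f"depth={depth:2d}  mse={mse:.4f}  depth*mse={depth * mse:.3f}")
```

Running the script prints the error at several depths alongside `depth * mse`; a roughly flat last column is what inverse depth scaling looks like in this toy setting. Correlated layer errors, which the sketch ignores, would weaken the effect.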
Taxonomy
Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques
