Do Language Models Use Their Depth Efficiently?
Róbert Csordás, Christopher D. Manning, Christopher Potts

TL;DR
This paper investigates whether deeper language models utilize their additional layers for new computations or merely distribute existing ones, finding evidence that they mainly spread the same computations across more layers, leading to diminishing returns.
Contribution
The study provides empirical analysis of residual streams in large language models, revealing that increased depth mainly results in finer adjustments rather than new types of computation.
Findings
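Layers in the second half of the network contribute much less to the residual stream than those in the first half, with a clear phase transition between the two halves.
Skipping layers in the second half has a much smaller effect on future computations and output predictions than skipping earlier layers.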
Models do not show evidence of using depth for multi-hop reasoning or new computations.
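As an illustration of the first finding, one way to compare a sublayer's output with the residual stream is to take the ratio of their L2 norms. The sketch below is a toy example using plain Python lists. The function names, the choice of the L2 norm, and the hand-picked vectors are assumptions made for illustration, not the paper's exact metric or data.

```python
import math

def l2_norm(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in vec))

def relative_contributions(initial, layer_outputs):
    """For each layer, return ||layer output|| / ||residual before that layer||.

    `initial` is the residual stream entering the first layer and
    `layer_outputs` holds the vector each layer adds to it (hypothetical data).
    """
    residual = list(initial)
    ratios = []
    for out in layer_outputs:
        ratios.append(l2_norm(out) / l2_norm(residual))
        # Residual connection: the layer output is added to the stream.
        residual = [r + o for r, o in zip(residual, out)]
    return ratios

# Toy data: a large early update followed by a small late one.
initial = [3.0, 4.0]
layer_outputs = [[3.0, 4.0], [0.6, 0.8]]
ratios = relative_contributions(initial, layer_outputs)
print(ratios)
```

In this toy setup the second ratio is much smaller than the first, which is the pattern the finding describes for second-half layers. The numbers are constructed by hand and say nothing about any real model.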
Abstract
Modern LLMs are increasingly deep, and depth correlates with performance, albeit with diminishing returns. However, do these models use their depth efficiently? Do they compose more features to create higher-order computations that are impossible in shallow models, or do they merely spread the same kinds of computation out over more layers? To address these questions, we analyze the residual stream of the Llama 3.1, Qwen 3, and OLMo 2 family of models. We find: First, comparing the output of the sublayers to the residual stream reveals that layers in the second half contribute much less than those in the first half, with a clear phase transition between the two halves. Second, skipping layers in the second half has a much smaller effect on future computations and output predictions. Third, for multihop tasks, we are unable to find evidence that models are using increased depth to…
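The layer-skipping experiment mentioned in the abstract can be pictured with a toy residual stack: run the stack once in full, run it again with one layer removed, and measure how far the final state moves. The sketch below is a minimal illustration under stated assumptions. The helpers `run`, `skip_effect`, and `scaled` are hypothetical names, the "layers" are simple scaling functions rather than Transformer blocks, and the relative L2 change of the final vector stands in for the paper's measurements on future computations and output predictions.

```python
import math

def run(layers, x, skip=None):
    """Apply residual layers in order, optionally skipping the layer at index `skip`."""
    residual = list(x)
    for i, layer in enumerate(layers):
        if i == skip:
            continue  # a skipped layer adds nothing to the stream
        update = layer(residual)
        residual = [r + u for r, u in zip(residual, update)]
    return residual

def skip_effect(layers, x, skip):
    """Relative L2 change in the final residual when one layer is skipped."""
    full = run(layers, x)
    ablated = run(layers, x, skip=skip)
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(full, ablated)))
    return diff / math.sqrt(sum(a * a for a in full))

def scaled(s):
    """Hypothetical toy layer whose update is the current residual times `s`."""
    return lambda residual: [s * v for v in residual]

# Toy stack: two strong early layers, two weak late layers (chosen by hand).
layers = [scaled(1.0), scaled(1.0), scaled(0.01), scaled(0.01)]
x = [1.0, 2.0]
effects = [skip_effect(layers, x, i) for i in range(len(layers))]
print(effects)
```

Because the late layers here are weak by construction, skipping them barely changes the output. The sketch shows how such a measurement could be set up; it is not evidence for the paper's result.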
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Label Smoothing · Dropout · Adam · Multi-Head Attention · Dense Connections · Layer Normalization · Softmax
