The Unreasonable Ineffectiveness of the Deeper Layers
Andrey Gromov, Kushal Tirumala, Hassan Shapourian, Paolo Glorioso, Daniel A. Roberts

TL;DR
This study probes how much individual layers matter in large language models by pruning blocks of layers and measuring the effect on common question-answering benchmarks. Performance degrades only minimally, suggesting that many deeper layers may be unnecessary for storing the knowledge these tasks require.
Contribution
The paper introduces a layer pruning method combined with finetuning to assess the necessity of layers in LLMs, revealing surprising robustness to layer removal.
Findings
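For some common open-weight models, up to half of the layers can be removed with minimal performance degradation on common question-answering benchmarks, given a small amount of finetuning afterwards.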
Current pretraining may not fully utilize deeper layers for knowledge storage.
Shallow layers might be more critical than previously thought.
Abstract
How is knowledge stored in an LLM's weights? We study this via layer pruning: if removing a certain layer does not affect model performance in common question-answering benchmarks, then the weights in that layer are not necessary for storing the knowledge needed to answer those questions. To find these unnecessary parameters, we identify the optimal block of layers to prune by considering similarity across layers; then, to "heal" the damage, we perform a small amount of finetuning. Surprisingly, with this method we find minimal degradation of performance until after a large fraction (up to half) of the layers are removed for some common open-weight models. From a scientific perspective, the robustness of these LLMs to the deletion of layers implies either that current pretraining methods are not properly leveraging the parameters in the deeper layers of the network or that the shallow…
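The block-selection step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract says only that similarity across layers is considered, so the angular-distance metric, the function names (`angular_distance`, `best_block_to_prune`, `prune_block`), and the layout of `hidden_states` are all assumptions made for the example. The "healing" finetune is not shown.

```python
import numpy as np


def angular_distance(a, b):
    """Mean angular distance, scaled to [0, 1], between matching rows of a and b.

    Assumption: angular distance stands in for the paper's similarity measure.
    """
    cos = np.sum(a * b, axis=-1) / (
        np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    )
    # Clip to guard against floating-point values slightly outside [-1, 1].
    return float(np.mean(np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi))


def best_block_to_prune(hidden_states, n):
    """Pick the block of n consecutive layers whose removal changes the
    representation least.

    hidden_states: list of num_layers + 1 arrays of shape (samples, dim),
        where hidden_states[l] is the input to layer l and the last entry
        is the output of the final layer (hypothetical layout).
    Returns (start, distance): layers start .. start + n - 1 form the block.
    """
    num_layers = len(hidden_states) - 1
    if not 1 <= n <= num_layers:
        raise ValueError("n must be between 1 and the number of layers")
    distances = [
        angular_distance(hidden_states[l], hidden_states[l + n])
        for l in range(num_layers - n + 1)
    ]
    start = int(np.argmin(distances))
    return start, distances[start]


def prune_block(layers, start, n):
    """Return a new layer list with layers start .. start + n - 1 removed."""
    return layers[:start] + layers[start + n:]
```

In practice, `hidden_states` would be collected by running the unpruned model on a sample of text and recording the representation at each layer boundary; the pruned model would then be finetuned briefly to recover any lost performance.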
Peer Reviews
Decision: ICLR 2025 Poster
The work is written with a great focus on clarity, and the main findings are sufficiently supported. The main observation, that layer pruning does not impact the performance of world knowledge and comprehension benchmarks, is relevant to the community. The observation that comparing layers with very intuitive similarity metrics suggests simple and effective layer-pruning strategies is valuable.
The paper would benefit from further justification that the authors' clear and simple strategies lead to results compatible with those of possibly more sophisticated techniques that emerged in previous literature (see the Questions section (a) for explicit examples). Several recent efforts approached the study of LLM representations through geometry. This work would benefit from a more thorough comparison/connection to recent progress in the community along these directions (see the Questions section (
* The paper is clearly written and well-organized, making it easy to follow. The flow of the text, the intuitiveness of the figures, the comprehensive appendix, and their placement all contribute to its readability.
* Despite the simplicity of the proposed pruning method, it achieves substantial gains. Given that the approach is implemented in a discrete, sparse and straightforward manner, it could also serve as an effective tool for understanding LLMs.
* The paper’s title and claims might somewhat overemphasize the findings, which could lead some readers to interpret them as suggesting that 'LLMs don’t require deep layers'. The experiments are primarily focused on relatively simple QA tasks, while the implications of results from more complex reasoning tasks (like GSM8K) aren’t highlighted enough, despite their significance.
* This highlights a related concern as above: given that one of their contributions is a framework for understanding op
1. The problem setup is well-defined, and the writing is clear;
2. The claims are supported by experiments on a wide set of models;
3. The findings are interesting, novel, and potentially useful in practice to reduce the memory footprint of LLMs
The main weakness is the relatively narrow set of downstream benchmark evaluations. The paper's main message relies on multiple-choice QA benchmarks where a single token has to be generated. The authors briefly discuss the case of GSM8K, showing that it is not always convenient to prune the deeper layers. Unfortunately, that discussion is briefly mentioned in line 357, and the plot is only shown in the appendix. I encourage the authors to expand the debate on the limits of applicability of
Code & Models
- 🤗arcee-ai/Mistral-7B-Instruct-v0.2-sliced-24-layermodel· 15 dl· ♡ 715 dl♡ 7
- 🤗blockblockblock/Mistral-7B-Instruct-v0.2-sliced-24-layer-bpw2.25model· 3 dl3 dl
- 🤗blockblockblock/Mistral-7B-Instruct-v0.2-sliced-24-layer-bpw2.5model· 5 dl5 dl
- 🤗blockblockblock/Mistral-7B-Instruct-v0.2-sliced-24-layer-bpw3model· 7 dl7 dl
- 🤗blockblockblock/Mistral-7B-Instruct-v0.2-sliced-24-layer-bpw3.5model· 3 dl3 dl
- 🤗blockblockblock/Mistral-7B-Instruct-v0.2-sliced-24-layer-bpw3.7model· 3 dl3 dl
- 🤗blockblockblock/Mistral-7B-Instruct-v0.2-sliced-24-layer-bpw4model· 3 dl3 dl
- 🤗blockblockblock/Mistral-7B-Instruct-v0.2-sliced-24-layer-bpw4.2model· 2 dl2 dl
- 🤗blockblockblock/Mistral-7B-Instruct-v0.2-sliced-24-layer-bpw4.4model· 4 dl4 dl
- 🤗blockblockblock/Mistral-7B-Instruct-v0.2-sliced-24-layer-bpw4.6model· 3 dl3 dl
Taxonomy
Topics: Advanced Neural Network Applications · Topic Modeling · Multimodal Machine Learning Applications
Methods: Pruning
