TL;DR
Layer pruning effectively compresses large language models for classification but significantly impairs generative reasoning capabilities, with limited recovery even after finetuning on large datasets.
Contribution
This paper demonstrates the fundamental limitations of layer pruning for preserving generative reasoning in large language models, highlighting the difficulty of restoring reasoning skills post-pruning.
Findings
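Pruning causes loss of key algorithmic capabilities, including arithmetic computation and balanced parenthesis generation.
Supervised finetuning on self-generated responses recovers up to 90% of baseline performance on classification tasks, but recovery on generative reasoning remains limited.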
Even extensive post-training on large datasets fails to restore original reasoning abilities.
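One of the capabilities named above, balanced parenthesis generation, can be scored mechanically. The sketch below is illustrative only and is not taken from the paper: the function names `is_balanced` and `balanced_rate` are hypothetical, and the choice to check round, square, and curly brackets is an assumption.

```python
def is_balanced(text: str) -> bool:
    """Return True if every bracket in `text` is closed in the right order."""
    pairs = {")": "(", "]": "[", "}": "{"}
    openers = set(pairs.values())
    stack = []
    for ch in text:
        if ch in openers:
            stack.append(ch)
        elif ch in pairs:
            # A closer must match the most recent unclosed opener.
            if not stack or stack.pop() != pairs[ch]:
                return False
    # Any opener left on the stack was never closed.
    return not stack


def balanced_rate(generations: list[str]) -> float:
    """Fraction of generated strings whose brackets are balanced (hypothetical metric)."""
    if not generations:
        return 0.0
    return sum(is_balanced(g) for g in generations) / len(generations)
```

Comparing `balanced_rate` on outputs from an unpruned and a pruned model would give one rough view of how much of this skill survives pruning.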
Abstract
Recent work has shown that layer pruning can effectively compress large language models (LLMs) while retaining strong performance on classification benchmarks, often with little or no finetuning. In contrast, generative reasoning tasks, such as GSM8K and HumanEval+, exhibit substantially weaker recovery. We show that beyond surface-level text degradation, pruning leads to a loss of key algorithmic capabilities, including arithmetic computation and balanced parenthesis generation. Under realistic post-training constraints, without access to pretraining-scale data or compute, we evaluate a minimal recovery strategy based on supervised finetuning with self-generated responses. This approach recovers up to 90% of baseline performance on classification tasks, but recovery for generative reasoning remains fundamentally limited. Notably, even models finetuned on 400B…
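For readers unfamiliar with the setup, the sketch below shows what layer pruning and "percent of baseline recovered" can look like in the simplest case. It is not the paper's method. It assumes a contiguous block of layers is dropped (the abstract does not say how layers are selected) and that recovery is the pruned score divided by the baseline score. The names `prune_layers` and `relative_recovery` are hypothetical.

```python
def prune_layers(layers: list, start: int, count: int) -> list:
    """Return a new layer stack with `count` consecutive layers removed from index `start`.

    Assumption: the pruned layers form one contiguous block.
    """
    if count < 0 or start < 0 or start + count > len(layers):
        raise ValueError("pruned block must lie inside the layer stack")
    return layers[:start] + layers[start + count:]


def relative_recovery(pruned_score: float, baseline_score: float) -> float:
    """Pruned-model score as a fraction of the unpruned baseline (assumed definition)."""
    if baseline_score == 0:
        raise ValueError("baseline score must be non-zero")
    return pruned_score / baseline_score


# Usage: drop layers 4 and 5 from a toy 8-layer stack.
toy_stack = [f"layer_{i}" for i in range(8)]
pruned_stack = prune_layers(toy_stack, start=4, count=2)
```

The toy stack holds strings. In a real model the list elements would be transformer blocks, and the pruned model would then be finetuned before scoring.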
