When Fewer Layers Break More Chains: Layer Pruning Harms Test-Time Scaling in LLMs

Keyu Wang; Tian Lyu; Guinan Su; Jonas Geiping; Lu Yin; Marco Canini; Shiwei Liu

arXiv:2510.22228·cs.LG·October 28, 2025

TL;DR

Layer pruning in large language models significantly impairs test-time scaling and long-chain reasoning capabilities, even with minimal layer removal, revealing critical fragility in reasoning tasks that standard fine-tuning cannot fix.

Contribution

This work uncovers the detrimental effects of layer pruning on reasoning in LLMs, especially on test-time scaling, and highlights the need for new pruning strategies that preserve reasoning robustness.

Findings

01. Pruning one or two layers drastically reduces test-time scaling performance.

02. Standard fine-tuning cannot recover reasoning performance after pruning.

03. Layer pruning severely impairs long-chain reasoning in LLMs.

Abstract

Layer pruning has emerged as a widely adopted technique for improving the efficiency of large language models (LLMs). Although existing methods demonstrate strong performance retention on general knowledge tasks, their effect on long-chain reasoning, a more brittle yet crucial capability, remains largely unexplored. In this work, we study the impact of layer pruning on long-chain reasoning through the lens of test-time scaling, a key mechanism in modern LLMs that enables strong reasoning capacity by allocating more computation at inference time. With extensive experiments, we demonstrate that pruning even one or two layers can severely impair test-time scaling, with performance collapsing drastically on long reasoning benchmarks even when performance on knowledge-intensive and shallow reasoning tasks remains stable. Furthermore, we find that standard supervised fine-tuning remedies fail…
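
To make the two ideas in the abstract concrete, here is a minimal toy sketch, not the authors' method or code. It assumes a model can be viewed as an ordered stack of residual layers, so that layer pruning amounts to dropping entries from that stack. It also illustrates test-time scaling with majority voting over several sampled answers, which is only one familiar way of spending more inference compute; the paper may evaluate a different mechanism. All names (`make_toy_stack`, `prune_layers`, `forward`, `majority_vote`) are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Sequence

# A "layer" here is just a function on a float; a hypothetical stand-in
# for a transformer block, used only to illustrate the idea.
Layer = Callable[[float], float]


def make_toy_stack(num_layers: int) -> List[Layer]:
    """Build a toy stack where layer i adds a residual update of 0.1 * (i + 1)."""
    return [lambda x, i=i: x + 0.1 * (i + 1) for i in range(num_layers)]


def prune_layers(layers: Sequence[Layer], drop: Sequence[int]) -> List[Layer]:
    """Return a new stack with the layers at the given indices removed."""
    drop_set = set(drop)
    return [layer for i, layer in enumerate(layers) if i not in drop_set]


def forward(layers: Sequence[Layer], x: float) -> float:
    """Apply the layers in order."""
    for layer in layers:
        x = layer(x)
    return x


def majority_vote(answers: Sequence[str]) -> str:
    """Assumed form of test-time scaling: sample several answers, keep the most frequent."""
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    full = make_toy_stack(4)
    pruned = prune_layers(full, drop=[1, 2])
    print(len(full), len(pruned))
    print(forward(full, 0.0), forward(pruned, 0.0))
    print(majority_vote(["42", "41", "42"]))
```

In this toy setting the pruned stack simply computes a different function than the full one. The abstract's claim is that in real LLMs this mismatch is tolerable on knowledge-intensive tasks but compounds over long reasoning chains.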

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 4·Confidence 3

Strengths

The experiments contain sufficient ablations, covering different models, datasets, pruning methods and evaluation metrics. The overall conclusions are consistent and can well support the main claim that layer pruning hurts the long chain reasoning performance.

Weaknesses

The main conclusion of the paper is simple and rather intuitive. Layer pruning would surely decrease performance, since the model loses part of its parameters and faces OOD problems relative to training. One could naturally expect these results without experiments, and the conclusions are known. The paper provides neither new interesting results nor a solution to the problem. Moreover, layer pruning itself is not a practically meaningful method from my perspective. La

Reviewer 02·Rating 2·Confidence 3

Strengths

The observation that layer pruning does not work effectively for reasoning models (which has become the de-facto model we adopt in the community for experiments) is timely and important.

Weaknesses

1. This paper presents negative results but the explanations or experimental setting to analyze the negative results are limited. For example, the observation that "most layers play a non-trivial role in enabling test-time scaling" is very interesting, but the underlying explanation for whether that is not the case for "non-reasoning models" or what is the reason behind that is very limited. 2. As a follow-up of 1, I think there should be trends of non-reasoning models on the same experimental

Reviewer 03·Rating 8·Confidence 3

Strengths

1) The multiple conclusions identified offer clear reference value for lightweighting reasoning LLMs and for on-device deployment. 2) The experiment covers a diverse spectrum of lightweighting techniques. 3) The work also supplies explicit qualitative and quantitative case analyses for the discovered phenomena.

Weaknesses

1) Experiments have only been conducted on models with fewer than 10B parameters; results would be more convincing if larger-scale models were also included. 2) When exploring supervised fine-tuning as a recovery remedy, incorporating the dominant RL recipes used in current reasoning-model training would further complete the findings of this work.
