Doing More With Less: Revisiting the Effectiveness of LLM Pruning for Test-Time Scaling
Ocean Monjur, Shahriar Kabir Nahin, Anshuman Chhabra

TL;DR
This paper demonstrates that unstructured pruning can enhance test-time scaling performance of reasoning LLMs, challenging the belief that pruning always degrades model effectiveness.
Contribution
It provides extensive empirical evidence that unstructured pruning can outperform full models in reasoning tasks and explores layer-wise sparsity strategies.
Findings
Unstructured pruning improves TTS performance across multiple benchmarks.
Unstructured pruning can outperform unpruned full-weight LLMs.
Layer-wise sparsity strategies significantly impact pruning effectiveness.
Abstract
While current Large Language Models (LLMs) exhibit remarkable reasoning capabilities through test-time compute scaling (TTS), their massive parameter counts and high inference costs have motivated the development of pruning methods that can reduce model size without sacrificing performance. However, for reasoning LLMs specifically, prior work has shown that structured pruning (methods that remove entire sets of layer blocks) significantly degrades TTS reasoning performance. In this work, we revisit this assumption and instead investigate whether unstructured pruning (methods that carefully remove only certain redundant or detrimental weights) exhibits similar limitations. Surprisingly, our extensive experiments across four reasoning benchmarks on two reasoning LLMs, s1.1-7B and Qwen3-8B, consistently show that unstructured pruning augments TTS performance compared to structured pruning, and…
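To make the distinction in the abstract concrete, the following is a minimal sketch contrasting the two pruning styles on a toy weight matrix. It assumes simple magnitude-based selection for the unstructured case, which is only one common criterion and is not claimed to be the method used in the paper. The function names `magnitude_prune` and `drop_blocks` are hypothetical, introduced here for illustration.

```python
def magnitude_prune(weights, sparsity):
    """Unstructured pruning sketch: zero the smallest-magnitude entries.

    Assumption: magnitude is used as the importance score; the paper's
    actual pruning criteria are not specified in this summary.
    """
    # Rank every individual weight by absolute value, remembering its position.
    ranked = sorted(
        (abs(w), i, j)
        for i, row in enumerate(weights)
        for j, w in enumerate(row)
    )
    k = int(len(ranked) * sparsity)  # number of individual weights to zero
    pruned = [list(row) for row in weights]  # copy; the input is left untouched
    for _, i, j in ranked[:k]:
        pruned[i][j] = 0.0
    return pruned


def drop_blocks(layers, block_indices):
    """Structured pruning sketch: remove whole layers (blocks) by index."""
    drop = set(block_indices)
    return [layer for idx, layer in enumerate(layers) if idx not in drop]


layer = [[0.9, -0.05, 0.3],
         [0.01, -0.7, 0.2]]

# Unstructured: same shape, half of the individual weights set to zero.
pruned = magnitude_prune(layer, sparsity=0.5)

# Structured: the middle block of a three-block stack is removed entirely.
remaining = drop_blocks([layer, layer, layer], block_indices=[1])
```

In this sketch the unstructured variant keeps every layer in place and only thins out its weights, while the structured variant shortens the model. A layer-wise sparsity strategy, as mentioned in the findings, would correspond to passing a different `sparsity` value for each layer rather than one global value.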
