Gradually Compacting Large Language Models for Reasoning Like a Boiling Frog
Yiran Zhao, Shengyang Zhou, Zijian Wu, Tongyan Hu, Yuhui Xu, Rengan Dou, Kenji Kawaguchi, Shafiq Joty, Junnan Li, Michael Qizhe Shieh

TL;DR
This paper introduces a gradual compression method for large language models that reduces size significantly while preserving reasoning performance, using iterative pruning and finetuning to avoid abrupt drops in capability.
Contribution
The paper proposes a novel iterative pruning and finetuning approach, the Prune-Tune Loop (PTL), that compresses models while keeping the performance drop minimal.
Findings
Compresses models to nearly half size with minimal performance drop
Flexible application across different pruning and training strategies
Effective on reasoning, code generation, and other tasks
Abstract
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, but their substantial size often demands significant computational resources. To reduce resource consumption and accelerate inference, it is essential to eliminate redundant parameters without compromising performance. However, conventional pruning methods that directly remove such parameters often lead to a dramatic drop in model performance in reasoning tasks, and require extensive post-training to recover the lost capabilities. In this work, we propose a gradual compacting method that divides the compression process into multiple fine-grained iterations, applying a Prune-Tune Loop (PTL) at each stage to incrementally reduce model size while restoring performance with finetuning. This iterative approach, reminiscent of the "boiling frog" effect, enables the model to be progressively compressed without…
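The loop described in the abstract (prune a little, finetune to recover, repeat) can be illustrated with a toy sketch. This is not the authors' implementation: the "model" here is a hypothetical 16-weight linear regressor, pruning is plain magnitude pruning, and recovery is ordinary gradient descent, whereas the paper works on LLMs and uses recovery strategies such as continual pre-training or RL. All function names (`finetune`, `prune_smallest`, `prune_tune_loop`) and hyperparameters are assumptions made for illustration.

```python
import numpy as np


def finetune(w, mask, X, y, steps=200, lr=0.05):
    """Recover accuracy with gradient descent on MSE, updating only unpruned weights."""
    w = w * mask  # pruned weights are held at zero
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad * mask
    return w


def prune_smallest(w, mask, n_remove):
    """Return a new mask with the n_remove smallest-magnitude surviving weights zeroed."""
    new_mask = mask.copy()
    alive = np.flatnonzero(mask)
    order = np.argsort(np.abs(w[alive]))
    new_mask[alive[order[:n_remove]]] = 0.0
    return new_mask


def prune_tune_loop(w, X, y, n_remove_total, n_iters):
    """Remove n_remove_total weights over n_iters small prune-then-tune steps."""
    mask = np.ones_like(w)
    # Cumulative pruning targets, spread evenly over the iterations.
    targets = np.linspace(0, n_remove_total, n_iters + 1).round().astype(int)
    for n_remove in np.diff(targets):
        mask = prune_smallest(w, mask, n_remove)  # prune a small slice
        w = finetune(w, mask, X, y)               # then recover before pruning again
    return w, mask


# Toy "dense model" (assumed setup): 16 weights, only the first 8 matter for the task.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 16))
true_w = np.concatenate([np.linspace(1.0, 2.0, 8), np.zeros(8)])
y = X @ true_w
w_dense = true_w + 0.05 * rng.normal(size=16)

# Halve the parameter count in four small steps instead of one large one.
w_final, mask_final = prune_tune_loop(w_dense, X, y, n_remove_total=8, n_iters=4)
final_loss = float(np.mean((X @ w_final - y) ** 2))
print(int(mask_final.sum()), final_loss)
```

Setting `n_iters=1` in this sketch gives the one-shot baseline (prune everything at once, then tune), which is the comparison the reviewers refer to as Prune-Once. On a toy problem this easy the two may behave alike; the sketch only shows the control flow, not the paper's empirical gap.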
Peer Reviews
Decision: Submitted to ICLR 2026
- The pruning and tuning loop makes sense to me. It is well-motivated to recover the model's ability after pruning.
- The ablation studies on pruning step size, iterations, and order provide valuable insights into the method's behavior and stability.
- The comparison against the one-shot pruning methods demonstrates their effectiveness in ability recovery.

- I think a significant weakness is the lack of discussion on scalability. The method is only validated on models up to 9B parameters. Given its reliance on multiple rounds of post-training, the computational cost of larger models (e.g., 70B+) remains a critical question.
- The method appears sensitive and requires per-model tuning. The need for different recovery strategies (RL for Qwen vs. Continual Pre-training for others) and the significant performance variance with different pruning step

1. The paper is well-written and easy to follow.
2. The core idea of the PTL method (progressive compression to avoid sudden performance degradation) is reasonable. Compared with one-time pruning (e.g., Prune-Once), this method demonstrates better performance recovery.
3. The paper conducts extensive tests on multiple open-source models and benchmarks, including models such as Llama3-8B, Qwen2.5-7B, and Gemma2-9B, as well as mathematical reasoning benchmarks (GSM8K, Minerva Math, MATH-500) and the

1. In my view, the biggest issue with this paper is lack of novelty. Progressive pruning and fine-tuning are not novel concepts; they were widely applied during the era of Convolutional Neural Networks. For instance, the idea of progressive pruning and fine-tuning was proposed quite early in [1]. The authors merely extended progressive pruning and fine-tuning to the pruning of LLMs, and the continuous pre-training and reinforcement learning methods they adopted are also existing algorithms.
2. Th

1. The paper demonstrates impressive compression ratios (30-40% parameter reduction) with minimal performance loss across multiple state-of-the-art models. Particularly notable is the Gemma2-9B result where PTL is the only method maintaining near-original performance after aggressive pruning.
2. The evaluation spans three different model architectures, multiple mathematical reasoning benchmarks, and includes code generation tasks. The ablation studies examining pruning iterations, step sizes, an

1. The core contribution is essentially iterative application of existing techniques (magnitude-based pruning, importance scoring via activation norms, and recovery fine-tuning). The "Prune-Tune Loop" is not fundamentally different from gradual magnitude pruning approaches that have been explored in the literature.
2. The paper lacks any theoretical analysis. There are no convergence guarantees, compression bounds, or formal analysis of why multiple small pruning steps outperform single large st
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
