Gradually Compacting Large Language Models for Reasoning Like a Boiling Frog

Yiran Zhao; Shengyang Zhou; Zijian Wu; Tongyan Hu; Yuhui Xu; Rengan Dou; Kenji Kawaguchi; Shafiq Joty; Junnan Li; Michael Qizhe Shieh

arXiv:2602.04919·cs.LG·February 6, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces a gradual compression method for large language models that reduces size significantly while preserving reasoning performance, using iterative pruning and finetuning to avoid abrupt drops in capability.

Contribution

The paper proposes an iterative pruning and finetuning approach, the Prune-Tune Loop (PTL), that compresses a model in small steps and restores performance after each step, so that the final model is much smaller with minimal performance loss.

Findings

01. Compresses models to nearly half their original size with minimal performance drop

02. Applies flexibly across different pruning and training strategies

03. Is effective on reasoning, code generation, and other tasks

Abstract

Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, but their substantial size often demands significant computational resources. To reduce resource consumption and accelerate inference, it is essential to eliminate redundant parameters without compromising performance. However, conventional pruning methods that directly remove such parameters often lead to a dramatic drop in model performance in reasoning tasks, and require extensive post-training to recover the lost capabilities. In this work, we propose a gradual compacting method that divides the compression process into multiple fine-grained iterations, applying a Prune-Tune Loop (PTL) at each stage to incrementally reduce model size while restoring performance with finetuning. This iterative approach, reminiscent of the "boiling frog" effect, enables the model to be progressively compressed without…
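The loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a small linear model in place of an LLM, magnitude-based pruning as a stand-in for the paper's pruning criterion, and plain SGD as a stand-in for the recovery finetuning. The names `prune_step`, `finetune`, and `prune_tune_loop`, along with all parameter values, are hypothetical and chosen only for illustration.

```python
import random


def predict(weights, x):
    """Linear model output for one input vector."""
    return sum(w * xi for w, xi in zip(weights, x))


def mse(weights, data):
    """Mean squared error over the dataset."""
    return sum((predict(weights, x) - y) ** 2 for x, y in data) / len(data)


def prune_step(weights, mask, fraction):
    """Zero out the smallest-magnitude `fraction` of the still-active weights.

    Assumption: magnitude pruning stands in for whatever criterion is used.
    """
    active = [i for i, keep in enumerate(mask) if keep]
    k = int(len(active) * fraction)
    for i in sorted(active, key=lambda j: abs(weights[j]))[:k]:
        mask[i] = False
        weights[i] = 0.0


def finetune(weights, mask, data, lr=0.05, epochs=50):
    """Recover performance with SGD, updating only the unpruned weights."""
    for _ in range(epochs):
        for x, y in data:
            err = predict(weights, x) - y
            for i, xi in enumerate(x):
                if mask[i]:
                    weights[i] -= lr * err * xi


def prune_tune_loop(weights, data, step_fraction=0.1, iterations=6):
    """Alternate a small pruning step with a recovery finetuning phase."""
    mask = [True] * len(weights)
    history = []
    for _ in range(iterations):
        prune_step(weights, mask, step_fraction)
        finetune(weights, mask, data)
        history.append((sum(mask), mse(weights, data)))
    return mask, history


# Toy data: only the first 5 of 20 input features matter.
random.seed(0)
dim = 20
true_w = [1.0 if i < 5 else 0.0 for i in range(dim)]
data = []
for _ in range(100):
    x = [random.uniform(-1, 1) for _ in range(dim)]
    data.append((x, predict(true_w, x)))

weights = [random.uniform(-1, 1) for _ in range(dim)]
mask, history = prune_tune_loop(weights, data)
for kept, err in history:
    print(f"active weights: {kept:2d}  mse: {err:.4f}")
```

Each iteration removes only a small share of the remaining weights and then retrains the survivors, so the model never has to recover from one large removal. A one-shot baseline in this toy setting would correspond to a single call to `prune_step` with a large `fraction`, followed by a single `finetune`.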

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

- The pruning and tuning loop makes sense to me. It is well-motivated to recover the model's ability after pruning.
- The ablation studies on pruning step size, iterations, and order provide valuable insights into the method's behavior and stability.
- The comparison against the one-shot pruning methods demonstrates their effectiveness in ability recovery.

Weaknesses

- I think a significant weakness is the lack of discussion on scalability. The method is only validated on models up to 9B parameters. Given its reliance on multiple rounds of post-training, the computational cost of larger models (e.g., 70B+) remains a critical question.
- The method appears sensitive and requires per-model tuning. The need for different recovery strategies (RL for Qwen vs. Continual Pre-training for others) and the significant performance variance with different pruning step

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. The paper is well-written and easy to follow.
2. The core idea of the PTL method (progressive compression to avoid sudden performance degradation) is reasonable. Compared with one-time pruning (e.g., Prune-Once), this method demonstrates better performance recovery.
3. The paper conducts extensive tests on multiple open-source models and benchmarks, including models such as Llama3-8B, Qwen2.5-7B, and Gemma2-9B, as well as mathematical reasoning benchmarks (GSM8K, Minerva Math, MATH-500) and the

Weaknesses

1. In my view, the biggest issue with this paper is lack of novelty. Progressive pruning and fine-tuning are not novel concepts; they were widely applied during the era of Convolutional Neural Networks. For instance, the idea of progressive pruning and fine-tuning was proposed quite early in [1]. The authors merely extended progressive pruning and fine-tuning to the pruning of LLMs, and the continuous pre-training and reinforcement learning methods they adopted are also existing algorithms.
2. Th

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. The paper demonstrates impressive compression ratios (30-40% parameter reduction) with minimal performance loss across multiple state-of-the-art models. Particularly notable is the Gemma2-9B result where PTL is the only method maintaining near-original performance after aggressive pruning.
2. The evaluation spans three different model architectures, multiple mathematical reasoning benchmarks, and includes code generation tasks. The ablation studies examining pruning iterations, step sizes, an

Weaknesses

1. The core contribution is essentially iterative application of existing techniques (magnitude-based pruning, importance scoring via activation norms, and recovery fine-tuning). The "Prune-Tune Loop" is not fundamentally different from gradual magnitude pruning approaches that have been explored in the literature.
2. The paper lacks any theoretical analysis. There are no convergence guarantees, compression bounds, or formal analysis of why multiple small pruning steps outperform single large st

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques