GradPruner: Gradient-Guided Layer Pruning Enabling Efficient Fine-Tuning and Inference for LLMs
Wei Huang, Anda Cheng, Yinggui Wang

TL;DR
GradPruner is a gradient-guided layer pruning method that significantly reduces model size and maintains accuracy during fine-tuning of large language models, improving efficiency in training and inference.
Contribution
It introduces a novel gradient-based pruning technique using the IGIA-Matrix to prune LLM layers early in fine-tuning, enhancing efficiency without substantial accuracy loss.
Findings
Achieves a 40% parameter reduction with an accuracy drop of less than 1%.
Effective across diverse datasets including medical, financial, and benchmark tasks.
Reduces training and inference costs for LLMs.
Abstract
Fine-tuning Large Language Models (LLMs) with downstream data is often considered time-consuming and expensive. Structured pruning methods are primarily employed to improve the inference efficiency of pre-trained models. Meanwhile, they often require additional time and memory for training, knowledge distillation, structure search, and other strategies, making efficient model fine-tuning challenging to achieve. To simultaneously enhance the training and inference efficiency of downstream task fine-tuning, we introduce GradPruner, which can prune layers of LLMs guided by gradients in the early stages of fine-tuning. GradPruner uses the cumulative gradients of each parameter during the initial phase of fine-tuning to compute the Initial Gradient Information Accumulation Matrix (IGIA-Matrix) to assess the importance of layers and perform pruning. We sparsify the pruned layers based on the…
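The scoring step described in the abstract can be illustrated with a minimal sketch. It assumes that the IGIA-Matrix accumulates the absolute value of each parameter's gradient over the early steps and that a layer's score is the sum over its linear sublayers. The abstract does not give the exact formula, so the absolute-value choice, the function names, and the toy data are all hypothetical.

```python
from collections import defaultdict


def accumulate_igia(grad_steps):
    """Accumulate |gradient| per parameter over the early fine-tuning steps.

    grad_steps: iterable of dicts mapping (layer_index, sublayer_name) to a
    flat list of gradient values for that sublayer at one step.
    ASSUMPTION: absolute values are accumulated; the paper may define the
    IGIA-Matrix differently.
    """
    igia = {}
    for step in grad_steps:
        for key, grads in step.items():
            acc = igia.setdefault(key, [0.0] * len(grads))
            for i, g in enumerate(grads):
                acc[i] += abs(g)
    return igia


def layer_scores(igia):
    """Score each layer as the sum of its accumulated values over all sublayers."""
    scores = defaultdict(float)
    for (layer, _sublayer), acc in igia.items():
        scores[layer] += sum(acc)
    return dict(scores)


def layers_to_prune(scores, prune_ratio):
    """Return the indices of the lowest-scoring layers (ties broken by index)."""
    n_prune = int(len(scores) * prune_ratio)
    ranked = sorted(scores, key=lambda layer: (scores[layer], layer))
    return sorted(ranked[:n_prune])


# Toy data: three layers, one sublayer each, two early steps (hypothetical values).
toy_steps = [
    {(0, "q_proj"): [1.0, -1.0], (1, "q_proj"): [0.25, -0.25], (2, "q_proj"): [2.0, -2.0]},
    {(0, "q_proj"): [-1.0, 1.0], (1, "q_proj"): [0.25, 0.25], (2, "q_proj"): [0.5, 0.5]},
]
toy_scores = layer_scores(accumulate_igia(toy_steps))
print(toy_scores)                        # {0: 4.0, 1: 1.0, 2: 5.0}
print(layers_to_prune(toy_scores, 0.4))  # [1]
```

In a real setup the per-step gradients would come from the fine-tuning framework (for example, LoRA parameter gradients after each backward pass) rather than from hand-written lists.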
Peer Reviews
Decision: ICLR 2026 Poster
- **Simple, pragmatic pipeline.** Early-step gradient accumulation → IGIA-based layer scoring (sum over linear sublayers) → pruning → sign-based merging. The pruning score is clearly defined.
- **Operationally clear merging.** “Top-p% by IGIA then sign-consistent addition into the preceding kept layer” is straightforward to implement and shown with a framework figure.
- **Broad task coverage with efficiency reporting.** Ablations include number of merged layers and sparsity-rate sweeps for the p
### (A) Insufficient theoretical grounding for **merging** (most important)
- **Self-inconsistency between pruning and merging.** The paper emphasizes that **layers differ in importance** and uses **IGIA** to make importance-aware pruning decisions. However, during **merging**, contributions from pruned layers are **added with equal weight** whenever signs match, **without any sensitivity weighting** (e.g., IGIA- or Fisher-based) for either donor or receiver layers. This disconnect undermines the
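The merging rule the reviewers describe (keep the top-p% of a pruned layer's entries by IGIA, then add them with equal weight into the preceding kept layer wherever signs match) can be sketched as follows. This is an illustration of that description only, not the paper's Equation (5): the function names, the flat-list weight representation, and the handling of zero entries (treated as a sign mismatch) are assumptions.

```python
def sparsify_top_p(weights, igia, keep_ratio):
    """Zero every entry except the top keep_ratio fraction, ranked by IGIA value.

    weights, igia: flat lists of equal length for one pruned (donor) sublayer.
    """
    n_keep = int(len(weights) * keep_ratio)
    order = sorted(range(len(weights)), key=lambda i: igia[i], reverse=True)
    keep = set(order[:n_keep])
    return [w if i in keep else 0.0 for i, w in enumerate(weights)]


def sign_consistent_merge(kept, donor):
    """Add donor entries into the kept layer only where both have the same sign.

    ASSUMPTION: entries with a sign conflict, or a zero on either side, leave
    the kept weight unchanged; matching entries are added with equal weight.
    """
    merged = []
    for k, d in zip(kept, donor):
        same_sign = (k > 0 and d > 0) or (k < 0 and d < 0)
        merged.append(k + d if same_sign else k)
    return merged


# Toy example with hypothetical values.
kept_layer = [1.0, -2.0, 0.5, -0.5]
pruned_layer = [0.5, 1.0, -4.0, -0.25]
pruned_igia = [0.9, 0.1, 0.8, 0.7]

sparse_donor = sparsify_top_p(pruned_layer, pruned_igia, keep_ratio=0.5)
print(sparse_donor)                                     # [0.5, 0.0, -4.0, 0.0]
print(sign_consistent_merge(kept_layer, sparse_donor))  # [1.5, -2.0, 0.5, -0.5]
```

In the toy example the first entry is merged because both values are positive, while the third is skipped because the signs conflict. The sensitivity-weighted variant the reviewer asks for would scale `d` by an importance term before the addition.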
1. The use of IGIA-Matrix computed from <1% of training steps is original and empirically justified by gradient sensitivity analysis.
2. The proposed sign-consistent merging technique effectively preserves accuracy even under 40% pruning.
3. The authors test across multiple domains, model sizes, and fine-tuning regimes with strong baselines, demonstrating robustness.
4. Substantial reductions in both training and inference costs (~35–40%) while maintaining accuracy are practically valuable
1. While the empirical gradient correlation study is convincing, the paper lacks a deeper theoretical analysis of why early gradient accumulation correlates with long-term importance, beyond empirical observation.
2. The method assumes access to LoRA gradients and may not generalize to non-LoRA or adapter-free fine-tuning setups.
3. The layer-importance estimation could behave differently for tasks with varying gradient noise; this is not fully explored.
* GradPruner jointly improves both training and inference efficiency. Unlike many pruning approaches that focus only on inference speed-ups, GradPruner is explicitly designed to reduce fine-tuning time and memory consumption as well. I believe this aligns well with the practical need for LLMs to quickly adapt to new downstream tasks.
* GradPruner is simple to understand yet conceptually novel. The paper introduces a new gradient-based importance metric computed from less than 1% of the early fin
* Limited theoretical grounding. Lines 201–208 simulate the gradient of W via a matrix multiplication, yet the rationale for this simulation is not theoretically justified. In addition, the sign-based merging rule in Equation (5) is presented as a heuristic with little theoretical explanation. Clearer derivations would strengthen the contribution.
* Concern over the stability of early-step gradients. The method relies on gradient statistics collected within the first 1% of fine-tuning steps to e
Taxonomy
Topics: Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education · Topic Modeling
