Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient

Yuan Gao; Zujing Liu; Weizhong Zhang; Bo Du; Gui-Song Xia

arXiv:2406.10576·cs.LG·July 4, 2025

TL;DR

This paper introduces a novel optimization-based structural pruning method for large language models that learns pruning masks via policy gradient without back-propagation, enabling efficient and effective model compression.

Contribution

It proposes a backpropagation-free, probabilistic pruning approach that supports global, heterogeneous pruning and can incorporate metric-based initialization, improving efficiency and flexibility.

Findings

01 · Supports global and heterogeneous pruning across layers

02 · Achieves competitive performance with reduced computational cost

03 · Demonstrates effectiveness on multiple large language models

Abstract

Recent pruning methods for Large Language Models (LLMs) typically operate in the post-training phase without expensive weight finetuning. However, their pruning criteria often rely on heuristic, hand-crafted metrics, potentially leading to suboptimal performance. We instead propose a novel optimization-based structural pruning method that learns the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model. To preserve efficiency, our method eliminates back-propagation through the LLM itself during optimization, requiring only the forward pass of the LLM. We achieve this by learning an underlying Bernoulli distribution from which binary pruning masks are sampled, and we decouple the Bernoulli parameters from the LLM loss, which allows efficient optimization via a policy gradient estimator without back-propagation. Thus, our method can 1) support global and heterogeneous…
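The mechanism described in the abstract (sample binary masks from a Bernoulli distribution, evaluate the loss with forward passes only, and update the Bernoulli parameters with a policy gradient estimator) can be sketched on a toy problem. This is a minimal illustration, not the authors' implementation: the function names `learn_mask_probs` and `make_toy_loss`, the linear stand-in for the LLM, the per-unit sparsity penalty, the moving-average baseline, and all hyperparameters are assumptions made for the example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def learn_mask_probs(loss_fn, n_units, steps=400, lr=0.5, n_samples=32, seed=0):
    """Learn keep-probabilities for n_units using only evaluations of loss_fn.

    loss_fn is treated as a black box (forward pass only); no gradient
    of loss_fn is ever computed.
    """
    rng = np.random.default_rng(seed)
    logits = np.zeros(n_units)  # Bernoulli parameters: p = sigmoid(logits)
    baseline = None             # assumed variance-reduction baseline
    for _ in range(steps):
        probs = sigmoid(logits)
        # Sample binary masks m ~ Bernoulli(p); 1 keeps a unit, 0 prunes it.
        masks = (rng.random((n_samples, n_units)) < probs).astype(float)
        # Forward passes only: one scalar loss per sampled mask.
        losses = np.array([loss_fn(m) for m in masks])
        if baseline is None:
            baseline = losses.mean()
        # Score function: d/d(logit) log Bernoulli(m; sigmoid(logit)) = m - p.
        score = masks - probs
        # REINFORCE estimate of the gradient of the expected loss.
        grad = ((losses - baseline)[:, None] * score).mean(axis=0)
        logits -= lr * grad
        # Moving-average baseline (an assumption of this sketch).
        baseline = 0.9 * baseline + 0.1 * losses.mean()
    return sigmoid(logits)


def make_toy_loss(weights, penalty=1.0, n_rows=128, seed=1):
    """Hypothetical stand-in for an LLM: a linear model whose units can be masked."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_rows, weights.size))
    target = x @ weights  # output of the unpruned model

    def loss_fn(mask):
        pred = x @ (weights * mask)
        # Reconstruction error plus an assumed per-unit cost for keeping a unit.
        return float(np.mean((pred - target) ** 2) + penalty * mask.sum())

    return loss_fn
```

For example, with `weights = np.array([2.0, 2.0, 2.0, 0.0, 0.0, 0.0])`, calling `learn_mask_probs(make_toy_loss(weights), n_units=6)` should drive the keep-probabilities of the three informative units toward 1 and those of the three zero-weight units toward 0, since only the loss values, never their gradients, inform the update.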

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 4

Strengths

- The performance gain is significant.
- The idea of leveraging policy gradient for pruning is novel and insightful.

Weaknesses

- The tables are hard to read. Using figures instead would help visualization.
- How does the method compare in performance with Gumbel-Softmax approaches? The authors primarily focus on heuristic-based comparisons, leaving out Gumbel-Softmax methods. For example, [1] employs Gumbel-Softmax but focuses on semi-structured sparsity. Although the objectives differ slightly, including a performance and cost comparison with Gumbel-based methods would strengthen the study.
- Additionally, the compa

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- This paper proposes a novel method that casts the problem of selecting the optimal pruning mask as a reinforcement learning problem. This avoids the inefficiency of performing computationally intensive back-propagation.
- The authors have conducted extensive experimental evaluation and compared with many existing baseline methods for structural pruning. The results show that the proposed method is a promising approach for structural pruning.

Weaknesses

- It might be good to evaluate a larger LLM, e.g., LLaMA-3-70B.
- The authors stress the memory efficiency of optimizing without back-propagation. It would be good to dedicate one section to comparing the resource consumption of different pruning approaches, especially those that require gradient computation.

Reviewer 03 · Rating 5 · Confidence 4

Strengths

1. The paper presents a novel pruning method that leverages policy gradient estimators instead of back-propagation, addressing key computational challenges in gradient-based LLM pruning methods.
2. The method supports multiple structural granularities (channels, heads, layers), providing flexibility in how the model is pruned. It also allows for global and heterogeneous pruning, which is more aligned with the varying redundancy across layers in LLMs.

Weaknesses

1. While the paper suggests using a policy gradient estimator to bypass back-propagation, policy gradient methods can suffer from high variance, which may lead to unstable training. The paper does propose a variance-reduction technique, but its effectiveness could be further elaborated or validated with more ablation studies. For example, how does the performance of the proposed method compare with results obtained using back-propagation?
2. Following up on Weakness 1, could you clarify t

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems · Topic Modeling

Methods: Pruning · LLaMA