Probe Pruning: Accelerating LLMs through Dynamic Pruning via Model-Probing

Qi Le, Enmao Diao, Ziyan Wang, Xinran Wang, Jie Ding, Li Yang, Ali Anwar

arXiv:2502.15618 · cs.CL · February 24, 2025

TL;DR

Probe Pruning is a dynamic, sample-aware structured pruning method for LLMs that improves inference efficiency without additional training. It probes a small set of crucial hidden states in each batch and uses the result to decide which weights to prune.

Contribution

It presents a novel online, dynamic structured pruning framework that leverages probing to identify important weights, enhancing LLM efficiency without extra modules or fine-tuning.

Findings

01 Achieves a 2.56x lower ratio of performance degradation per unit of runtime reduction than the state-of-the-art method at a 40% pruning ratio on LLaMA-2-7B.

02 Spends only 1.5% of FLOPs on probing, so the probe adds little overhead.

03 Outperforms state-of-the-art pruning methods in the reported experiments.

Abstract

We introduce Probe Pruning (PP), a novel framework for online, dynamic, structured pruning of Large Language Models (LLMs) applied in a batch-wise manner. PP leverages the insight that not all samples and tokens contribute equally to the model's output, and probing a small portion of each batch effectively identifies crucial weights, enabling tailored dynamic pruning for different batches. It comprises three main stages: probing, history-informed pruning, and full inference. In the probing stage, PP selects a small yet crucial set of hidden states, based on residual importance, to run a few model layers ahead. During the history-informed pruning stage, PP strategically integrates the probing states with historical states. Subsequently, it structurally prunes weights based on the integrated states and the PP importance score, a metric developed specifically to assess the importance of…
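The three stages named in the abstract (probing, history-informed pruning, full inference) can be illustrated with a small NumPy sketch on a single MLP block. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the squared-norm token and channel scores, and the fixed blending weight `alpha` are all hypothetical stand-ins. In particular, the paper's PP importance score is not reproduced here, and the sketch selects probe tokens only, not samples.

```python
import numpy as np

def select_probe(hidden, token_ratio=0.1):
    """Keep the tokens with the largest residual norm (assumed importance proxy).
    hidden: (batch, seq, dim) residual-stream states."""
    scores = np.sqrt((hidden ** 2).sum(axis=(0, 2)))       # one score per token position
    k = max(1, int(round(token_ratio * hidden.shape[1])))
    keep = np.sort(np.argsort(scores)[-k:])                # top-k positions, original order
    return hidden[:, keep, :], keep

def channel_importance(states):
    """Squared L2 norm per intermediate channel (stand-in for the PP score)."""
    return (states ** 2).sum(axis=(0, 1))

def fuse_with_history(probe_imp, hist_imp, alpha=0.5):
    """Blend probe statistics with historical ones (assumed simple convex mix)."""
    return alpha * probe_imp + (1.0 - alpha) * hist_imp

def prune_mlp(w_in, w_out, importance, prune_ratio):
    """Structured pruning: drop whole intermediate channels with the lowest scores."""
    n_keep = w_in.shape[1] - int(prune_ratio * w_in.shape[1])
    keep = np.sort(np.argsort(importance)[-n_keep:])
    return w_in[:, keep], w_out[keep, :], keep

rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 16, 8))        # toy batch: 4 samples, 16 tokens, dim 8
w_in = rng.normal(size=(8, 32))             # toy MLP with 32 intermediate channels
w_out = rng.normal(size=(32, 8))
history = np.ones(32)                       # placeholder historical importance

# Stage 1: probing, run only a small slice of the batch through the layer.
probe, probe_idx = select_probe(hidden, token_ratio=0.25)
probe_inter = np.maximum(probe @ w_in, 0.0)

# Stage 2: history-informed pruning.
importance = fuse_with_history(channel_importance(probe_inter), history)
w_in_p, w_out_p, kept = prune_mlp(w_in, w_out, importance, prune_ratio=0.4)

# Stage 3: full inference on the whole batch with the pruned weights.
out = np.maximum(hidden @ w_in_p, 0.0) @ w_out_p
```

Because whole channels are removed, the pruned matrices are smaller dense matrices, which is what lets structured pruning reduce runtime without sparse kernels. The pruning decision is recomputed per batch, so different batches can keep different channels.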

Code & Models

Repositories

qi-le1/probe_pruning · PyTorch · Official

Videos

Probe Pruning: Accelerating LLMs through Dynamic Pruning via Model-Probing · SlidesLive

Taxonomy

Methods: OPT · Sparse Evolutionary Training · Pruning