TL;DR
Probe Pruning is a dynamic, sample-aware structured pruning method for LLMs that improves inference efficiency without additional training: it probes a small set of crucial hidden states in each batch and prunes weights according to what the probe reveals.
Contribution
The paper presents a novel online, dynamic, structured pruning framework that uses probing to identify important weights, improving LLM efficiency without extra modules or fine-tuning.
Findings
Achieves a 2.56x lower ratio of performance degradation per unit of runtime reduction than the state-of-the-art method at a 40% pruning ratio on LLaMA-2-7B.
Probing consumes only 1.5% of FLOPs, so its overhead is small.
Outperforms state-of-the-art pruning methods in experiments.
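To make the first finding concrete, here is a minimal sketch of how a "performance degradation per unit of runtime reduction" ratio could be computed. The function name, the argument names, and the choice of absolute (rather than relative) differences are assumptions for illustration; the paper's exact definition may differ.

```python
def degradation_per_runtime_reduction(dense_metric, pruned_metric,
                                      dense_runtime, pruned_runtime):
    """Ratio of quality lost to runtime saved (lower is better).

    Assumes a lower-is-better quality metric such as perplexity and
    runtimes measured in the same unit. Hypothetical helper, not the
    paper's code.
    """
    degradation = pruned_metric - dense_metric      # quality lost by pruning
    reduction = dense_runtime - pruned_runtime      # runtime saved by pruning
    if reduction <= 0:
        raise ValueError("pruned model must be faster than the dense model")
    return degradation / reduction


def improvement_factor(ratio_baseline, ratio_method):
    """How many times lower the method's ratio is than the baseline's."""
    return ratio_baseline / ratio_method
```

Under this reading, a claim of "2.56x lower" corresponds to `improvement_factor(...)` returning 2.56 when the baseline is the competing pruning method.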
Abstract
We introduce Probe Pruning (PP), a novel framework for online, dynamic, structured pruning of Large Language Models (LLMs) applied in a batch-wise manner. PP leverages the insight that not all samples and tokens contribute equally to the model's output, and probing a small portion of each batch effectively identifies crucial weights, enabling tailored dynamic pruning for different batches. It comprises three main stages: probing, history-informed pruning, and full inference. In the probing stage, PP selects a small yet crucial set of hidden states, based on residual importance, to run a few model layers ahead. During the history-informed pruning stage, PP strategically integrates the probing states with historical states. Subsequently, it structurally prunes weights based on the integrated states and the PP importance score, a metric developed specifically to assess the importance of…
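The three stages named in the abstract (probing, history-informed pruning, full inference) can be sketched for a single linear layer as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the L2 norm stands in for residual importance, a fixed-weight average stands in for the integration of probing and historical states, and an energy-times-column-norm score stands in for the PP importance score. All function names are hypothetical.

```python
from math import sqrt


def select_probe(hidden, keep_ratio):
    """Probing: keep the tokens (rows) with the largest L2 norm.

    The L2 norm is an assumed proxy for residual importance.
    """
    norms = [sqrt(sum(x * x for x in row)) for row in hidden]
    k = max(1, int(len(hidden) * keep_ratio))
    top = sorted(range(len(hidden)), key=lambda i: norms[i], reverse=True)[:k]
    return [hidden[i] for i in sorted(top)]


def channel_energy(states):
    """Mean squared activation of each input channel over the given tokens."""
    n = len(states)
    return [sum(row[c] ** 2 for row in states) / n
            for c in range(len(states[0]))]


def fuse(probe_energy, history_energy, alpha=0.5):
    """History-informed step: blend probe and historical statistics.

    A fixed-weight average is an assumption made for this sketch.
    """
    return [alpha * p + (1 - alpha) * h
            for p, h in zip(probe_energy, history_energy)]


def prune_channels(weight, energy, prune_ratio):
    """Structured pruning: drop whole input channels with the lowest scores.

    weight is a list of output rows (out x in). The score used here,
    channel energy times squared column norm, is an assumed stand-in for
    the PP importance score.
    """
    n_in = len(weight[0])
    scores = [energy[c] * sum(row[c] ** 2 for row in weight)
              for c in range(n_in)]
    n_keep = n_in - int(n_in * prune_ratio)
    ranked = sorted(range(n_in), key=lambda c: scores[c], reverse=True)
    keep = sorted(ranked[:n_keep])
    return keep, [[row[c] for c in keep] for row in weight]


def forward(hidden, pruned_weight, keep):
    """Full inference on all tokens using only the kept input channels."""
    return [[sum(w[j] * row[c] for j, c in enumerate(keep))
             for w in pruned_weight]
            for row in hidden]


def probe_prune_layer(hidden, weight, history_energy,
                      probe_ratio=0.5, prune_ratio=0.5):
    """Run the three stages in order for one batch and one layer."""
    probe = select_probe(hidden, probe_ratio)
    fused = fuse(channel_energy(probe), history_energy)
    keep, pruned_weight = prune_channels(weight, fused, prune_ratio)
    return forward(hidden, pruned_weight, keep), keep
```

Because the kept channels are recomputed from each batch's probe, two batches with different activation patterns can end up with different pruned weights, which is the sense in which the pruning is dynamic and batch-wise.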
Taxonomy
Methods: OPT · Sparse Evolutionary Training · Pruning
