TL;DR
Wanda is a simple, retraining-free pruning method for large language models. It scores each weight by its magnitude multiplied by the corresponding input activation and removes the lowest-scoring weights, inducing sparsity with limited performance loss at moderate sparsity levels.
Contribution
Introduces Wanda, a novel pruning approach that does not require retraining or weight updates, leveraging input activations to prune weights in pretrained LLMs.
Findings
Wanda outperforms magnitude pruning baselines on LLaMA models.
Wanda performs competitively with methods requiring weight updates.
The method is effective across various language benchmarks.
Abstract
As their size increases, Large Language Models (LLMs) are natural candidates for network pruning methods: approaches that drop a subset of network weights while striving to preserve performance. Existing methods, however, require either retraining, which is rarely affordable for billion-scale LLMs, or solving a weight reconstruction problem reliant on second-order information, which may also be computationally expensive. In this paper, we introduce a novel, straightforward yet effective pruning method, termed Wanda (Pruning by Weights and activations), designed to induce sparsity in pretrained LLMs. Motivated by the recent observation of emergent large magnitude features in LLMs, our approach prunes weights with the smallest magnitudes multiplied by the corresponding input activations, on a per-output basis. Notably, Wanda requires no retraining or weight update, and the pruned LLM can…
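The scoring rule described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `wanda_prune`, the toy matrices, and the choice to summarize each input feature by its L2 norm over the calibration tokens are assumptions made here for the example.

```python
import numpy as np

def wanda_prune(W, X, sparsity=0.5):
    """Illustrative Wanda-style pruning of one linear layer (hypothetical helper).

    W: weights, shape (out_features, in_features)
    X: calibration activations, shape (n_tokens, in_features)
    """
    # Assumption: each input feature is summarized by its L2 norm over tokens.
    act_norm = np.linalg.norm(X, axis=0)           # shape (in_features,)
    # Score = |weight| * norm of the matching input feature.
    score = np.abs(W) * act_norm[None, :]
    # Prune per output: the same number of weights is removed from every row.
    k = int(W.shape[1] * sparsity)
    prune_idx = np.argsort(score, axis=1)[:, :k]   # k lowest scores per row
    mask = np.ones_like(W, dtype=bool)
    np.put_along_axis(mask, prune_idx, False, axis=1)
    # Surviving weights are left unchanged (no weight update).
    return np.where(mask, W, 0.0), mask

# Toy example: input feature 0 is large, input feature 1 is tiny.
W = np.array([[1.0, -4.0, 2.0, 0.5],
              [3.0,  0.1, -0.2, 2.0]])
X = np.array([[10.0, 0.1, 1.0, 1.0]])
pruned, mask = wanda_prune(W, X, sparsity=0.5)
# Row 0 keeps 1.0 and 2.0; the large weight -4.0 is pruned because its
# input feature has a tiny norm, which plain magnitude pruning would miss.
```

Ranking within each output row, rather than across the whole matrix, is what "on a per-output basis" refers to here; every row ends up with the same fraction of zeros.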
Peer Reviews
Decision: ICLR 2024 poster
Broadly, the manuscript gives timely insight and intuition for the problem of LLM pruning. Its proposed approach solves several problems with LLM pruning, making it faster, more performant, and simpler. The authors make a helpful connection of their approach (Wanda) to existing work (SparseGPT), showing the similarity of their pruning scores when an assumption is made on the Hessian structure. This helps justify the pruning score used by Wanda, which is surprisingly principled given its simplic
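The connection this review mentions can be written out as a short worked equation. This is a sketch under stated assumptions: the SparseGPT criterion is taken to have the form shown in the first line, the Hessian H = XᵀX + λI is assumed to be diagonal, and the damping term λ is set to zero. The symbols W (weights) and X_j (the j-th input feature across calibration tokens) are notation introduced here.

```latex
% Assumed SparseGPT-style criterion, with H = X^\top X + \lambda I:
S^{\text{SparseGPT}}_{ij} = \left[ \frac{|W|^2}{\operatorname{diag}\!\left(H^{-1}\right)} \right]_{ij}

% Assumption: H is diagonal and \lambda = 0, so (H^{-1})_{jj} = 1 / \lVert X_j \rVert_2^2:
S^{\text{SparseGPT}}_{ij} \approx |W_{ij}|^2 \, \lVert X_j \rVert_2^2
  = \left( |W_{ij}| \cdot \lVert X_j \rVert_2 \right)^2
  = \left( S^{\text{Wanda}}_{ij} \right)^2
```

Because squaring preserves the ordering of non-negative scores, the two criteria would rank weights identically under these assumptions.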
The main weakness of the manuscript is that it leaves unclear the benefits of Wanda to inference speed. While Wanda can accelerate matrix multiplications (Table 6), readers will be left curious about how inference timings are affected by Wanda. As my question below clarifies, this can be easily addressed.
1. The work conducts extensive experiments and demonstrates that the pruned models outperform SparseGPT. 2. The method requires no retraining or weight update for LLMs, and the pruning speed is very fast (in seconds).
This work proposes a simple yet effective one-shot pruning method for LLMs, which has demonstrated superior performance over SparseGPT. However, I have concerns regarding its incremental contributions due to the following reasons: 1. The paper's introduction of a method to estimate weight importance based on both activation and weights does not appear to be novel. Similar concepts have been explored in previous works on LLM quantization, such as the AWQ work [1]. 2. The pruning method proposed i
1. Although Wanda is quite simple and there are similar pruning metrics in the traditional pruning field, it is indeed built upon the consideration of outlier weights in LLMs, rendering the narrative of the article easy to comprehend and follow. Thus, I believe that this paper holds significant value for the community. 2. The proposed Wanda method is highly efficient. Notably, it does not require backpropagation like SparseGPT, enhancing its applicability across various terminals to a considera
1. The authors primarily focus on experiments at a low sparsity rate of 50%, yet at a high sparsity rate (80%), Wanda's performance noticeably lags behind SparseGPT, which somewhat dampens my enthusiasm for this paper. 2. While the authors emphasize efficiency, and Wanda indeed greatly surpasses SparseGPT in efficiency (for example, 0.54 s for Wanda and 203.1 s for SparseGPT when pruning LLaMA-7B), I would like to question whether this disparity in time consumption truly holds value. My point bei
Code & Models
- 🤗 wang7776/Llama-2-7b-chat-hf-10-sparsity · model · 811 dl
- 🤗 wang7776/Llama-2-7b-chat-hf-30-sparsity · model · 793 dl
- 🤗 wang7776/Llama-2-7b-chat-hf-20-sparsity · model · 792 dl
- 🤗 wang7776/Mistral-7B-Instruct-v0.2-sparsity-10 · model · 706 dl
- 🤗 wang7776/vicuna-7b-v1.3-sparsity-20 · model · 18 dl
- 🤗 wang7776/vicuna-7b-v1.3-sparsity-30 · model · 4 dl
- 🤗 wang7776/vicuna-7b-v1.3-sparsity-10 · model · 701 dl
- 🤗 wang7776/Mistral-7B-Instruct-v0.2-sparsity-30-v0.1 · model · 640 dl · ♡ 1
- 🤗 wang7776/Mistral-7B-Instruct-v0.2-sparsity-20-v0.1 · model · 653 dl · ♡ 1
- 🤗 wang7776/Llama-2-7b-chat-hf-20-attention-sparsity · model · 199 dl
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Ferroelectric and Negative Capacitance Devices
Methods: Pruning
