A Simple and Effective Pruning Approach for Large Language Models

Mingjie Sun; Zhuang Liu; Anna Bair; J. Zico Kolter

arXiv:2306.11695·cs.CL·May 7, 2024·52 cites

TL;DR

Wanda is a simple, retraining-free pruning method for large language models. It scores each weight by its magnitude multiplied by the corresponding input activation and prunes the lowest-scoring weights on a per-output basis, inducing sparsity in pretrained LLMs without any weight update.

Contribution

Introduces Wanda, a novel pruning approach that does not require retraining or weight updates, leveraging input activations to prune weights in pretrained LLMs.

Findings

01 Wanda outperforms magnitude pruning baselines on LLaMA models.

02 Wanda performs competitively with methods requiring weight updates.

03 The method is effective across various language benchmarks.

Abstract

As their size increases, Large Language Models (LLMs) are natural candidates for network pruning methods: approaches that drop a subset of network weights while striving to preserve performance. Existing methods, however, require either retraining, which is rarely affordable for billion-scale LLMs, or solving a weight reconstruction problem reliant on second-order information, which may also be computationally expensive. In this paper, we introduce a novel, straightforward yet effective pruning method, termed Wanda (Pruning by Weights and activations), designed to induce sparsity in pretrained LLMs. Motivated by the recent observation of emergent large magnitude features in LLMs, our approach prunes weights with the smallest magnitudes multiplied by the corresponding input activations, on a per-output basis. Notably, Wanda requires no retraining or weight update, and the pruned LLM can…
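
The scoring rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `wanda_scores` and `wanda_prune`, the use of an L2 norm over calibration tokens to summarize each input activation, and the toy matrices are all assumptions made for illustration.

```python
import math


def wanda_scores(weights, activations):
    """Score each weight as |W[i][j]| * ||X[:, j]||_2 (assumed form of the metric).

    weights: rows are output neurons, columns are input features.
    activations: rows are calibration tokens, columns are input features.
    """
    n_in = len(weights[0])
    # L2 norm of each input feature, taken across all calibration tokens.
    norms = [math.sqrt(sum(row[j] ** 2 for row in activations)) for j in range(n_in)]
    return [[abs(w) * norms[j] for j, w in enumerate(row)] for row in weights]


def wanda_prune(weights, activations, sparsity=0.5):
    """Zero the lowest-scoring fraction of weights within each output row."""
    scores = wanda_scores(weights, activations)
    pruned = []
    for row, row_scores in zip(weights, scores):
        n_drop = int(len(row) * sparsity)
        # Weights compete only within their own row (per-output comparison).
        order = sorted(range(len(row)), key=lambda j: row_scores[j])
        drop = set(order[:n_drop])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned


# Hypothetical toy layer: 2 outputs, 4 inputs; input feature 3 has large activations.
W = [[4.0, 1.0, -3.0, 0.5],
     [0.2, -2.0, 0.1, 1.0]]
X = [[0.1, 1.0, 0.1, 3.0],
     [0.1, 1.0, 0.1, 4.0]]

pruned = wanda_prune(W, X, sparsity=0.5)
# In row 0 the large weights 4.0 and -3.0 are removed because their inputs are
# small, whereas plain magnitude pruning would have kept them.
print(pruned)
```

No weight is updated after masking, which matches the abstract's claim that the method needs no retraining or weight update.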

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 8 · accept, good paper · Confidence 4

Strengths

Broadly, the manuscript gives timely insight and intuition for the problem of LLM pruning. Its proposed approach solves several problems with LLM pruning, making it faster, more performant, and simpler. The authors make a helpful connection of their approach (Wanda) to existing work (SparseGPT), showing the similarity of their pruning scores when an assumption is made on the Hessian structure. This helps justify the pruning score used by Wanda, which is surprisingly principled given its simplic
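
The connection the reviewer refers to can be sketched as follows. This is a reconstruction rather than text from the review: the symbols ($W$ for the weight matrix, $X$ for the calibration inputs, $\lambda$ for a damping term) and the specific assumption of a diagonal Hessian with $\lambda = 0$ are assumptions based on the usual form of second-order pruning scores.

```latex
% Second-order (SparseGPT-style) saliency of weight W_{ij}, with H = X^\top X + \lambda I:
S_{ij} = \frac{|W_{ij}|^2}{\left[ (X^\top X + \lambda I)^{-1} \right]_{jj}}

% Assume \lambda = 0 and keep only the diagonal of X^\top X, so that
% \left[ \operatorname{diag}(X^\top X)^{-1} \right]_{jj} = \frac{1}{\lVert X_j \rVert_2^2}.
% The score then reduces to
S_{ij} \approx |W_{ij}|^2 \, \lVert X_j \rVert_2^2
       = \left( |W_{ij}| \cdot \lVert X_j \rVert_2 \right)^2
```

Under these assumptions the second-order score is the square of a magnitude-times-activation-norm score, so both induce the same ranking of weights.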

Weaknesses

The main weakness of the manuscript is that it leaves unclear the benefits of Wanda to inference speed. While Wanda can accelerate matrix multiplications (Table 6), readers will be left curious about how inference timings are affected by Wanda. As my question below clarifies, this can be easily addressed.

Reviewer 02 · Rating 5 · marginally below the acceptance threshold · Confidence 4

Strengths

1. The work conducts extensive experiments and demonstrates that the pruned models outperform SparseGPT. 2. The method requires no retraining or weight update for LLMs, and pruning is very fast (on the order of seconds).

Weaknesses

This work proposes a simple yet effective one-shot pruning method for LLMs, which has demonstrated superior performance over SparseGPT. However, I have concerns regarding its incremental contributions, for the following reasons: 1. The paper's method of estimating weight importance from both activations and weights does not appear to be novel. Similar concepts have been explored in previous work on LLM quantization, such as AWQ [1]. 2. The pruning method proposed i

Reviewer 03 · Rating 6 · marginally above the acceptance threshold · Confidence 4

Strengths

1. Although Wanda is quite simple and there are similar pruning metrics in the traditional pruning field, it is indeed built upon the consideration of outlier weights in LLMs, rendering the narrative of the article easy to comprehend and follow. Thus, I believe that this paper holds significant value for the community. 2. The proposed Wanda method is highly efficient. Notably, it does not require backpropagation like SparseGPT, enhancing its applicability across various terminals to a considera

Weaknesses

1. The authors primarily focus on experiments at a low sparsity rate of 50%, yet at a high sparsity rate (80%), Wanda's performance noticeably lags behind SparseGPT, which somewhat dampens my enthusiasm for this paper. 2. While the authors emphasize efficiency, and Wanda indeed greatly surpasses SparseGPT in efficiency (for example, 0.54s for Wanda and 203.1s for SparseGPT when pruning LLaMA-7B), I would like to question whether this disparity in time consumption truly holds value. My point bei

Videos

A Simple and Effective Pruning Approach for Large Language Models· slideslive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Ferroelectric and Negative Capacitance Devices

Methods: Pruning