Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity
Lu Yin, You Wu, Zhenyu Zhang, Cheng-Yu Hsieh, Yaqing Wang, Yiling Jia, Gen Li, Ajay Jaiswal, Mykola Pechenizkiy, Yi Liang, Michael Bendersky, Zhangyang Wang, Shiwei Liu

TL;DR
This paper introduces OWL, a novel pruning method for LLMs that uses layerwise sparsity ratios based on activation outliers, significantly improving performance at high sparsity levels.
Contribution
The paper proposes a new non-uniform layerwise sparsity approach, OWL, tailored to activation outliers, enhancing pruning effectiveness for large language models.
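As a rough illustration of how layerwise sparsity ratios could be derived from outliers, the sketch below scores each weight with a Wanda-style metric (absolute weight times the norm of its input feature), measures the fraction of scores far above the layer mean, and gives layers with more outliers a lower sparsity while keeping the average at the target. The function names, the threshold multiplier `m`, the spread parameter `lam`, and the exact rescaling are assumptions made for illustration, not the paper's confirmed formulation.

```python
import numpy as np


def outlier_ratio(weight, act_norm, m=5.0):
    """Fraction of weights whose score exceeds m times the layer's mean score.

    weight:   array of shape (out_features, in_features)
    act_norm: array of shape (in_features,), L2 norm of each input feature
    m:        hypothetical threshold multiplier (an assumption)
    """
    # Wanda-style score: |W_ij| * ||X_j||_2 (assumed metric)
    score = np.abs(weight) * act_norm[None, :]
    return float((score > m * score.mean()).mean())


def layerwise_sparsity(ratios, target=0.7, lam=0.08):
    """Map per-layer outlier ratios to per-layer sparsity ratios.

    Layers with a higher outlier ratio receive a lower sparsity. The ratios
    are rescaled to a total spread of 2 * lam and then shifted so that the
    mean sparsity equals `target`. `lam` is a hypothetical hyperparameter.
    """
    d = np.asarray(ratios, dtype=float)
    spread = d.max() - d.min()
    if spread == 0:
        # All layers look alike: fall back to uniform sparsity.
        return np.full_like(d, target)
    scaled = (d - d.min()) / spread * 2 * lam  # values in [0, 2 * lam]
    sparsity = 1.0 - scaled                    # more outliers, less pruning
    sparsity += target - sparsity.mean()       # re-center on the target
    return sparsity


# Example: three layers with made-up outlier ratios.
ratios = [0.10, 0.02, 0.06]
print(layerwise_sparsity(ratios, target=0.7, lam=0.08))
```

The mean of the returned ratios matches the target, so the overall parameter budget is the same as under uniform pruning; only its distribution across layers changes.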
Findings
OWL outperforms state-of-the-art methods Wanda and SparseGPT in perplexity at 70% sparsity.
OWL achieves 2.6x inference speed-up in DeepSparse.
Empirical evaluation across LLaMA-V1 and OPT models demonstrates significant performance gains.
Abstract
Large Language Models (LLMs), renowned for their remarkable performance across diverse domains, present a challenge when it comes to practical deployment due to their colossal model size. In response to this challenge, efforts have been directed toward the application of traditional network pruning techniques to LLMs, uncovering a massive number of parameters that can be pruned in one-shot without hurting performance. Prevailing LLM pruning strategies have consistently adhered to the practice of uniformly pruning all layers at equivalent sparsity, resulting in robust performance. However, this observation stands in contrast to the prevailing trends observed in the field of vision models, where non-uniform layerwise sparsity typically yields stronger results. To understand the underlying reasons for this disparity, we conduct a comprehensive study and discover a strong correlation with…
Peer Reviews
Decision: ICML 2024 Poster
1. The paper is well-written. The content is well-organized. 2. The proposed method achieves promising results under large sparsity.
1. While the paper addresses non-uniform-based pruning methods, the novelty appears to be constrained. The field of model compression has extensively discussed similar approaches. Besides, the authors employ a metric akin to Wanda's, with slight modifications to layerwise sparsity distribution. 2. The paper relies heavily on empirical conclusions without providing a solid theoretical foundation for the proposed method. The authors should offer theoretical proof explaining why non-uniform strategi
**Presentation** Most figures are visually very appealing. **Related work section** The related work gives a concise yet informative summary of the related work in the field (although I find some of them quite unnecessary). **Baselines** The baseline methods for layerwise sparsity are mostly well-selected.
**Clarity (Medium)** Some essential concepts are confusingly defined in the text: What exactly is $D_i$ (in LOD)? I presume that it is the ratio of "outlier-ish" weights, but how can it be over 1 in Figure 1 (left)? **Conclusions from "Empirical Studies I, II, III" (Major)** The overall motivation of the proposed algorithm seems to come from the so-called "empirical studies" in Section 3.2, which attempt to connect the notion of LOD (layerwise outlier distribution) with the layerwise sparsit
Recent efforts have been directed toward the application of traditional network pruning techniques to LLMs, uncovering a massive number of parameters that can be pruned without hurting performance. However, achieving sparsity above 50% for performant LLMs remains an open challenge. In previous quantization literature, it is known that the distribution of token features within LLMs has a strong correlation with the emergence of outliers, defined as features exhibiting significantly greater
A quite apparent limitation of this paper is its exclusive examination of unstructured pruning, without addressing more practically relevant forms of structured pruning, such as layerwise, attention head, or N:M weight sparsity. It remains uncertain whether the proposed layerwise sparsity ratio would maintain its relevance in the context of these alternative pruning approaches. I would very much like to see some preliminary results in that regard. Furthermore, there is a question about the con
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
Methods: OPT · Pruning
