Compressing LLMs: The Truth is Rarely Pure and Never Simple
Ajay Jaiswal, Zhe Gan, Xianzhi Du, Bowen Zhang, Zhangyang Wang, Yinfei Yang

TL;DR
This paper critically evaluates existing LLM compression methods using a new benchmark, revealing limitations of pruning and quantization in preserving model capabilities beyond perplexity metrics.
Contribution
It introduces LLM-KICK, a comprehensive benchmark for assessing compressed LLMs across multiple tasks, highlighting the shortcomings of current compression techniques.
Findings
Pruning methods degrade performance significantly, even at low sparsity levels.
Quantization methods outperform pruning in maintaining capabilities.
Pruned LLMs remain robust in retrieval and summarization tasks at high sparsity.
Abstract
Despite their remarkable achievements, modern Large Language Models (LLMs) face exorbitant computational and memory footprints. Recently, several works have shown significant success in training-free and data-free compression (pruning and quantization) of LLMs that achieve 50-60% sparsity and reduce the bit width to 3 or 4 bits per weight, with negligible degradation of perplexity over the uncompressed baseline. As recent research efforts are focused on developing increasingly sophisticated compression methods, our work takes a step back and re-evaluates the effectiveness of existing SoTA compression methods, which rely on a fairly simple and widely questioned metric, perplexity (even for dense LLMs). We introduce Knowledge-Intensive Compressed LLM BenchmarK (LLM-KICK), a collection of carefully curated tasks to redefine the evaluation protocol for compressed LLMs, which have…
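As a rough illustration of the terms used in the abstract (not the paper's own code, and not the specific SoTA methods it benchmarks), the sketch below shows plain magnitude pruning to a target sparsity, round-to-nearest uniform quantization to a given bit width, and perplexity computed as the exponential of the mean negative log-likelihood. The function names and the toy weight vector are assumptions made purely for illustration.

```python
import math


def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude.

    Illustrative baseline only; not one of the paper's benchmarked methods.
    """
    k = int(len(weights) * sparsity)  # number of weights to set to zero
    if k == 0:
        return list(weights)
    # Indices ordered by absolute value; the first k are dropped.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def quantize_uniform(weights, bits):
    """Symmetric round-to-nearest quantization, returned in dequantized form."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 7 for 4 bits (integer levels -7..7)
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0:
        return list(weights)
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def perplexity(token_log_probs):
    """exp(mean negative log-likelihood), with natural-log token probabilities."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Hypothetical toy weights, chosen only to make the behaviour visible.
toy_weights = [0.1, -0.5, 0.03, 0.9, -0.02, 0.4]
pruned = magnitude_prune(toy_weights, sparsity=0.5)
quantized = quantize_uniform(toy_weights, bits=4)
```

On the toy vector, 50% sparsity zeroes the three smallest-magnitude weights, and 4-bit quantization leaves at most 15 distinct values. Perplexity as computed here depends only on average next-token likelihood; the paper's benchmark is aimed at capabilities that this single number may not reflect.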
Peer Reviews
Decision·ICLR 2024 poster
- timely ... with an array of papers on compressing LLMs with especially surprising results such as training-free pruning coming out. It is important to enable researchers with better tools of evaluation
- provides a decent array of dataset benchmarks that will be useful in research.
- clearly shows the gap between evaluation of perplexity and other proposed datasets.
Not weaknesses, but suggestions.
1. Add a summarizing table to list dataset statistics.
I think it is important to have a more fine-grained understanding of compression methods, especially to design new algorithms that can improve upon current weaknesses.
- This paper is essentially benchmarking a few algorithms on a few datasets. Although the insights are interesting, the paper does not include any new model, data or algorithm, which I'd say makes this paper more suitable for a workshop, not a full conference paper.
- Some arguments are rather subjective. Why choose the 5% threshold? If we change the threshold to 10% it seems 4-bit quantization is then in the range in most cases, and sparse models can still be "competitive" for around 50% spars
1. Compression of LLMs is very timely and important.
2. The paper reveals new and yet widely unknown gaps in compressed LLMs in comparison to their uncompressed counterparts.
3. The paper shows that compressed models may offer better performance in some tasks (e.g., In-Context Text Summarization) than others (e.g., Factoid-based Question Answering).
4. The authors plan to release their code, which may help in the development of future compression techniques.
1. It would make the conclusions more robust and convincing if the evaluations use more than a single family of LLMs (i.e., Vicuna). Why not repeat these experiments with, e.g., Llama 2 and Falcon?
2. Regarding the observation that even 8-bit quantization has evident gaps with respect to uncompressed models, have the authors considered evaluating LLM.int8()? (https://arxiv.org/pdf/2208.07339.pdf)
3. It would help the reader to have a table summarizing all the tasks' performance over the dif
Taxonomy
Topics: Comparative and International Law Studies · European and International Contract Law · Corporate Governance and Law
Methods: Pruning
