Compressing LLMs: The Truth is Rarely Pure and Never Simple

Ajay Jaiswal; Zhe Gan; Xianzhi Du; Bowen Zhang; Zhangyang Wang; Yinfei Yang

arXiv:2310.01382 · cs.CL · March 19, 2024 · 2 citations

TL;DR

This paper critically evaluates existing LLM compression methods using a new benchmark, revealing limitations of pruning and quantization in preserving model capabilities beyond perplexity metrics.

Contribution

It introduces LLM-KICK, a comprehensive benchmark for assessing compressed LLMs across multiple tasks, highlighting the shortcomings of current compression techniques.

Findings

01. Pruning methods suffer significant performance degradation, sometimes at trivially low sparsity ratios (e.g., 25-30%).

02. Quantization methods preserve model capabilities better than pruning methods.

03. Pruned LLMs remain robust for in-context retrieval and summarization, even at 50% sparsity or higher.
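The findings above are stated in terms of sparsity, the fraction of weights set to zero. As a minimal sketch of what a target sparsity means, the code below applies one-shot, unstructured magnitude pruning to a small weight matrix stored as nested Python lists. This simplest baseline is assumed here for illustration only; it is not any specific method evaluated in the paper, and the helper names `magnitude_prune` and `sparsity_of` are hypothetical.

```python
from typing import List


def magnitude_prune(weights: List[List[float]], sparsity: float) -> List[List[float]]:
    """Zero out the smallest-magnitude fraction `sparsity` of a weight matrix.

    Hypothetical helper: one-shot, unstructured magnitude pruning.
    Assumes a rectangular matrix (every row has the same length).
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    # Flatten row-major so flat index = row * ncols + col.
    flat = [abs(w) for row in weights for w in row]
    k = int(len(flat) * sparsity)  # number of weights to remove
    if k == 0:
        return [row[:] for row in weights]
    # Rank every entry by magnitude and mark the k smallest for removal.
    order = sorted(range(len(flat)), key=lambda i: flat[i])
    pruned = set(order[:k])
    ncols = len(weights[0])
    return [
        [0.0 if r * ncols + c in pruned else w for c, w in enumerate(row)]
        for r, row in enumerate(weights)
    ]


def sparsity_of(weights: List[List[float]]) -> float:
    """Fraction of entries that are exactly zero."""
    flat = [w for row in weights for w in row]
    return sum(1 for w in flat if w == 0.0) / len(flat)


W = [[0.9, -0.1, 0.4, -0.7],
     [0.05, 0.3, -0.8, 0.2]]
P = magnitude_prune(W, 0.5)  # remove the 4 smallest of 8 weights
print(P)
print(sparsity_of(P))
```

Magnitude pruning uses no calibration data: it simply removes the entries with the smallest absolute value. The paper's point is that a pruned model can look fine on perplexity while losing knowledge-intensive ability, which a weight-level statistic like this cannot reveal.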

Abstract

Despite their remarkable achievements, modern Large Language Models (LLMs) face exorbitant computational and memory footprints. Recently, several works have shown significant success in training-free and data-free compression (pruning and quantization) of LLMs that achieve 50-60% sparsity and reduce the bit width to 3 or 4 bits per weight, with negligible degradation of perplexity over the uncompressed baseline. As recent research efforts are focused on developing increasingly sophisticated compression methods, our work takes a step back and re-evaluates the effectiveness of existing SoTA compression methods, which rely on a fairly simple and widely questioned metric, perplexity (even for dense LLMs). We introduce Knowledge-Intensive Compressed LLM BenchmarK (LLM-KICK), a collection of carefully curated tasks to redefine the evaluation protocol for compressed LLMs, which have…
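Since the abstract's critique centers on perplexity, it may help to recall how that metric is computed: it is the exponential of the average negative log-likelihood a model assigns to each token of a reference text. The sketch below assumes the per-token log-probabilities have already been obtained from some causal language model; `perplexity` is a hypothetical helper, not part of the LLM-KICK code.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities.

    token_logprobs[i] is assumed to be log p(x_i | x_<i) from a causal
    language model (hypothetical input; obtaining it is model-specific).
    """
    if not token_logprobs:
        raise ValueError("need at least one token")
    # Average negative log-likelihood per token (cross-entropy in nats).
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    # Perplexity is the exponential of that average.
    return math.exp(mean_nll)


# Uniform guessing among 4 tokens gives a perplexity of (approximately) 4.0.
print(perplexity([math.log(0.25)] * 10))
```

A perplexity of 4 means the model is, on average, as uncertain as a uniform guess among four tokens. Because the number only summarizes average next-token likelihood, a compressed model can stay close to the dense baseline on it while still losing specific knowledge, which is the gap LLM-KICK is designed to expose.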

Peer Reviews

Decision: ICLR 2024 poster

Reviewer 01 · Rating: 8 (accept, good paper) · Confidence: 4

Strengths

- Timely ... with an array of papers on compressing LLMs coming out, including especially surprising results such as training-free pruning. It is important to equip researchers with better evaluation tools.
- Provides a decent array of dataset benchmarks that will be useful in research.
- Clearly shows the gap between evaluation with perplexity and with the other proposed datasets.

Weaknesses

Not weaknesses, but suggestions: 1. Add a summary table listing dataset statistics.

Reviewer 02 · Rating: 5 (marginally below the acceptance threshold) · Confidence: 4

Strengths

I think it is important to have a more fine-grained understanding of compression methods, especially to design new algorithms that can improve upon current weaknesses.

Weaknesses

- This paper is essentially benchmarking a few algorithms on a few datasets. Although the insights are interesting, the paper does not include any new model, data, or algorithm, which I'd say makes it more suitable for a workshop than a full conference paper.
- Some arguments are rather subjective. Why choose the 5% threshold? If we change the threshold to 10%, it seems 4-bit quantization is then within range in most cases, and sparse models can still be "competitive" for around 50% spars…
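Reviewer 02's objection is that "competitive" depends on an arbitrary cutoff. The sketch below assumes the criterion is a relative drop from the dense model's score (the paper's exact definition may differ; an absolute drop in accuracy points is another option) and shows how the same pair of scores is judged differently under a 5% and a 10% threshold. The function names are hypothetical, and the scores are made-up numbers for illustration, not results from the paper.

```python
def relative_drop(dense_score: float, compressed_score: float) -> float:
    """Fractional drop of a compressed model's score relative to the dense baseline.

    Assumes higher scores are better (e.g. accuracy) and dense_score > 0.
    """
    return (dense_score - compressed_score) / dense_score


def is_competitive(dense_score: float, compressed_score: float,
                   threshold: float = 0.05) -> bool:
    """True if the relative drop is within `threshold` (hypothetical criterion)."""
    return relative_drop(dense_score, compressed_score) <= threshold


# Made-up scores for illustration only, not results from the paper.
dense, compressed = 0.60, 0.56
for t in (0.05, 0.10):
    print(f"threshold={t:.0%}: competitive={is_competitive(dense, compressed, t)}")
```

With these illustrative scores the relative drop is about 6.7%, so the compressed model fails the 5% test and passes the 10% test: the verdict is a property of the chosen cutoff as much as of the model.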

Reviewer 03 · Rating: 8 (accept, good paper) · Confidence: 4

Strengths

1. Compression of LLMs is very timely and important.
2. The paper reveals new and as yet widely unknown gaps between compressed LLMs and their uncompressed counterparts.
3. The paper shows that compressed models may perform better on some tasks (e.g., In-Context Text Summarization) than on others (e.g., Factoid-based Question Answering).
4. The authors plan to release their code, which may help in the development of future compression techniques.

Weaknesses

1. The conclusions would be more robust and convincing if the evaluations used more than a single family of LLMs (i.e., Vicuna). Why not repeat these experiments with, e.g., Llama 2 and Falcon?
2. Regarding the observation that even 8-bit quantization shows evident gaps relative to uncompressed models, have the authors considered evaluating LLM.int8()? (https://arxiv.org/pdf/2208.07339.pdf)
3. It would help the reader to have a table summarizing all the tasks' performance over the dif…
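Reviewer 03's second point concerns the gap observed even at 8 bits. As a rough illustration of how weight quantization error scales with bit width, the sketch below applies symmetric, per-row absmax round-to-nearest quantization to a small matrix at 8, 4, and 3 bits. This plain scheme is an assumption chosen for simplicity: it is not LLM.int8(), which additionally keeps outlier feature dimensions in 16-bit precision, nor any other specific method evaluated in the paper. The helper names are hypothetical.

```python
from typing import List, Tuple


def quantize_row(row: List[float], bits: int) -> Tuple[List[int], float]:
    """Symmetric absmax round-to-nearest quantization of one weight row.

    Hypothetical helper: plain per-row RTN, with no outlier handling and
    no calibration data. Python's round() breaks ties to even.
    """
    qmax = 2 ** (bits - 1) - 1            # e.g. 127 for 8 bits, 7 for 4 bits
    absmax = max(abs(w) for w in row)
    scale = absmax / qmax if absmax > 0 else 1.0
    # Map each weight to the nearest integer level, clamped to [-qmax, qmax].
    q = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    return q, scale


def dequantize_row(q: List[int], scale: float) -> List[float]:
    """Map integer levels back to floating-point weights."""
    return [v * scale for v in q]


def max_abs_error(weights: List[List[float]], bits: int) -> float:
    """Largest reconstruction error over a matrix quantized row by row."""
    err = 0.0
    for row in weights:
        q, s = quantize_row(row, bits)
        err = max(err, max(abs(a - b) for a, b in zip(row, dequantize_row(q, s))))
    return err


W = [[0.9, -0.1, 0.4, -0.7],
     [0.05, 0.3, -0.8, 0.2]]
for bits in (8, 4, 3):
    print(bits, max_abs_error(W, bits))
```

The per-weight reconstruction error is bounded by half the row's scale, `absmax / (2 ** (bits - 1) - 1) / 2`, so the bound roughly doubles with each bit removed. Weight-space error of this kind is only a proxy; the paper's argument is that task-level evaluation is needed to see what compression actually costs.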

Code & Models

Repositories

vita-group/llm-kick
PyTorch · Official

Taxonomy

Methods: Pruning