A Comprehensive Evaluation of Quantization Strategies for Large Language Models
Renren Jin, Jiangcun Du, Wuwei Huang, Wei Liu, Jian Luan, Bin Wang, Deyi Xiong

TL;DR
This paper evaluates various quantization strategies for large language models, demonstrating that 4-bit quantization maintains performance, perplexity correlates with benchmark results, and larger models outperform smaller ones despite slower inference.
Contribution
It introduces a structured evaluation framework for quantized LLMs across multiple benchmarks, addressing gaps in understanding their performance and efficiency.
Findings
4-bit quantization retains performance similar to full-precision models
Perplexity correlates with benchmark performance for quantized LLMs
Larger quantized models outperform smaller ones in various tasks
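As a point of reference for the perplexity finding above, the following is a minimal sketch of the standard definition of perplexity: the exponential of the mean negative log-likelihood per token. The function name `perplexity` and the toy input are illustrative assumptions; this page does not describe the paper's evaluation code or datasets.

```python
import math

def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood), using natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# Toy input (hypothetical): a model that assigns probability 0.25 to every token.
uniform = [math.log(0.25)] * 8
ppl = perplexity(uniform)
```

Lower perplexity means the model assigns higher probability to the reference text, which is why it is often used as a cheap proxy when comparing a quantized model with its full-precision counterpart.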
Abstract
Increasing the number of parameters in large language models (LLMs) usually improves performance in downstream tasks but raises compute and memory costs, making deployment difficult in resource-limited settings. Quantization techniques, which reduce the bits needed for model weights or activations with minimal performance loss, have become popular due to the rise of LLMs. However, most quantization studies use pre-trained LLMs, and the impact of quantization on instruction-tuned LLMs and the relationship between perplexity and benchmark performance of quantized LLMs are not well understood. Evaluation of quantized LLMs is often limited to language modeling and a few classification tasks, leaving their performance on other benchmarks unclear. To address these gaps, we propose a structured evaluation framework consisting of three critical dimensions: (1) knowledge & capacity, (2)…
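To make "reducing the bits needed for model weights" concrete, here is a minimal sketch of symmetric round-to-nearest quantization to 4 bits. It is a generic illustration, not one of the specific quantization methods the paper evaluates; the names `quantize_rtn` and `dequantize` and the sample weights are assumptions made for this example.

```python
def quantize_rtn(weights, bits=4):
    """Symmetric round-to-nearest quantization of a list of floats.

    Returns integer codes in [-2**(bits-1), 2**(bits-1) - 1] and the scale
    needed to map them back to approximate float values.
    """
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit codes
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # avoid division by zero
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]

# Hypothetical weights, chosen only for illustration.
weights = [0.12, -0.70, 0.33, 0.04]
codes, scale = quantize_rtn(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is stored as a 4-bit integer plus one shared scale, and the rounding error per weight is bounded by half the scale. Practical schemes typically use one scale per group or channel of weights rather than one per tensor.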
Taxonomy
Topics: Topic Modeling · Recommender Systems and Techniques · Speech Recognition and Synthesis
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
