A Comprehensive Evaluation of Quantization Strategies for Large Language Models

Renren Jin; Jiangcun Du; Wuwei Huang; Wei Liu; Jian Luan; Bin Wang; Deyi Xiong

arXiv:2402.16775 · cs.CL · June 7, 2024 · 5 cites

TL;DR

This paper evaluates various quantization strategies for large language models, demonstrating that 4-bit quantization maintains performance, perplexity correlates with benchmark results, and larger models outperform smaller ones despite slower inference.

Contribution

It introduces a structured evaluation framework for quantized LLMs across multiple benchmarks, addressing gaps in understanding their performance and efficiency.

Findings

01. 4-bit quantization retains performance similar to full-precision models

02. Perplexity correlates with benchmark performance for quantized LLMs

03. Larger quantized models outperform smaller ones in various tasks

Abstract

Increasing the number of parameters in large language models (LLMs) usually improves performance in downstream tasks but raises compute and memory costs, making deployment difficult in resource-limited settings. Quantization techniques, which reduce the bits needed for model weights or activations with minimal performance loss, have become popular due to the rise of LLMs. However, most quantization studies use pre-trained LLMs, and the impact of quantization on instruction-tuned LLMs and the relationship between perplexity and benchmark performance of quantized LLMs are not well understood. Evaluation of quantized LLMs is often limited to language modeling and a few classification tasks, leaving their performance on other benchmarks unclear. To address these gaps, we propose a structured evaluation framework consisting of three critical dimensions: (1) knowledge & capacity, (2)…
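
To make the abstract's notion of "reducing the bits needed for model weights" concrete, here is a minimal sketch of symmetric round-to-nearest quantization to 4 bits. It is an illustration only, not one of the strategies the paper evaluates: the function names `quantize_rtn` and `dequantize`, the single per-tensor scale, and the sample weights are all assumptions made for this example.

```python
def quantize_rtn(weights, bits=4):
    """Symmetric round-to-nearest quantization of a non-empty list of floats.

    Hypothetical helper for illustration; uses one scale for the whole list.
    """
    qmax = 2 ** (bits - 1) - 1               # largest integer code, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Map each weight to the nearest integer code and clamp to [-qmax, qmax].
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from integer codes."""
    return [c * scale for c in codes]


# Made-up weights, chosen only to exercise the functions.
weights = [0.12, -0.53, 0.91, -0.07, 0.33]
codes, scale = quantize_rtn(weights, bits=4)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each weight is stored as a small integer code plus a shared scale, so the rounding error per weight is bounded by half the scale. Practical schemes typically use finer-grained scales (per channel or per group) to shrink that error.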

Code & Models

Repositories

cordercorder/quant_eval (PyTorch, official)

Models

🤗 alphrc/lilm · model · 6 dl · ♡ 2

Taxonomy

Topics: Topic Modeling · Recommender Systems and Techniques · Speech Recognition and Synthesis

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings