Understanding the Impact of Post-Training Quantization on Large Language Models

Somnath Roy

arXiv:2309.05210 · cs.CL · September 19, 2023

TL;DR

This paper investigates how post-training quantization affects large language models' performance, focusing on hyperparameter sensitivity, inference speed, and content quality, revealing insights into 4-bit quantization techniques and their practical implications.

Contribution

It provides a comprehensive analysis of post-training quantization effects on LLMs, comparing 4-bit methods and examining hyperparameter sensitivities and inference speed impacts.
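
As a rough illustration of what a 4-bit post-training method does to a weight tensor, the sketch below applies blockwise absmax quantization on a uniform signed integer grid. This is a simplifying assumption: nf4 and fp4 use their own non-uniform 16-value codebooks, which are not reproduced here. The function names and the block size of 64 are hypothetical choices for illustration, not details taken from the paper.

```python
import random


def quantize_block(values, bits=4):
    """Quantize one block of floats to signed integer codes with a single absmax scale.

    Assumption: a uniform grid of codes in [-qmax, qmax]; this is NOT the nf4 or fp4 codebook.
    """
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit signed codes
    absmax = max(abs(v) for v in values)
    scale = absmax / qmax if absmax > 0 else 1.0  # avoid dividing by zero for all-zero blocks
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def quantize(weights, block_size=64, bits=4):
    """Split a flat list of weights into blocks and quantize each block independently."""
    return [
        quantize_block(weights[i:i + block_size], bits)
        for i in range(0, len(weights), block_size)
    ]


def dequantize(blocks):
    """Reconstruct approximate float weights from (codes, scale) pairs."""
    restored = []
    for codes, scale in blocks:
        restored.extend(c * scale for c in codes)
    return restored


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [rng.gauss(0.0, 0.02) for _ in range(256)]  # made-up weights
    restored = dequantize(quantize(weights))
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"max absolute reconstruction error: {worst:.6f}")
```

Keeping one scale per block limits the damage a single outlier weight can do, since it only coarsens the grid for its own block rather than for the whole tensor.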

Findings

01. nf4 and fp4 are equally effective 4-bit quantization methods.

02. nf4 shows greater resilience to temperature variations in Llama2 models.

03. Int8 quantization results in slower inference speeds compared to unquantized models.
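
For readers unfamiliar with the temperature hyperparameter mentioned in the second finding, here is a minimal sketch of how it is commonly applied at decoding time: the logits are divided by the temperature before the softmax, which controls how sharply probability mass concentrates on the top token. The logits below are made-up numbers and the function name is hypothetical; this shows the mechanism only and does not reproduce any measurement from the paper.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Return sampling probabilities for the given logits at the given temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.0]  # illustrative values only
    for t in (0.5, 1.0, 2.0):
        probs = softmax_with_temperature(logits, t)
        print(t, [round(p, 3) for p in probs])
```

Lower temperatures make the distribution peakier and higher temperatures flatten it, which is why a model's sensitivity to this setting is worth checking separately for each quantization method.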

Abstract

Large language models (LLMs) are rapidly increasing in size, with the number of parameters becoming a key factor in the success of many commercial models, such as ChatGPT, Claude, and Bard. Even the recently released publicly accessible models for commercial usage, such as Falcon and Llama2, come equipped with billions of parameters. This significant increase in the number of parameters makes deployment and operation very costly. The remarkable progress in the field of quantization for large neural networks in general, and LLMs in particular, has made these models more accessible by enabling them to be deployed on consumer-grade GPUs. Quantized models generally demonstrate comparable performance levels to their unquantized base counterparts. Nonetheless, there exists a notable gap in our comprehensive understanding of how these quantized models respond to hyperparameters, such as…
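
The deployment-cost argument in the abstract comes down to simple arithmetic on weight storage. The sketch below estimates the memory needed for the weights alone at different bit widths. The 7-billion parameter count is an illustrative assumption, the function name is hypothetical, and the estimate deliberately ignores quantization constants, activations, and the key-value cache, so real usage will be higher.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Approximate storage for model weights in GiB (weights only, no overheads)."""
    total_bytes = n_params * bits_per_param / 8
    return total_bytes / 2 ** 30


if __name__ == "__main__":
    n_params = 7e9  # illustrative parameter count, not a figure from the paper
    for label, bits in (("16-bit", 16), ("8-bit", 8), ("4-bit", 4)):
        print(f"{label}: {weight_memory_gib(n_params, bits):.2f} GiB")
```

Under these assumptions, going from 16-bit to 4-bit weights cuts weight storage by a factor of four, which is what brings models of this scale within reach of consumer-grade GPUs.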

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Data Classification

Methods: Balanced Selection