Understanding the Impact of Post-Training Quantization on Large Language Models
Somnath Roy

TL;DR
This paper investigates how post-training quantization affects the performance of large language models, focusing on hyperparameter sensitivity, inference speed, and content quality, and reports practical findings on 4-bit quantization methods.
Contribution
The paper provides a comprehensive analysis of how post-training quantization affects LLMs, comparing 4-bit methods and examining their sensitivity to hyperparameters and their impact on inference speed.
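In common implementations, both 4-bit methods named in the findings, nf4 and fp4, store each weight as an index into a 16-entry table of levels, plus one scale shared by a block of weights. The following is a minimal plain-Python sketch of that scheme. It assumes absmax scaling over blocks of 64 and an NF4-style table built from standard-normal quantiles. The function names, the block size, and the 0.9677 offset are illustrative assumptions, and this is not any library's actual implementation.

```python
import random
from statistics import NormalDist


def nf4_style_codebook(offset: float = 0.9677) -> list[float]:
    """Build 16 levels in [-1, 1] from standard-normal quantiles.

    Assumption: 8 positive levels, 7 negative levels and an exact zero,
    loosely modeled on the NF4 recipe; not a library's actual table.
    """
    nd = NormalDist()
    # Quantiles at evenly spaced probabilities from `offset` toward 0.5.
    pos = [nd.inv_cdf(offset - i * (offset - 0.5) / 8) for i in range(8)]
    neg = [-nd.inv_cdf(offset - i * (offset - 0.5) / 7) for i in range(7)]
    levels = sorted(neg + [0.0] + pos)
    scale = max(abs(v) for v in levels)  # normalize so the extremes are +/-1
    return [v / scale for v in levels]


def quantize_blockwise(weights, codebook, block_size=64):
    """Return (codes, scales): one 4-bit code per weight, one absmax per block."""
    codes, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        absmax = max(abs(w) for w in block) or 1.0  # avoid dividing by zero
        scales.append(absmax)
        for w in block:
            x = w / absmax  # now in [-1, 1]
            # Nearest codebook entry; its index is the stored 4-bit code.
            codes.append(min(range(len(codebook)), key=lambda k: abs(codebook[k] - x)))
    return codes, scales


def dequantize_blockwise(codes, scales, codebook, block_size=64):
    """Map each code back to its level and rescale by its block's absmax."""
    return [codebook[c] * scales[i // block_size] for i, c in enumerate(codes)]


# Usage: hypothetical weights drawn from a narrow normal distribution.
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(256)]
cb = nf4_style_codebook()
codes, scales = quantize_blockwise(weights, cb)
restored = dequantize_blockwise(codes, scales, cb)
print("max abs error:", max(abs(a - b) for a, b in zip(weights, restored)))
```

Passing a different 16-level table to `quantize_blockwise` is enough to compare level placements on the same weights.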
Findings
nf4 and fp4 are equally effective 4-bit quantization methods.
nf4 shows greater resilience to temperature variations in Llama2 models.
Int8 quantization results in slower inference speeds compared to unquantized models.
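Temperature, the hyperparameter named in the nf4 finding, divides the logits before the softmax. Values below 1 sharpen the next-token distribution, and values above 1 flatten it. Below is a minimal plain-Python sketch of that mechanism. The function names and the logits are hypothetical, and the code illustrates only the rescaling, not the paper's experiments.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into next-token probabilities at a given temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


logits = [2.0, 1.0, 0.1]  # hypothetical next-token logits
for t in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

Because this rescaling is applied to whatever logits the model produces, any perturbation that quantization introduces into those logits passes through the same transformation at every temperature setting.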
Abstract
Large language models (LLMs) are rapidly increasing in size, and parameter count has become a key factor in the success of many commercial models, such as ChatGPT, Claude, and Bard. Even recently released models that are publicly available for commercial use, such as Falcon and Llama2, have billions of parameters. This growth in parameter count makes deployment and operation very costly. Remarkable progress in quantization for large neural networks in general, and LLMs in particular, has made these models more accessible by allowing them to be deployed on consumer-grade GPUs. Quantized models generally perform comparably to their unquantized base counterparts. Nonetheless, there is a notable gap in our understanding of how these quantized models respond to hyperparameters, such as…
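The cost argument above comes down to simple arithmetic: weight storage scales with parameter count times bits per weight. The following is a rough sketch that assumes a nominal 7-billion-parameter model and decimal gigabytes. For the blockwise case, it also assumes one 32-bit scale per block of 64 weights. It ignores activations, the KV cache, and any layers left unquantized, so real memory use is higher. The function name is hypothetical.

```python
def weight_memory_gb(n_params, bits_per_weight, block_size=None, scale_bits=32):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    bits = bits_per_weight
    if block_size:
        # Assumption: one scale of `scale_bits` bits stored per block of weights.
        bits += scale_bits / block_size
    return n_params * bits / 8 / 1e9


n = 7e9  # hypothetical 7-billion-parameter model
for label, bits, block in [
    ("16-bit (unquantized)", 16, None),
    ("int8", 8, None),
    ("4-bit, no scales", 4, None),
    ("4-bit + fp32 scale per 64 weights", 4, 64),
]:
    print(f"{label}: {weight_memory_gb(n, bits, block):.2f} GB")
# 16-bit (unquantized): 14.00 GB
# int8: 7.00 GB
# 4-bit, no scales: 3.50 GB
# 4-bit + fp32 scale per 64 weights: 3.94 GB
```

A drop from about 14 GB to about 4 GB of weights is the kind of reduction that lets models of this size fit on consumer-grade GPUs.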
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Data Classification
Methods: Balanced Selection
