K-Quantization and its Impact on Output Performance
Robin Baki Davidsson, Pierre Nugues

TL;DR
This paper examines how different levels of quantization from 2 to 8 bits affect the performance of large language models across various NLP tasks, highlighting the trade-offs between efficiency and accuracy.
Contribution
It provides a comprehensive analysis of quantization effects on multiple LLMs, revealing how model size and task type influence performance degradation at lower precisions.
Findings
Higher precision (e.g., 8-bit) improves performance with diminishing returns.
Aggressive quantization (e.g., 2-bit Q2_K) can still retain acceptable accuracy for some models.
Larger models are more resilient to lower-bit quantization, especially in mid-sized ranges.
Abstract
Recent advancements in large language models (LLMs) have shown their remarkable capacities in many NLP tasks. However, their substantial size often presents challenges for deployment. This necessitates efficient techniques for model compression, with quantization emerging as a prominent solution. Despite its benefits, the exact impact of quantization (from 2- to 6-bit) on the performance and accuracy of LLMs remains an active area of research. This paper investigates the performance of eight LLMs at various quantization levels, focusing on tasks such as MMLU-Pro for knowledge processing and reasoning, CRUXEval for code comprehension, and MuSR for reading comprehension. Our results show a consistent trend where higher precision (e.g., 8-bit Q8_0) yields improved performance, albeit with diminishing returns. Aggressive quantization (e.g., 2-bit Q2_K) usually retains acceptable accuracy,…
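To make the bit-width trade-off in the abstract concrete, the following is a minimal sketch of block-wise symmetric quantization, not the paper's method and not the actual K-quant formats (which, as far as we know, add super-blocks with separately quantized scales). It assumes one scale per block of 32 weights, loosely mirroring the Q8_0 layout, and applies the same scheme at every bit width purely for illustration. The function names, the Gaussian test weights, and the RMS metric are all hypothetical choices; weight reconstruction error is only a rough proxy for the task accuracy the paper measures.

```python
import random


def quantize_block(block, bits):
    """Symmetric round-to-nearest quantization of one block with one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 1 for 2 bits
    amax = max(abs(x) for x in block)
    scale = amax / qmax if amax > 0 else 1.0  # avoid division by zero
    codes = [max(-qmax, min(qmax, round(x / scale))) for x in block]
    return codes, scale


def dequantize_block(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [q * scale for q in codes]


def rms_error(weights, bits, block_size=32):
    """RMS reconstruction error after a quantize/dequantize round trip."""
    total = 0.0
    for i in range(0, len(weights), block_size):
        block = weights[i:i + block_size]
        codes, scale = quantize_block(block, bits)
        restored = dequantize_block(codes, scale)
        total += sum((a - b) ** 2 for a, b in zip(block, restored))
    return (total / len(weights)) ** 0.5


# Synthetic stand-in for a weight tensor (assumption: roughly Gaussian weights).
random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(4096)]
errors = {bits: rms_error(weights, bits) for bits in (2, 3, 4, 5, 6, 8)}

for bits, err in errors.items():
    print(f"{bits}-bit: RMS error {err:.4f}")
```

In this toy setting, each extra bit roughly halves the quantization step, so the absolute reduction in error shrinks as precision rises, which is one intuition for the diminishing returns reported at higher bit widths.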
