TL;DR
This paper shows that extremely low-bit quantized large language models suffer from smoothness degradation that harms generation quality, and proposes a smoothness-preserving approach whose gains go beyond what numerical accuracy alone provides.
Contribution
It identifies smoothness preservation as an important factor in extreme quantization of LLMs and demonstrates its benefits over traditional accuracy-focused methods.
Findings
Smoothness degradation worsens as bit-width decreases.
Preserving smoothness improves generation quality beyond what improving numerical accuracy alone achieves.
A simple smoothness-preserving principle enhances quantized LLM performance.
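The findings above can be made concrete with a small sketch. This summary does not specify the paper's smoothness proxy or its neighborhood model, so everything here is an illustrative assumption: `quantize_uniform` is plain symmetric round-to-nearest quantization, `local_sensitivity` is a hypothetical finite-difference proxy for smoothness, and `effective_candidates` simply counts tokens whose softmax probability exceeds a chosen threshold. None of these names come from the paper.

```python
import numpy as np

def quantize_uniform(w, bits):
    """Symmetric round-to-nearest quantization to `bits` bits (assumes bits >= 2)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    if scale == 0:
        return w.copy()  # all-zero input: nothing to quantize
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def local_sensitivity(f, x, eps=1e-3, trials=16, seed=0):
    """Hypothetical smoothness proxy (not the paper's definition):
    mean of ||f(x + d) - f(x)|| / ||d|| over random d with ||d|| = eps."""
    rng = np.random.default_rng(seed)
    ratios = []
    for _ in range(trials):
        d = rng.standard_normal(x.shape)
        d *= eps / np.linalg.norm(d)  # rescale the perturbation to norm eps
        ratios.append(np.linalg.norm(f(x + d) - f(x)) / eps)
    return float(np.mean(ratios))

def effective_candidates(logits, threshold=0.01):
    """Count tokens whose softmax probability exceeds `threshold`
    (an assumed stand-in for 'effective token candidates')."""
    z = logits - np.max(logits)  # shift for numerical stability
    p = np.exp(z) / np.sum(np.exp(z))
    return int(np.sum(p > threshold))

# Mean absolute weight error of a random matrix at several bit-widths.
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))
errors = {b: float(np.mean(np.abs(w - quantize_uniform(w, b)))) for b in (8, 4, 2)}
print(errors)
```

To compare models under these assumptions, one could evaluate `local_sensitivity` and `effective_candidates` on the outputs of a full-precision layer and of its quantized counterpart at the same inputs. The sketch only provides the measurement tools and makes no claim about what the paper's own proxy reports.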
Abstract
Large language models (LLMs) achieve strong performance but incur high deployment costs, motivating extremely low-bit but lossy quantization. Existing quantization algorithms mainly focus on improving the numerical accuracy of forward computation to eliminate performance degradation. In this paper, we show that extremely quantized LLMs suffer from systematic smoothness degradation beyond numerical precision loss. Through a smoothness proxy, we observe that such degradation becomes increasingly severe as the quantization bit-width decreases. Furthermore, based on sequence neighborhood modeling, we find that quantized models exhibit a rapid reduction of effective token candidates within the prediction neighborhood, which directly leads to a sparser decoding tree and degraded generation quality. To validate this, we introduce a simple smoothness-preserving principle in both post-training…
