A Comprehensive Evaluation on Quantization Techniques for Large Language Models
Yutong Liu, Cairong Zhao, Guosheng Hu

TL;DR
This paper provides a comprehensive, fair comparison of recent quantization techniques for large language models, analyzing their components, settings, and data formats to guide future improvements.
Contribution
It introduces a unified evaluation framework by decoupling quantization methods into two steps and systematically compares various settings and data formats.
Findings
Optimized rotation and scaling improve the pre-quantization transformation step.
Combining low-rank compensation with GPTQ can outperform GPTQ alone.
Finer granularity enhances performance but increases storage overhead.
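The granularity trade-off in the last finding can be illustrated with a small sketch. It is not the paper's evaluation framework. It assumes symmetric round-to-nearest 4-bit quantization, one 16-bit scale stored per group, and synthetic Gaussian weights with a single injected outlier. The names `quantize_group`, `quantize`, and `bits_per_weight` are hypothetical helpers introduced only for this illustration.

```python
import random


def quantize_group(values, bits=4):
    """Symmetric round-to-nearest quantization of one group sharing one scale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Quantize to the integer grid, clip, then dequantize back to floats.
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in values]


def quantize(weights, group_size, bits=4):
    """Quantize a flat weight list in contiguous groups of `group_size`."""
    out = []
    for start in range(0, len(weights), group_size):
        out.extend(quantize_group(weights[start:start + group_size], bits))
    return out


def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def bits_per_weight(group_size, bits=4, scale_bits=16):
    """Effective storage per weight, assuming one scale per group."""
    return bits + scale_bits / group_size


rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(1024)]
weights[7] = 20.0  # assumed outlier that inflates the scale of its group

results = {}
for g in (1024, 128, 32):  # coarse (whole tensor) to fine (32 per group)
    results[g] = (mse(weights, quantize(weights, g)), bits_per_weight(g))

for g, (err, cost) in results.items():
    print(f"group_size={g:5d}  mse={err:.4f}  bits/weight={cost:.3f}")
```

In this toy setup, smaller groups confine the outlier's large scale to fewer weights, so the reconstruction error falls, while the per-group scales raise the effective bits per weight from about 4.02 to 4.5.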
Abstract
For large language models (LLMs), post-training quantization (PTQ) can significantly reduce memory footprint and computational overhead. Model quantization is rapidly evolving. Though many papers report breakthrough results, they are often evaluated under different settings because a method typically contains multiple components. Analyzing connections among existing methods is important for deeper understanding. To bridge these gaps, we conduct an extensive review of state-of-the-art methods and perform comprehensive evaluations under the same conditions for fair comparison. To our knowledge, such a fair and extensive investigation remains critically underexplored. To better understand connections, first, we decouple published quantization methods into two steps: pre-quantization transformation and quantization error mitigation. The former is a preprocessing step that reduces outlier…
Taxonomy
Topics: Advanced Clustering Algorithms Research
