TL;DR
ADMM-Q is a novel layer-wise weight quantization algorithm for large language models that improves utility at aggressive quantization levels through an ADMM-based approach with convergence guarantees.
Contribution
The paper introduces ADMM-Q, a modular, layer-wise weight quantization method based on ADMM that improves the accuracy and efficiency of post-training quantization for LLMs.
Findings
Reduces WikiText-2 perplexity on Qwen3-8B models in various quantization settings.
Decreases perplexity from 12.85 to 10.06 in the weight-only setting.
Achieves lower perplexity in the SmoothQuant and SpinQuant procedures.
Abstract
Quantization is an effective strategy to reduce the storage and computation footprint of large language models (LLMs). Post-training quantization (PTQ) is a leading approach for compressing LLMs. Popular weight quantization procedures, including GPTQ and RTN, suffer in model utility, especially at aggressive quantization levels (sub-4-bit). We propose ADMM-Q, a novel weight quantization algorithm that considers the layer-wise quantization problem. Our algorithm is based on a combinatorial variant of the Alternating Direction Method of Multipliers (ADMM). Our operator-splitting procedure updates weights continuously to minimize the layer-wise reconstruction error, while gradually enforcing the quantization constraints with convergence guarantees. We propose additional algorithmic enhancements (e.g., penalty scheduling, preconditioning, and a local search post-processing step) to make…
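To make the operator-splitting idea in the abstract concrete, here is a minimal NumPy sketch of a generic ADMM loop for the layer-wise problem of minimizing 0.5 * ||X W0 - X W||_F^2 subject to W lying on a quantization grid. It is not the authors' implementation. The function names, the symmetric per-column uniform grid, the geometric penalty schedule, and the best-iterate safeguard are all assumptions made for illustration, and the paper's preconditioning and local search steps are not reproduced.

```python
import numpy as np


def make_grid_projector(W0, bits):
    """Return a projection onto a fixed symmetric uniform grid.

    Assumption: one scale per output column, taken from the original weights.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W0).max(axis=0, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard against all-zero columns

    def project(W):
        # Nearest grid point, i.e. round-to-nearest (RTN) on the fixed grid.
        return np.clip(np.round(W / scale), -qmax - 1, qmax) * scale

    return project


def recon_error(X, W0, W):
    """Layer-wise reconstruction error ||X W0 - X W||_F^2."""
    return float(np.linalg.norm(X @ (W0 - W)) ** 2)


def admm_quantize(X, W0, bits=3, rho0=0.01, growth=1.2, iters=50):
    """Hypothetical ADMM sketch for layer-wise weight quantization.

    X:  calibration activations, shape (n_samples, d_in)
    W0: full-precision weights, shape (d_in, d_out)
    """
    H = X.T @ X                          # Gram matrix of the layer inputs
    HW0 = H @ W0
    eye = np.eye(H.shape[0])
    project = make_grid_projector(W0, bits)

    Q = project(W0)                      # start from the RTN solution
    U = np.zeros_like(W0)                # scaled dual variable
    rho = rho0 * np.trace(H) / H.shape[0]
    best_Q, best_err = Q, recon_error(X, W0, Q)

    for _ in range(iters):
        # Continuous update: minimize reconstruction error plus the penalty.
        W = np.linalg.solve(H + rho * eye, HW0 + rho * (Q - U))
        # Combinatorial update: project onto the quantization grid.
        Q = project(W + U)
        # Dual update on the constraint W = Q.
        U = U + W - Q
        # Keep the best quantized iterate seen so far (a safeguard added here).
        err = recon_error(X, W0, Q)
        if err < best_err:
            best_Q, best_err = Q, err
        # Penalty schedule: grow rho and rescale the scaled dual to match.
        rho *= growth
        U = U / growth

    return best_Q
```

Calling `admm_quantize(X, W0, bits=3)` with a calibration matrix `X` and a weight matrix `W0` returns weights that lie on the 3-bit grid. Because the loop starts from the RTN solution and keeps the best iterate, the returned reconstruction error is never worse than RTN on the same grid in this sketch.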
