M-ANT: Efficient Low-bit Group Quantization for LLMs via Mathematically Adaptive Numerical Type
Weiming Hu, Haoyan Zhang, Cong Guo, Yu Feng, Renyang Guan, Zhendong Hua, Zihan Liu, Yue Guan, Minyi Guo, Jingwen Leng

TL;DR
MANT introduces a mathematically adaptive low-bit quantization method for LLMs that compresses model weights and KV caches more flexibly and efficiently, leading to significant speed and energy improvements.
Contribution
The paper proposes MANT, a novel adaptive numeric type and supporting framework for group-wise quantization, addressing distribution diversity and real-time processing challenges in LLM deployment.
Findings
Achieves nearly 3x speedup over state-of-the-art accelerators.
Reduces energy consumption by over 2.8x.
Effectively unifies weight and cache quantization processes.
Abstract
Large language models (LLMs) are among the most important computer applications today. Recent algorithmic advances propose fine-grained group-wise quantization for LLMs, which treats a small set of values (e.g., 64) in a tensor as a compression unit. It effectively preserves model accuracy without retraining and has become the standard approach to deploying LLMs efficiently. Other works propose various adaptive data types to better fit different distributions and further reduce the bit length required for LLMs. In this work, our detailed analysis unveils a key finding: while different tensors exhibit similar distributions, small groups can have markedly different distributions. This group-level diversity requires a new level of adaptivity that existing adaptive data types fail to provide. In this paper, we propose MANT, a…
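To make the notion of a group as a compression unit concrete, here is a minimal sketch of generic group-wise quantization, assuming symmetric round-to-nearest integers with one scale per group. This is not MANT's numeric type, which the truncated abstract does not specify; the function names, the 4-bit width, and the toy data are all illustrative assumptions.

```python
import random

def quantize_groupwise(values, group_size=64, bits=4):
    """Hypothetical helper: symmetric integer quantization, one scale per group."""
    qmax = 2 ** (bits - 1) - 1  # 7 for signed 4-bit codes
    codes, scales = [], []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        max_abs = max(abs(v) for v in group)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        # Round to the nearest code and clamp to the representable range.
        codes.extend(max(-qmax, min(qmax, round(v / scale))) for v in group)
    return codes, scales

def dequantize_groupwise(codes, scales, group_size=64):
    """Map each code back to a real value using its group's scale."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]

def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Toy "tensor" (an assumption): two groups of 64 values with very different spreads.
rng = random.Random(0)
tensor = [rng.gauss(0.0, 0.02) for _ in range(64)]
tensor += [rng.gauss(0.0, 1.0) for _ in range(64)]

# One scale per 64-value group versus a single scale for the whole tensor.
codes_g, scales_g = quantize_groupwise(tensor, group_size=64)
codes_t, scales_t = quantize_groupwise(tensor, group_size=len(tensor))

err_group = mean_squared_error(tensor, dequantize_groupwise(codes_g, scales_g, 64))
err_tensor = mean_squared_error(
    tensor, dequantize_groupwise(codes_t, scales_t, len(tensor))
)
print(f"per-group MSE:  {err_group:.6f}")
print(f"per-tensor MSE: {err_tensor:.6f}")
```

With a single tensor-wide scale, the narrow group is quantized on a grid sized for the wide group, so most of its values collapse to zero. Giving each group its own scale avoids this, which is the motivation for group-level adaptivity that the abstract describes.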
Taxonomy
Topics: Advanced Neural Network Applications · Natural Language Processing Techniques · Parallel Computing and Optimization Techniques
