BoA: Attention-aware Post-training Quantization without Backpropagation
Junhan Kim, Ho-young Kim, Eulrang Cho, Chungman Lee, Joonyoung Kim, Yongkweon Jeon

TL;DR
This paper introduces BoA, a backpropagation-free post-training quantization method for large language models. BoA accounts for inter-layer dependencies through attention-aware Hessian matrices, and the authors report state-of-the-art quantization performance.
Contribution
The paper presents a novel quantization algorithm that accounts for inter-layer interactions without backpropagation, improving LLM deployment efficiency.
Findings
Outperforms existing weight quantization methods
Effective in suppressing activation outliers
Achieves state-of-the-art quantization performance
Abstract
Post-training quantization (PTQ) is a promising solution for deploying large language models (LLMs) on resource-constrained devices. Early methods developed for small-scale networks, such as ResNet, rely on gradient-based optimization, which becomes impractical for hyper-scale LLMs with billions of parameters. While recently proposed backpropagation-free or transformation-based methods alleviate this issue, they ignore inter-layer interactions or use the naive nearest-rounding-based quantized weight assignment to save the heavy computational cost of weight optimization. In this paper, we introduce a novel backpropagation-free PTQ algorithm that optimizes quantized weights by considering inter-layer dependencies. The key innovation is the development of attention-aware Hessian matrices that capture inter-layer interactions within the attention module. Extensive experiments demonstrate…
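To make the terms in the abstract concrete, the following is a minimal sketch of the nearest-rounding baseline it mentions, together with the layer-wise Hessian error that backpropagation-free PTQ methods commonly minimize. It is not the BoA algorithm: the Hessian here is the plain input Gram matrix H = XᵀX of a single linear layer, not the paper's attention-aware Hessian, and all function names, sizes, and the 4-bit setting are assumptions chosen for illustration.

```python
import random

def quantize_rtn(weights, bits=4):
    """Round-to-nearest onto a uniform grid spanning [min, max] of one weight row."""
    lo, hi = min(weights), max(weights)
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [lo + round((w - lo) / scale) * scale for w in weights]

def gram(X):
    """H = X^T X for calibration inputs X (n_samples x d_in)."""
    d = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(d)] for i in range(d)]

def hessian_error(w, q, H):
    """Quadratic form (w - q) H (w - q)^T, computed without touching X again."""
    delta = [a - b for a, b in zip(w, q)]
    n = len(delta)
    return sum(delta[i] * H[i][j] * delta[j] for i in range(n) for j in range(n))

def output_error(w, q, X):
    """Sum over calibration samples of the squared change in the layer output."""
    return sum(sum(xi * (a - b) for xi, a, b in zip(x, w, q)) ** 2 for x in X)

# Hypothetical toy data: 32 calibration samples, one 8-dimensional weight row.
random.seed(0)
X = [[random.gauss(0, 1) for _ in range(8)] for _ in range(32)]
w = [random.gauss(0, 1) for _ in range(8)]

q = quantize_rtn(w, bits=4)
H = gram(X)
err_h = hessian_error(w, q, H)
err_o = output_error(w, q, X)
print(f"Hessian-form error: {err_h:.6f}  output error: {err_o:.6f}")
```

The two error values agree up to floating-point rounding, which is why a Hessian built from calibration inputs can stand in for the output error without any backward pass. Nearest rounding ignores H entirely. Per the abstract, BoA instead optimizes the quantized weights using Hessians that also capture interactions between layers inside the attention module.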
Taxonomy
Topics: Neural Networks and Applications
Methods: Softmax · Attention Is All You Need
