BoA: Attention-aware Post-training Quantization without Backpropagation

Junhan Kim; Ho-young Kim; Eulrang Cho; Chungman Lee; Joonyoung Kim; Yongkweon Jeon

arXiv:2406.13474 · cs.LG · June 9, 2025

TL;DR

This paper introduces BoA, a backpropagation-free post-training quantization method for large language models that considers inter-layer dependencies using attention-aware Hessian matrices, achieving state-of-the-art results.

Contribution

The paper presents a novel quantization algorithm that accounts for inter-layer interactions without backpropagation, improving LLM deployment efficiency.

Findings

1. Outperforms existing weight quantization methods

2. Effective in suppressing activation outliers

3. Achieves state-of-the-art quantization performance

Abstract

Post-training quantization (PTQ) is a promising solution for deploying large language models (LLMs) on resource-constrained devices. Early methods developed for small-scale networks, such as ResNet, rely on gradient-based optimization, which becomes impractical for hyper-scale LLMs with billions of parameters. While recently proposed backpropagation-free or transformation-based methods alleviate this issue, they ignore inter-layer interactions or use the naive nearest-rounding-based quantized weight assignment to save the heavy computational cost of weight optimization. In this paper, we introduce a novel backpropagation-free PTQ algorithm that optimizes quantized weights by considering inter-layer dependencies. The key innovation is the development of attention-aware Hessian matrices that capture inter-layer interactions within the attention module. Extensive experiments demonstrate…
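To make the contrast in the abstract concrete, here is a minimal sketch of the naive nearest-rounding weight assignment that the abstract says some methods fall back on. It is not the BoA algorithm and does not use the paper's attention-aware Hessian matrices. The function names `quantize_rtn` and `output_error`, the asymmetric min/max grid, and the per-row treatment are all assumptions chosen for illustration.

```python
def quantize_rtn(weights, bits=4):
    """Round-to-nearest uniform quantization of one weight row.

    Assumption for illustration: an asymmetric grid spanning [min, max].
    Each weight is rounded independently of all the others.
    """
    lo, hi = min(weights), max(weights)
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    quantized = []
    for w in weights:
        q = round((w - lo) / scale)      # nearest integer grid index
        q = max(0, min(levels, q))       # clamp to the valid range
        quantized.append(lo + q * scale) # map back to a real value
    return quantized


def output_error(w, w_q, inputs):
    """Sum over calibration inputs x of ((w - w_q) . x)^2.

    This is the layer-wise output error for a single output row.
    """
    err = 0.0
    for x in inputs:
        diff = sum((a - b) * xi for a, b, xi in zip(w, w_q, x))
        err += diff * diff
    return err


row = [0.0, 0.4, 1.0]
row_q = quantize_rtn(row, bits=1)
print(row_q)
print(output_error(row, row_q, [[1.0, 1.0, 1.0]]))
```

Nearest rounding picks each quantized weight in isolation, so it cannot trade an error in one weight against another to reduce the error at the layer output. Methods that optimize the quantized weights, as the abstract describes, aim to lower that output error rather than the per-weight rounding error.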

Videos

BoA: Attention-aware Post-training Quantization without Backpropagation · SlidesLive

Taxonomy

Topics: Neural Networks and Applications

Methods: Softmax · Attention Is All You Need