TurboBoA: Faster and Exact Attention-aware Quantization without Backpropagation
Junhan Kim, Yeo Jeong Park, Seungwoo Son, Chungman Lee, Ho-young Kim, Joonyoung Kim, Yongkweon Jeon

TL;DR
TurboBoA is a novel post-training quantization method for large language models that achieves faster processing and higher accuracy by jointly quantizing multiple channels and correcting propagated errors without backpropagation.
Contribution
It introduces a backpropagation-free quantization algorithm with joint channel quantization, error correction, and adaptive refinement, significantly improving speed and accuracy over prior methods.
Findings
TurboBoA is over three times faster than BoA.
It consistently improves quantization accuracy across experiments.
It achieves state-of-the-art results when combined with outlier suppression techniques.
Abstract
The rapid growth of large language models (LLMs) has heightened the importance of post-training quantization (PTQ) for reducing memory and computation costs. Among PTQ methods, GPTQ has gained significant attention for its efficiency, enabling billion-scale LLMs to be quantized within a few GPU hours. However, GPTQ's assumption of layer-wise independence leads to severe accuracy drops in low-bit regimes. Recently, BoA improved upon GPTQ by incorporating inter-layer dependencies within attention modules, but its reliance on sequential quantization across all out-channels makes it substantially less efficient. In this paper, we propose TurboBoA, a new backpropagation-free PTQ algorithm that preserves the accuracy benefits of BoA while significantly accelerating the process. The proposed TurboBoA introduces three key innovations: (i) joint quantization of multiple out-channels with a…
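To make the sequential-versus-joint distinction in the abstract concrete, here is a minimal sketch that counts how many sequential steps are needed when out-channels are quantized one at a time versus `n_joint` at a time. It is not the TurboBoA algorithm: plain round-to-nearest stands in for the quantizer, no error compensation is performed, and the names `quantize_rtn`, `quantize_in_groups`, and `n_joint` are hypothetical.

```python
def quantize_rtn(row, bits):
    """Round-to-nearest uniform quantization of one out-channel (list of floats)."""
    levels = 2 ** bits - 1
    lo, hi = min(row), max(row)
    scale = (hi - lo) / levels if hi > lo else 1.0
    # Map each weight to an integer code in [0, levels], then back to a float.
    return [lo + scale * min(levels, max(0, round((w - lo) / scale))) for w in row]


def quantize_in_groups(weight, bits, n_joint):
    """Quantize `weight` (a list of out-channel rows), n_joint rows per step.

    Returns the quantized matrix and the number of sequential steps taken.
    """
    quantized, steps = [], 0
    for start in range(0, len(weight), n_joint):
        group = weight[start:start + n_joint]
        # Assumption: a real method would also compensate quantization error
        # here; this sketch only rounds each row in the group.
        quantized.extend(quantize_rtn(row, bits) for row in group)
        steps += 1
    return quantized, steps


# Toy 32 x 8 weight matrix with deterministic entries (illustrative only).
weight = [[0.1 * ((i * 7 + j * 3) % 11) - 0.5 for j in range(8)] for i in range(32)]

q_seq, steps_sequential = quantize_in_groups(weight, bits=2, n_joint=1)
q_joint, steps_joint = quantize_in_groups(weight, bits=2, n_joint=16)
print(steps_sequential, steps_joint)
```

In this placeholder the grouping changes only the number of sequential steps (32 versus 2 for the toy matrix), and both runs produce identical weights. The accuracy effects described in the abstract come from the parts not modeled here.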
Peer Reviews
Decision·ICLR 2026 Poster
The paper provides three propositions with closed-form solutions for joint quantization, cross-block error compensation, and adaptive grid selection. The mathematical derivations are rigorous and elegant, enabling efficient backpropagation-free optimization. (Note that I have not checked the math very carefully.) TurboBoA achieves 4-6× speedup over BoA while improving accuracy. For INT2 quantization on Llama3.2-1B, it reduces Wiki2 PPL from 40.86 to 33.33 while cutting processing time from 13.3
There's no theoretical bound on accuracy degradation as a function of $N$ (number of jointly quantized channels) and model properties. The improvement compared with BoA is not practical. Although it reduces Wiki2 PPL from 40.86 to 33.33 while cutting processing time from 13.33 to 5.33 minutes, in my opinion, for an LLM PTQ method, reducing calibration time from 13 to 5 minutes is not a practical improvement. Important design choices lack ablation studies, including the number of coordinate
The motivation (reducing BoA's sequential dependency bottleneck, accumulated error, and fixed grid) is legitimate and practically important. The method shows good empirical improvement over BoA. Writing quality is mostly clean.
Motivation: the method does not completely consider cross-layer dependency, as the objective follows BoA in Table 1. As I understand it, rather than cross-layer dependency, this looks like sequential optimisation that takes into account the interaction of the attention mask, especially for the objectives of W_q and W_k.
Technical novelty: 1. The ΔX-aware accumulation term, and how the authors solve this problem, is conceptually very close to the GPTAQ-style accumulated loss [1]. Without explicit comparison, this make
* **Clear Motivation and Strong Problem Definition.**
    * The paper clearly articulates the limitations of existing backpropagation-free PTQ methods, situating the work in a well-understood context.
    * It correctly identifies a critical trade-off: GPTQ's layer-wise independence assumption leads to high speed but poor accuracy in low-bit regimes, whereas BoA's attention-aware dependency modeling is accurate but suffers from a severe bottleneck due to its sequential processing of out-channels
* **Analysis of Hyperparameter $N$.**
    * The paper introduces $N$, the number of jointly quantized out-channels, as a new and important hyperparameter governing the speed/accuracy trade-off.
    * While Table 2 ablates $N$ for $N \in \{1, 4, 8, 16\}$, all subsequent experiments (Tables 3, 4, 5, 6, 7) appear to use a fixed $N=16$.
    * The manuscript does not provide a discussion on how $N=16$ was chosen as the default, or how sensitive the final state-of-the-art results are to this specific
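The figures quoted in the reviews above for INT2 quantization of Llama3.2-1B can be put on a common footing with a few lines of arithmetic. The sketch below uses only the four numbers stated in the reviews; the variable names are illustrative.

```python
# Figures quoted in the reviews (Llama3.2-1B, INT2); names are illustrative.
boa_ppl, turbo_ppl = 40.86, 33.33          # Wiki2 perplexity (lower is better)
boa_minutes, turbo_minutes = 13.33, 5.33   # processing time in minutes

speedup = boa_minutes / turbo_minutes        # how many times faster
minutes_saved = boa_minutes - turbo_minutes  # absolute time saved
ppl_drop = boa_ppl - turbo_ppl               # absolute perplexity reduction
ppl_drop_pct = 100 * ppl_drop / boa_ppl      # relative perplexity reduction

print(f"speedup: {speedup:.2f}x, saved: {minutes_saved:.2f} min")
print(f"PPL reduction: {ppl_drop:.2f} ({ppl_drop_pct:.1f}%)")
```

For this single configuration the numbers work out to roughly a 2.5x speedup (8 minutes saved) and a perplexity reduction of 7.53, or about 18% relative to BoA.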
Taxonomy
Topics: Advanced Neural Network Applications · Natural Language Processing Techniques · Multimodal Machine Learning Applications
