Accumulator-Aware Post-Training Quantization for Large Language Models

Ian Colbert; Giuseppe Franco; Fabian Grob; Jinjie Zhang; Rayan Saab

arXiv:2409.17092·cs.LG·August 1, 2025

Open Access·3 Reviews

TL;DR

This paper introduces AXE, a novel post-training quantization framework for large language models that ensures overflow avoidance and supports multi-stage accumulation, improving efficiency without significant loss in model performance.

Contribution

AXE is the first accumulator-aware PTQ method providing overflow guarantees and enabling multi-stage accumulation for large language models.

Findings

01. AXE maintains up to 98% of FP16 perplexity on Llama3 8B.

02. AXE surpasses naive bit-width manipulation by up to 15%.

03. Supports full datapath optimization with overflow avoidance.

Abstract

When quantizing weights and activations to increasingly narrower representations, the cost of additions begins to dominate that of multiplications in multiply-accumulate (MAC) units. Recent studies show that reducing addition costs via low-precision accumulation improves throughput, power, and area across inference platforms, albeit with an increased risk of overflow. Accumulator-aware quantization research has so far only considered the quantization-aware training (QAT) paradigm, in which models are fine-tuned or trained from scratch with quantization in the loop. As models and datasets continue to grow in size, QAT techniques become increasingly more expensive, which has motivated the recent surge in post-training quantization (PTQ) research. To bridge this gap, we introduce AXE, the first accumulator-aware quantization framework explicitly designed to endow overflow avoidance…
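The overflow risk mentioned in the abstract can be made concrete with a small sketch. This is not the AXE algorithm itself, only the standard worst-case argument behind accumulator-aware constraints: for integer weights q and integer activations x, |Σ qᵢxᵢ| ≤ ||q||_1 · max|xᵢ|, so keeping ||q||_1 under a budget rules out overflow for every possible input, including every partial sum. The function names (`max_abs_activation`, `l1_budget`, `is_overflow_safe`) and the choice of a signed two's-complement accumulator are assumptions made for illustration.

```python
def max_abs_activation(act_bits: int, signed: bool) -> int:
    """Largest magnitude an act_bits-wide integer activation can take."""
    return 2 ** (act_bits - 1) if signed else 2 ** act_bits - 1


def l1_budget(acc_bits: int, act_bits: int, signed_acts: bool = False) -> int:
    """Largest integer value of ||q||_1 for which no input can overflow.

    Relies on |sum_i q_i * x_i| <= ||q||_1 * max|x_i| and requires the
    right-hand side to stay within the positive limit of a signed
    acc_bits-wide accumulator (an assumption of this sketch).
    """
    acc_max = 2 ** (acc_bits - 1) - 1
    return acc_max // max_abs_activation(act_bits, signed_acts)


def is_overflow_safe(q, acc_bits: int, act_bits: int, signed_acts: bool = False) -> bool:
    """Check whether the integer weight vector q respects the l1 budget."""
    return sum(abs(w) for w in q) <= l1_budget(acc_bits, act_bits, signed_acts)


# Unsigned 8-bit activations feeding a 16-bit accumulator.
budget = l1_budget(acc_bits=16, act_bits=8)          # 32767 // 255 = 128
small = is_overflow_safe([3, -4, 7, -1], 16, 8)      # ||q||_1 = 15, within budget
large = is_overflow_safe([7] * 20, 16, 8)            # ||q||_1 = 140, exceeds budget
```

The condition is sufficient rather than necessary: a weight vector that exceeds the budget overflows only for adversarial inputs, which is why a quantizer that enforces it trades some accuracy for a hard guarantee.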

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01·Rating 6·Confidence 3

Strengths

+ The accumulation-aware approach is well motivated from the hardware and implementation perspective: when weights and activations are quantized to low precision, 32-bit accumulation consumes the majority of power and area, while low-precision accumulation may increase the risk of numerical overflow, which degrades model accuracy.
+ The paper adopts an effective approach to theoretically guarantee overflow avoidance by constraining ||q||_1 in the post-training quantization process. To

Weaknesses

- The adoption of the two PTQ algorithms, GPFQ and OPTQ, and the applicability of AXE to other PTQ methods need justification.
- Because the concept of accumulation-aware quantization was proposed from the implementation perspective, it would be more convincing to demonstrate performance in terms of latency or throughput in addition to model accuracy. See the questions section for more details.

Reviewer 02·Rating 6·Confidence 2

Strengths

This is the first formal study of quantization on the accumulator size. The paper is well-written and easy to follow, with theoretical justifications. The innovative idea appears in equation (17) as a layer-wise operation. The authors adapt this result for two well-known post-training quantization methods, GPFQ and OPTQ.

Weaknesses

Although this is the first study on accumulators, I doubt its usefulness. Often, the accumulator size is hardware-dependent, and sometimes even unknown. There are ways to guess the accumulator size by running various experiments, but the size is not revealed by the manufacturer. When quantizing only weights, or weights and activations, the benefit is clear; I wonder how we can benefit from quantizing accumulators unless we design a new processor or a co-processor. This limits the impact of

Reviewer 03·Rating 5·Confidence 4

Strengths

The paper is well-written and easy to understand. The optimization it proposes concentrates on low-level hardware details and differs significantly from existing approaches in quantization research. Notably, the issue of accumulation round-off errors, which the paper addresses, is frequently overlooked by the Efficient AI community.

Weaknesses

The paper is too focused on quantization tied to the low-level hardware architecture, which makes me feel that ICLR may not be a very suitable venue for work like this. The paper's presentation raises concerns regarding its background setup and evaluation. First, it fails to acknowledge a range of prior studies in this field, including LLM.int8, ZeroQuant, AWQ, and others. Moreover, the paper lacks comparative analysis with weight-activation quantization methods, making it difficult to

Taxonomy

Topics: Neural Networks and Applications