SASQ: Static Activation Scaling for Quantization-Aware Training in Large Language Models
Shizhuo Mao, Song Chen, Yi Kang

TL;DR
SASQ introduces a lightweight quantization-aware training method focusing on activation quantization factors, enabling high-accuracy static inference for large language models without retraining weights.
Contribution
The paper presents SASQ, a novel QAT framework that optimizes only activation quantization factors, reducing training costs and improving deployment efficiency for LLMs.
Findings
Outperforms existing quantization schemes and FP16 models.
Achieves 5.2% lower perplexity than QuaRot on LLaMA2-7B.
Achieves 4.7% lower perplexity than FP16 on WikiText2.
Abstract
Large language models (LLMs) excel at natural language tasks but face deployment challenges because their size is growing faster than GPU memory capacity. Model quantization mitigates this issue by lowering weight and activation precision, but existing solutions face fundamental trade-offs: dynamic quantization incurs high computational overhead and poses deployment challenges on edge devices, while static quantization sacrifices accuracy. Existing quantization-aware training (QAT) approaches further suffer from the cost of training weights. We propose SASQ: a lightweight QAT framework specifically tailored for activation quantization factors. SASQ optimizes only the quantization factors (without changing pre-trained weights), enabling static inference with high accuracy while maintaining deployment efficiency. SASQ adaptively truncates some outliers, thereby reducing the…
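The static-versus-dynamic distinction drawn in the abstract, and the idea of tuning only an activation scale while truncating outliers, can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the synthetic data, the 4-bit width (chosen so the effect is visible on toy data), and the use of a grid search over candidate scales in place of gradient-based training are all assumptions made for illustration.

```python
import random

def quantize_static(xs, scale, bits=8):
    """Symmetric fake-quantization with a fixed (static) scale.

    Values beyond +/- qmax * scale are clipped, so outliers are truncated.
    """
    qmax = 2 ** (bits - 1) - 1
    out = []
    for x in xs:
        q = max(-qmax, min(qmax, round(x / scale)))
        out.append(q * scale)
    return out

def dynamic_scale(xs, bits=8):
    """Scale recomputed from the current input's largest magnitude (dynamic)."""
    qmax = 2 ** (bits - 1) - 1
    return max(abs(x) for x in xs) / qmax

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def fit_static_scale(calib, bits=8, steps=200):
    """Pick the static scale with the lowest reconstruction MSE on calib.

    Assumption: a grid search stands in for training the factor; no model
    weights are involved or modified anywhere in this sketch.
    """
    full = dynamic_scale(calib, bits)
    best_scale = full
    best_err = mse(calib, quantize_static(calib, full, bits))
    for i in range(1, steps):
        s = full * i / steps
        err = mse(calib, quantize_static(calib, s, bits))
        if err < best_err:
            best_scale, best_err = s, err
    return best_scale

BITS = 4  # hypothetical bit width for the toy example
random.seed(0)
# Synthetic activations: mostly small values plus a few large outliers.
calib = [random.gauss(0.0, 1.0) for _ in range(2000)] + [12.0, -15.0, 18.0]

full_range = dynamic_scale(calib, BITS)    # scale that covers every outlier
learned = fit_static_scale(calib, BITS)    # smaller scale that clips outliers
err_full = mse(calib, quantize_static(calib, full_range, BITS))
err_learned = mse(calib, quantize_static(calib, learned, BITS))

# At inference the static scale is reused as-is on unseen inputs,
# with no per-input statistics computed.
new_out = quantize_static([0.3, -1.2, 40.0], learned, BITS)
print(full_range, learned, err_full, err_learned, new_out)
```

In this toy setting the fitted scale is smaller than the full-range scale: it gives up accuracy on the few outliers in exchange for finer resolution on the bulk of the values, which is the trade-off that truncating outliers refers to.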
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The implementation description is detailed and easy to follow. 2. The paper tests the quantized model on generation tasks, which previous papers omitted.
2. The writing and logic of this paper are poor and should be improved. Although I am an expert in this area, it took me a long time to understand the paper. For example, Line 260 mentions Table 3; the jump is too large to follow. The paper should be reorganized. 3. I do not agree with the claim "However, our experiments show that such transformations can be realized simply by adjusting the quantization factors". SmoothQuant and QuaRot solve outliers with equivalent transformations, which
1. The paper is well written.
Manuscript related: 1. The paper advocates static activation quantization for efficient LLM inference, but resorts to dynamic activation quantization during the decoding phase. 2. Lines 188-191: "Some studies attempt to mitigate this by shifting such outliers through mathematical transformations Xiao et al. (2023); Ashkboos et al. (2024), but these approaches inevitably alter the model weights, which can disrupt the delicate internal representations learned during pre-training Kumar et al. (2024)."
- The paper is well-written and demonstrates clear logical flow. - The phase-based quantization strategy for handling prefill and decoding stages is interesting and has practical value for real-world deployment.
- The core technical innovation is limited. - The main claimed advantage is reduced tuning cost, but the paper lacks a thorough analysis and quantitative comparison (e.g., in terms of computational FLOPs, training time, or energy consumption) against traditional QAT methods to substantiate this claim robustly.
Focusing QAT solely on activation scaling factors is conceptually elegant and avoids expensive weight fine-tuning.
The paper explains why SASQ works mainly empirically. A more formal analysis of why optimizing scaling factors alone can work would strengthen the contribution. The paper lacks validation on instruction-following models, which are essential for evaluating practical performance and generalization. The baselines compared in this paper are mainly LLM-QAT and SpinQuant, both of which are relatively early methods.
Taxonomy
Topics: Advanced Neural Network Applications · Topic Modeling · Multimodal Machine Learning Applications
