Activation Outliers in Transformer Quantization: Reproduction, Statistical Analysis, and Deployment Tradeoffs
Pranav Kumar Kaliaperumal

TL;DR
This paper reproduces the activation-outlier problem in transformer post-training quantization, analyzes the statistical properties of those outliers, and evaluates mitigation strategies, finding that channel-aware approaches are important for maintaining accuracy.
Contribution
It provides a reproducible analysis of activation outliers in transformer PTQ, quantifies their heavy-tailed statistics (kurtosis and channel energy concentration), and compares mitigation methods with deployment considerations in mind.
Findings
Heavy-tailed activation distributions intensify with depth, with kurtosis reaching 271 in the final layers.
Mixed-precision PTQ restores accuracy to 89.42%, close to the 89.66% FP32 baseline.
Channel-aware quantization improves robustness over scalar clipping.
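As a rough illustration of why a single clipping range struggles with outlier channels, the sketch below compares one per-tensor scale against per-channel scales under symmetric int8 fake quantization. It is a toy under stated assumptions, not the paper's method: the activations are synthetic Gaussian data, and the names `make_activations`, `fake_quant_int8`, `n_outlier`, and `outlier_gain` are hypothetical.

```python
import numpy as np

def fake_quant_int8(x, scale):
    """Symmetric int8 fake quantization: snap to the grid, clip, dequantize."""
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale

def make_activations(tokens=512, channels=768, n_outlier=6,
                     outlier_gain=60.0, seed=0):
    """Synthetic (tokens, channels) activations with a few large channels.

    Assumption: outliers are modeled by scaling the first `n_outlier`
    channels; real transformer activations are not generated this way.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((tokens, channels))
    x[:, :n_outlier] *= outlier_gain
    return x

x = make_activations()

# One scale for the whole tensor: the outlier channels set the range.
scale_tensor = np.abs(x).max() / 127.0
# One scale per channel: each channel uses its own range.
scale_channel = np.abs(x).max(axis=0, keepdims=True) / 127.0

mse_tensor = np.mean((x - fake_quant_int8(x, scale_tensor)) ** 2)
mse_channel = np.mean((x - fake_quant_int8(x, scale_channel)) ** 2)

print(f"per-tensor MSE:  {mse_tensor:.6f}")
print(f"per-channel MSE: {mse_channel:.6f}")
```

With a shared scale, the step size is dictated by the few large channels, so the many small-magnitude channels are rounded onto a very coarse grid. Per-channel scales avoid that, at the cost of extra scale parameters and more complex integer kernels, which is one of the deployment tradeoffs the paper's title refers to.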
Abstract
Post-training quantization (PTQ) of transformers is known to suffer from severe accuracy degradation due to structured activation outliers, as originally analyzed by Bondarenko et al. (EMNLP 2021) in work associated with Qualcomm AI Research. This paper provides an empirical reproduction and systems-level extension of that phenomenon in BERT-base fine-tuned on QNLI. When global W8A8 quantization (8-bit weights and 8-bit activations) is applied, validation accuracy drops sharply from 89.66% (FP32) to 54.33%, a decrease of 35.33 points. Statistical analysis of FP32 activations shows strongly heavy-tailed behavior that intensifies with model depth: kurtosis reaches 271 in the final layers, and approximately 55% of activation energy is concentrated in the top 1% of channels. We evaluate several mitigation strategies. Mixed-precision PTQ restores accuracy close to the FP32 baseline (89.42%). Per-embedding-group (PEG)…
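The two statistics quoted in the abstract, kurtosis and the share of activation energy held by the top 1% of channels, can be sketched as follows. This is a minimal sketch under assumptions: the page does not say whether the paper reports Pearson or excess kurtosis or exactly how energy is defined, so the code uses Pearson kurtosis (a Gaussian gives about 3) and sum-of-squares energy per channel. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def kurtosis(x):
    """Pearson (non-excess) kurtosis of all values in x."""
    x = np.asarray(x, dtype=np.float64).ravel()
    centered = x - x.mean()
    var = np.mean(centered ** 2)
    return np.mean(centered ** 4) / var ** 2

def top_channel_energy_share(acts, fraction=0.01):
    """Share of total squared activation held by the top `fraction` of channels.

    `acts` has shape (tokens, channels); energy is the per-channel sum of squares.
    """
    energy = np.sum(np.asarray(acts, dtype=np.float64) ** 2, axis=0)
    k = max(1, int(np.ceil(fraction * energy.size)))
    top = np.sort(energy)[-k:]
    return top.sum() / energy.sum()

rng = np.random.default_rng(0)

# Baseline: plain Gaussian activations with no outlier channels.
gaussian = rng.standard_normal((512, 768))

# Synthetic heavy-tailed case: a handful of channels scaled up (an assumption,
# used only to show how the statistics react to outlier channels).
heavy = gaussian.copy()
heavy[:, :6] *= 60.0

for name, acts in [("gaussian", gaussian), ("heavy-tailed", heavy)]:
    print(f"{name}: kurtosis={kurtosis(acts):.1f}, "
          f"top-1% energy share={top_channel_energy_share(acts):.3f}")
```

On real models the `acts` array would come from hidden states captured at each layer, so that both statistics can be tracked as a function of depth.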
Taxonomy
Topics: Advanced Memory and Neural Computing · Parallel Computing and Optimization Techniques · Advanced Neural Network Applications
