Activation Outliers in Transformer Quantization: Reproduction, Statistical Analysis, and Deployment Tradeoffs

Pranav Kumar Kaliaperumal

arXiv:2603.04308 · cs.LG · March 5, 2026

Open Access

TL;DR

This paper reproduces the accuracy collapse that activation outliers cause in transformer post-training quantization (PTQ), analyzes the statistical properties of those outliers, and evaluates mitigation strategies. It emphasizes that channel-aware approaches are important for maintaining accuracy.

Contribution

The paper provides a reproducible analysis of activation outliers in transformer PTQ, characterizes their statistical distribution across layers, and compares mitigation methods with deployment considerations in mind.

Findings

01. Heavy-tailed activation distributions intensify with depth.

02. Mixed precision PTQ nearly restores baseline accuracy.

03. Channel-aware quantization improves robustness over scalar clipping.
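
The third finding can be illustrated with a small sketch. It is not the paper's code or its exact method: it compares a single per-tensor scale against per-channel scales (one simple channel-aware scheme) on synthetic data. The function names, the synthetic activations, and the choice of two outlier channels with a 30x gain are all assumptions made for illustration.

```python
import random


def fake_quantize(x, scale, bits=8):
    """Symmetric uniform quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    q = max(-qmax, min(qmax, round(x / scale)))
    return q * scale


def absmax_scale(values, bits=8):
    """Scale that maps the largest magnitude onto the top integer level."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    return peak / qmax if peak > 0 else 1.0


def per_tensor_mse(acts, bits=8):
    """Quantization error when one scale is shared by every channel."""
    scale = absmax_scale([v for row in acts for v in row], bits)
    errs = [(v - fake_quantize(v, scale, bits)) ** 2 for row in acts for v in row]
    return sum(errs) / len(errs)


def per_channel_mse(acts, bits=8):
    """Quantization error when each channel (column) gets its own scale."""
    n_channels = len(acts[0])
    scales = [absmax_scale([row[c] for row in acts], bits) for c in range(n_channels)]
    errs = [(v - fake_quantize(v, scales[c], bits)) ** 2
            for row in acts for c, v in enumerate(row)]
    return sum(errs) / len(errs)


def make_activations(n_tokens=64, n_channels=200, outliers=(7, 123), gain=30.0, seed=0):
    """Hypothetical data: Gaussian activations with a few amplified channels."""
    rng = random.Random(seed)
    acts = [[rng.gauss(0.0, 1.0) for _ in range(n_channels)] for _ in range(n_tokens)]
    for row in acts:
        for c in outliers:
            row[c] *= gain
    return acts


acts = make_activations()
print("per-tensor MSE :", per_tensor_mse(acts))
print("per-channel MSE:", per_channel_mse(acts))
```

With a shared scale, the outlier channels set the step size for every channel, so ordinary channels lose most of their resolution. Per-channel scales confine that cost to the outlier channels themselves.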

Abstract

Post-training quantization (PTQ) of transformers is known to suffer from severe accuracy degradation due to structured activation outliers, as originally analyzed by Bondarenko et al. (EMNLP 2021) in work associated with Qualcomm AI Research. This paper provides a reproducible empirical study and systems-level extension of that phenomenon in BERT-base fine-tuned on QNLI. When global W8A8 quantization is applied, validation accuracy drops sharply from 89.66% (FP32) to 54.33%, a decrease of 35.33 percentage points. Statistical analysis of FP32 activations shows strongly heavy-tailed behavior that intensifies with model depth: kurtosis reaches 271 in the final layers, and approximately 55% of activation energy is concentrated in the top 1% of channels. We evaluate several mitigation strategies. Mixed precision PTQ restores accuracy close to the FP32 baseline (89.42%). Per-embedding-group (PEG)…
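
The two statistics quoted in the abstract, kurtosis and the share of activation energy in the top 1% of channels, can be computed along the following lines. This is a minimal sketch under stated assumptions: the abstract does not say whether kurtosis is reported as Pearson or excess kurtosis, nor exactly how energy is defined, so the sketch uses Pearson kurtosis and summed squared activations per channel. The data is synthetic and every name is hypothetical.

```python
import random


def kurtosis(values):
    """Pearson kurtosis: fourth central moment over squared variance.

    A Gaussian gives roughly 3; heavy tails give much larger values.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 ** 2)


def top_channel_energy_share(acts, fraction=0.01):
    """Share of total squared activation held by the top `fraction` of channels.

    `acts` is a list of token rows, each a list of per-channel activations.
    """
    n_channels = len(acts[0])
    energy = [sum(row[c] ** 2 for row in acts) for c in range(n_channels)]
    k = max(1, int(n_channels * fraction))
    top = sorted(energy, reverse=True)[:k]
    return sum(top) / sum(energy)


def make_activations(n_tokens=64, n_channels=200, outliers=(7, 123), gain=30.0, seed=0):
    """Hypothetical data: Gaussian activations with a few amplified channels."""
    rng = random.Random(seed)
    acts = [[rng.gauss(0.0, 1.0) for _ in range(n_channels)] for _ in range(n_tokens)]
    for row in acts:
        for c in outliers:
            row[c] *= gain
    return acts


baseline = make_activations(gain=1.0)   # no outlier channels
heavy = make_activations(gain=30.0)     # two amplified channels

for name, acts in (("baseline", baseline), ("heavy-tailed", heavy)):
    flat = [v for row in acts for v in row]
    print(name, "kurtosis:", kurtosis(flat),
          "top-1% energy share:", top_channel_energy_share(acts))
```

On real models, `acts` would be replaced by FP32 activations captured per layer, which is what would allow the depth trend described above to be checked.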

Taxonomy

Topics: Advanced Memory and Neural Computing · Parallel Computing and Optimization Techniques · Advanced Neural Network Applications