Taming the Exponential: A Fast Softmax Surrogate for Integer-Native Edge Inference
Dimitrios Danopoulos, Enrico Lupi, Michael Kagan, Maurizio Pierini

TL;DR
This paper introduces HCCS (Head-Calibrated Clipped-Linear Softmax), a hardware-efficient softmax surrogate for Transformer attention that speeds up inference on AMD AI engines while keeping accuracy competitive, especially under low-precision inference.
Contribution
The paper proposes HCCS, a clipped-linear softmax surrogate with per-head calibration parameters, designed for int8 hardware implementation. It runs faster than existing softmax implementations while keeping model accuracy competitive.
Findings
HCCS runs significantly faster than existing softmax implementations on AMD AI engines.
HCCS maintains competitive accuracy on small or quantized MHA workloads.
HCCS maps naturally to int8 MAC units, enabling high-throughput inference.
Abstract
Softmax can become a computational bottleneck in the Transformer model's Multi-Head Attention (MHA) block, particularly in small models under low-precision inference, where exponentiation and normalization incur significant overhead. As such, we suggest using Head-Calibrated Clipped-Linear Softmax (HCCS), a bounded, monotone surrogate to the exponential softmax function, which uses a clipped linear mapping of the max-centered attention logits. This approximation produces a stable probability distribution, maintains the ordering of the original logits, and has non-negative values. HCCS differs from previous softmax surrogates in that it includes a set of lightweight calibration parameters that are optimized offline on a representative dataset and calibrated for each individual attention head to preserve the statistical properties of the individual heads. We describe a hardware-motivated…
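To make the idea in the abstract concrete, here is a minimal floating-point sketch of a clipped-linear softmax surrogate. It is not the authors' implementation: the exact HCCS formula, its integer arithmetic, and its calibration procedure are not given in this summary. The function name `clipped_linear_softmax`, the per-head slope `alpha`, and the values in `head_alphas` are assumptions chosen only for illustration.

```python
from typing import Dict, List


def clipped_linear_softmax(logits: List[float], alpha: float) -> List[float]:
    """Illustrative clipped-linear surrogate (assumed form, not the paper's exact one).

    Each logit is max-centered, mapped through a line with slope `alpha`,
    clipped at zero, and the results are normalized to sum to one.
    """
    if not logits:
        raise ValueError("logits must be non-empty")
    if alpha <= 0:
        raise ValueError("alpha must be positive")

    m = max(logits)
    # Max-centered differences are <= 0, so every score lies in [0, 1].
    scores = [max(0.0, 1.0 + alpha * (x - m)) for x in logits]
    # The largest logit always scores exactly 1, so the sum is never zero.
    total = sum(scores)
    return [s / total for s in scores]


# Hypothetical per-head slopes; in the paper these would come from
# offline calibration on a representative dataset.
head_alphas: Dict[int, float] = {0: 0.5, 1: 0.25}

logits = [2.0, 1.0, -3.0]
probs_by_head = {
    head: clipped_linear_softmax(logits, alpha)
    for head, alpha in head_alphas.items()
}
```

In this sketch the output is non-negative, sums to one, and never reverses the order of the logits, which matches the properties listed in the abstract. Logits far below the maximum are clipped to exactly zero, so ordering among them is preserved only weakly (as ties). An int8 version would replace the floating-point multiply and divide with integer operations, which this sketch does not attempt.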
