Hardware-Oriented Approximations of Softmax and RMSNorm for Efficient Transformer Inference
Yiwen Kang, Dong Wang

TL;DR
This paper introduces hardware-efficient approximations of Softmax and RMSNorm for Transformer inference, reducing resource use and latency while maintaining model accuracy.
Contribution
Novel hardware-oriented approximations for Softmax and RMSNorm with optimized bit-width and error compensation for efficient Transformer inference.
Findings
A bipartite lookup table with range reduction reduces Softmax computation cost and maintains accuracy.
RMSNorm is accelerated using LOD-driven decomposition, enabling parallel computation with minimal hardware.
An FPGA-based accelerator achieves low latency and power consumption with reduced resource usage.
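The summary does not give the details of the LOD-driven decomposition, so the following is only a minimal sketch of one common way a leading-one detector (LOD) can be used for the reciprocal square root that RMSNorm needs. It assumes the sum of squares m is a positive integer written as m = 2^k * (1 + f), so that 1/sqrt(m) = 2^(-k/2) * (1 + f)^(-1/2), with the mantissa term read from a small table. The names FRAC_BITS, RSQRT_LUT, approx_rsqrt, and rms_norm_approx are hypothetical and the table size is an arbitrary choice, not the paper's configuration.

```python
import math

FRAC_BITS = 6  # assumed LUT index width (hypothetical, not from the paper)

# (1 + f)^(-1/2) sampled at the midpoint of each mantissa interval.
RSQRT_LUT = [
    (1.0 + (i + 0.5) / (1 << FRAC_BITS)) ** -0.5
    for i in range(1 << FRAC_BITS)
]


def leading_one_position(v: int) -> int:
    """Index of the most significant set bit, as a hardware LOD would report."""
    return v.bit_length() - 1


def approx_rsqrt(m: int) -> float:
    """Approximate 1/sqrt(m) for a positive integer m via LOD + mantissa LUT."""
    if m <= 0:
        raise ValueError("m must be positive")
    k = leading_one_position(m)
    mask = (1 << FRAC_BITS) - 1
    # Align the bits just below the leading one to form the LUT index.
    if k >= FRAC_BITS:
        idx = (m >> (k - FRAC_BITS)) & mask
    else:
        idx = (m << (FRAC_BITS - k)) & mask
    # 2^(-k/2): a shift by k // 2, plus a constant 1/sqrt(2) when k is odd.
    scale = math.ldexp(1.0, -(k // 2))
    if k & 1:
        scale *= math.sqrt(0.5)
    return RSQRT_LUT[idx] * scale


def rms_norm_approx(x, gamma):
    """RMSNorm over integer activations x with per-channel gains gamma."""
    n = len(x)
    sum_sq = sum(v * v for v in x)
    # max(..., 1) stands in for the usual epsilon guard in this sketch.
    inv_rms = approx_rsqrt(max(sum_sq, 1)) * math.sqrt(n)
    return [v * inv_rms * g for v, g in zip(x, gamma)]
```

With a 64-entry table and midpoint sampling, the relative error of approx_rsqrt stays below roughly 0.4%; a real design would also quantize the table entries and the final multiply.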
Abstract
With the rapid advancement of Transformer-based large language models (LLMs), these models have found widespread applications in industrial domains such as code generation and non-functional requirement (NFR) classification in software engineering. However, recent research has primarily focused on optimizing linear matrix operations, while nonlinear operators remain relatively underexplored. This paper proposes hardware-efficient approximation and acceleration methods for the Softmax and RMSNorm operators to reduce resource cost and accelerate Transformer inference while maintaining model accuracy. For the Softmax operator, an additional range reduction based on the SafeSoftmax technique enables the adoption of a bipartite lookup table (LUT) approximation and acceleration. The bit-width configuration is optimized through Pareto frontier analysis to balance precision and hardware cost,…
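To make the Softmax idea concrete, here is a minimal sketch under stated assumptions rather than the authors' design. SafeSoftmax subtracts the row maximum so every exponent is non-positive; rewriting exp(d) as 2^(-t) with t = -d * log2(e) and splitting t into an integer part q and a fraction r reduces the problem to a right shift by q plus a table for 2^(-r) on [0, 1). That table is built here with the classic bipartite scheme (a value table indexed by the upper bits plus a slope-correction table indexed by the top and bottom bits). The field widths N0, N1, N2 and all function names are hypothetical, and table entries are kept as floats for readability, whereas hardware would store quantized words.

```python
import math

N0, N1, N2 = 3, 3, 3          # assumed bit split of the fraction (hypothetical)
N = N0 + N1 + N2
LOG2E = math.log2(math.e)

# Midpoints of the middle and low fields, expressed in units of r.
DELTA1 = ((1 << N1) - 1) / 2 / (1 << (N0 + N1))
DELTA2 = ((1 << N2) - 1) / 2 / (1 << N)

# Value table: 2^(-r) at (top, mid) with the low field fixed at its midpoint.
T0 = [2.0 ** -(hi / (1 << (N0 + N1)) + DELTA2) for hi in range(1 << (N0 + N1))]

# Correction table: one slope per top field, times the low-field offset.
T1 = [
    [
        -math.log(2.0) * 2.0 ** -(top / (1 << N0) + DELTA1 + DELTA2)
        * (lo / (1 << N) - DELTA2)
        for lo in range(1 << N2)
    ]
    for top in range(1 << N0)
]


def bipartite_exp2_neg(r_int: int) -> float:
    """Approximate 2^(-r) for r = r_int / 2^N in [0, 1) with two lookups and an add."""
    hi = r_int >> N2               # top and middle fields
    top = r_int >> (N1 + N2)       # top field only
    lo = r_int & ((1 << N2) - 1)   # low field
    return T0[hi] + T1[top][lo]


def softmax_approx(x):
    """SafeSoftmax with base-2 range reduction and a bipartite LUT."""
    m = max(x)
    exps = []
    for v in x:
        t_fix = int((m - v) * LOG2E * (1 << N))   # truncate to N fraction bits
        q = t_fix >> N                             # integer part: shift amount
        r_int = t_fix & ((1 << N) - 1)             # fractional part: LUT input
        exps.append(math.ldexp(bipartite_exp2_neg(r_int), -q))
    total = sum(exps)
    return [e / total for e in exps]
```

With this 3/3/3 split the table error is on the order of 2e-4, and truncating t to 9 fraction bits adds a relative error of up to about 0.14%. Trading these widths against table size is the kind of choice a Pareto analysis of precision versus hardware cost would settle.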
Taxonomy
Topics: Embedded Systems Design Techniques · Parallel Computing and Optimization Techniques · Advanced Neural Network Applications
