Hardware-Oriented Approximations of Softmax and RMSNorm for Efficient Transformer Inference

Yiwen Kang; Dong Wang

PMC · DOI:10.3390/mi17010084 · January 7, 2026

Open Access

TL;DR

This paper introduces efficient hardware methods to approximate Softmax and RMSNorm in Transformers, reducing resource use and latency without sacrificing accuracy.

Contribution

Novel hardware-oriented approximations for Softmax and RMSNorm with optimized bit-width and error compensation for efficient Transformer inference.
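The abstract describes the bit-width optimization named here as a Pareto frontier analysis that balances precision against hardware cost. As a rough sketch of what such an analysis could look like, the code below sweeps the three field widths of a bipartite table for 2^(-f), measures the worst-case error of each configuration, and keeps only the non-dominated ones. The cost proxy (total table entries), the swept range of widths, the evaluation grid, and every function name are assumptions for illustration; a real design would presumably use synthesized resource figures and also sweep the output word width.

```python
import math
from itertools import product

GRID = 4096  # evaluation points for f in [0, 1); assumed, not from the paper


def max_error(n0, n1, n2):
    """Worst-case |approx - 2^-f| on the grid, including truncation of f to n bits."""
    n = n0 + n1 + n2
    d1 = (2**n1 - 1) / 2**(n0 + n1 + 1)   # midpoint of the x1 field
    d2 = (2**n2 - 1) / 2**(n + 1)         # midpoint of the x2 field
    worst = 0.0
    for t in range(GRID):
        f = t / GRID
        q = int(f * 2**n)                 # truncate f to n fractional bits
        i0 = q >> (n1 + n2)
        i1 = (q >> n2) & (2**n1 - 1)
        i2 = q & (2**n2 - 1)
        # Table of initial values, indexed by (x0, x1).
        tiv = 2.0 ** -((i0 * 2**n1 + i1) / 2**(n0 + n1) + d2)
        # Table of offsets, indexed by (x0, x2): slope times centred displacement.
        to = -math.log(2) * 2.0 ** -(i0 / 2**n0 + d1 + d2) * (i2 / 2**n - d2)
        worst = max(worst, abs(tiv + to - 2.0 ** -f))
    return worst


def table_entries(n0, n1, n2):
    """Hypothetical cost proxy: number of entries in both tables."""
    return 2**(n0 + n1) + 2**(n0 + n2)


def pareto_front(cands):
    """Keep (error, cost, widths) tuples that no other candidate dominates."""
    front = [
        c for c in cands
        if not any(o[0] <= c[0] and o[1] <= c[1] and (o[0] < c[0] or o[1] < c[1])
                   for o in cands)
    ]
    return sorted(front, key=lambda c: (c[1], c[0]))


candidates = [(max_error(*w), table_entries(*w), w)
              for w in product(range(2, 5), repeat=3)]
front = pareto_front(candidates)
for err, cost, widths in front:
    print(widths, cost, f"{err:.2e}")
```

Each printed row is one configuration on the frontier: moving down the list buys lower error at the price of more table entries, and a designer would pick the cheapest row that meets the accuracy target.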

Findings

01

A bipartite lookup table with range reduction reduces Softmax computation cost and maintains accuracy.

02

RMSNorm is accelerated using LOD-driven decomposition, enabling parallel computation with minimal hardware.

03

An FPGA-based accelerator achieves low latency and power consumption with reduced resource usage.
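The LOD-driven idea in the second finding can be sketched in software, under the assumption that the leading-one detector (LOD) splits the mean square into a power of two and a mantissa, so that the reciprocal square root reduces to a shift plus a small table lookup. The fixed-point format (`FRAC_BITS`), the table size (`LUT_BITS`), and all function names are hypothetical; this excerpt does not specify the paper's decomposition, and the parallelism it mentions is not modelled here.

```python
import math

FRAC_BITS = 16  # assumed fractional bits of the fixed-point mean square
LUT_BITS = 6    # assumed mantissa bits used to index the table

# Reciprocal square root of u = 2^p * r, with r in [1, 2) sampled at bin
# midpoints and p the parity of the exponent, so u covers [1, 4).
RSQRT_LUT = [
    [1.0 / math.sqrt((1.0 + (j + 0.5) / 2**LUT_BITS) * 2**p)
     for j in range(2**LUT_BITS)]
    for p in (0, 1)
]


def rsqrt_lod(v):
    """Approximate 1/sqrt(v / 2**FRAC_BITS) for a positive integer v."""
    lod = v.bit_length() - 1              # position of the leading one
    e = lod - FRAC_BITS                   # v / 2^FRAC_BITS = 2^e * r, r in [1, 2)
    # Take the LUT_BITS bits just below the leading one as the table index.
    if lod >= LUT_BITS:
        j = (v >> (lod - LUT_BITS)) & (2**LUT_BITS - 1)
    else:
        j = (v << (LUT_BITS - lod)) & (2**LUT_BITS - 1)
    p = e & 1                             # exponent parity (also valid for negative e)
    h = (e - p) // 2                      # e = 2*h + p
    return math.ldexp(RSQRT_LUT[p][j], -h)  # table value shifted by -h bits


def rmsnorm_lod(xs, gamma=None, eps=1e-6):
    """RMSNorm using the LOD-based reciprocal square root above."""
    n = len(xs)
    mean_sq = sum(x * x for x in xs) / n + eps
    v = max(1, int(mean_sq * 2**FRAC_BITS))  # quantize to fixed point
    scale = rsqrt_lod(v)
    g = gamma if gamma is not None else [1.0] * n
    return [x * scale * gi for x, gi in zip(xs, g)]


print(rmsnorm_lod([0.5, -1.5, 2.0, 3.0]))
```

With a 6-bit mantissa index and midpoint sampling, the relative error of the scale factor in this sketch stays below roughly 0.4%; a hardware version would trade `LUT_BITS` against table size in the same way.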

Abstract

With the rapid advancement of Transformer-based large language models (LLMs), these models have found widespread applications in industrial domains such as code generation and non-functional requirement (NFR) classification in software engineering. However, recent research has primarily focused on optimizing linear matrix operations, while nonlinear operators remain relatively underexplored. This paper proposes hardware-efficient approximation and acceleration methods for the Softmax and RMSNorm operators to reduce resource cost and accelerate Transformer inference while maintaining model accuracy. For the Softmax operator, an additional range reduction based on the SafeSoftmax technique enables the adoption of a bipartite lookup table (LUT) approximation and acceleration. The bit-width configuration is optimized through Pareto frontier analysis to balance precision and hardware cost,…
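As a rough illustration of the Softmax path described in the abstract, the sketch below subtracts the row maximum (SafeSoftmax), rewrites exp(z) for z ≤ 0 as 2^(-k) · 2^(-f) with integer k and fraction f in [0, 1), and evaluates 2^(-f) with a two-table bipartite lookup. The field widths `N0`, `N1`, `N2`, the function names, the cutoff on k, and the use of floating-point table entries are assumptions made for illustration; the paper's actual bit-widths, table contents, and error compensation are not given in this excerpt.

```python
import math

# Assumed split of a 9-bit fraction into three fields (illustrative only).
N0, N1, N2 = 3, 3, 3
N = N0 + N1 + N2
LOG2E = math.log2(math.e)

# Midpoints of the x1 and x2 fields, used to centre the linear correction.
D1 = (2**N1 - 1) / 2**(N0 + N1 + 1)
D2 = (2**N2 - 1) / 2**(N + 1)

# Table of initial values: 2^-f sampled on the (x0, x1) grid.
TIV = [[2.0 ** -((i0 * 2**N1 + i1) / 2**(N0 + N1) + D2)
        for i1 in range(2**N1)] for i0 in range(2**N0)]
# Table of offsets: slope near x0 times the centred x2 displacement.
TO = [[-math.log(2) * 2.0 ** -(i0 / 2**N0 + D1 + D2) * (i2 / 2**N - D2)
       for i2 in range(2**N2)] for i0 in range(2**N0)]


def exp_neg(z):
    """Approximate exp(z) for z <= 0 via range reduction and two lookups."""
    t = -z * LOG2E                 # exp(z) = 2^-t with t >= 0
    k = int(t)                     # integer part becomes a right shift
    if k > 60:                     # assumed cutoff: treat tiny values as zero
        return 0.0
    q = int((t - k) * 2**N)        # fraction truncated to N bits
    i0 = q >> (N1 + N2)
    i1 = (q >> N2) & (2**N1 - 1)
    i2 = q & (2**N2 - 1)
    return math.ldexp(TIV[i0][i1] + TO[i0][i2], -k)


def softmax_lut(xs):
    """SafeSoftmax: subtract the maximum so every exponent input is <= 0."""
    m = max(xs)
    es = [exp_neg(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


print(softmax_lut([1.0, 2.0, 3.0, -4.0]))
```

The two tables here hold 64 entries each, compared with 512 entries for a single direct table over the same 9-bit fraction, which is the kind of saving a bipartite scheme is meant to provide.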

Taxonomy

Topics: Embedded Systems Design Techniques · Parallel Computing and Optimization Techniques · Advanced Neural Network Applications