# Hardware-Oriented Approximations of Softmax and RMSNorm for Efficient Transformer Inference

**Authors:** Yiwen Kang, Dong Wang

PMC · DOI: 10.3390/mi17010084 · 2026-01-07

## TL;DR

This paper introduces efficient hardware methods to approximate Softmax and RMSNorm in Transformers, reducing resource use and latency without sacrificing accuracy.

## Contribution

Novel hardware-oriented approximations for Softmax and RMSNorm with optimized bit-width and error compensation for efficient Transformer inference.

## Key findings

- A bipartite lookup table with range reduction reduces Softmax computation cost and maintains accuracy.
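- RMSNorm is accelerated with a leading-one detector (LOD) that decomposes the reciprocal square root into mantissa and exponent parts, so the two table lookups run in parallel and a single multiplication combines them.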
- An FPGA-based accelerator achieves low latency and power consumption with reduced resource usage.
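
The first finding can be illustrated with a minimal Python sketch. It assumes a base-2 range reduction on top of the SafeSoftmax max-subtraction: `exp(x - max)` is written as `2**(-k) * 2**(-f)` with integer `k` (a shift) and fraction `f` in [0, 1), and only `2**(-f)` goes through a bipartite table. The field widths `N0`, `N1`, `N2`, the function names, and the floating-point table entries are illustrative assumptions, not the paper's configuration, and the final normalization uses an ordinary division for brevity.

```python
import math

# Assumed field widths (bits) of the quantised fraction f = x0.x1.x2
N0, N1, N2 = 3, 3, 3
N = N0 + N1 + N2
LOG2E = math.log2(math.e)
MID2 = ((1 << N2) - 1) / 2.0  # centre of the x2 field


def g(f):
    """Function on the reduced range: 2**(-f) for f in [0, 1)."""
    return 2.0 ** (-f)


def g_prime(f):
    """Derivative of g, used only to fill the offset table."""
    return -math.log(2.0) * 2.0 ** (-f)


# Table of initial values, indexed by (x0, x1): g at the centre of the x2 span.
TIV = [g(((i << N2) + MID2) / (1 << N)) for i in range(1 << (N0 + N1))]

# Table of offsets, indexed by (x0, x2): one slope per x0 segment times the x2 offset.
TO = [
    [g_prime((x0 + 0.5) / (1 << N0)) * (x2 - MID2) / (1 << N)
     for x2 in range(1 << N2)]
    for x0 in range(1 << N0)
]


def exp2_neg_frac(q):
    """Bipartite approximation of 2**(-q / 2**N): two lookups and one addition."""
    hi = q >> N2                 # (x0, x1) selects the initial value
    x0 = q >> (N1 + N2)          # x0 selects the slope segment
    x2 = q & ((1 << N2) - 1)     # x2 selects the offset
    return TIV[hi] + TO[x0][x2]


def softmax_lut(xs):
    """SafeSoftmax with base-2 range reduction and a bipartite table."""
    m = max(xs)
    exps = []
    for x in xs:
        t = (m - x) * LOG2E              # t >= 0 after max-subtraction
        k = int(t)                       # integer part, a right shift in hardware
        q = int((t - k) * (1 << N))      # fraction truncated to N bits
        exps.append(math.ldexp(exp2_neg_frac(q), -k))
    total = sum(exps)
    return [e / total for e in exps]
```

With these widths the two tables hold 64 + 64 entries instead of the 512 a direct 9-bit table would need, which is the storage saving a bipartite scheme aims for.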

## Abstract

With the rapid advancement of Transformer-based large language models (LLMs), these models have found widespread application in industrial domains such as code generation and non-functional requirement (NFR) classification in software engineering. However, recent research has primarily focused on optimizing linear matrix operations, while nonlinear operators remain relatively underexplored. This paper proposes hardware-efficient approximation and acceleration methods for the Softmax and RMSNorm operators to reduce resource cost and accelerate Transformer inference while maintaining model accuracy. For the Softmax operator, an additional range reduction on top of the SafeSoftmax technique makes a bipartite lookup table (LUT) approximation practical. The bit-width configuration is optimized through Pareto frontier analysis to balance precision and hardware cost, and an error compensation mechanism is further applied to preserve numerical accuracy. The Softmax division is reformulated as a subtraction in the logarithmic domain, implemented with a small lookup table driven by a leading-one detector (LOD), which eliminates expensive dividers. For RMSNorm, the LOD is further leveraged to decompose the reciprocal square root into mantissa and exponent parts, enabling parallel table lookups followed by a single multiplication. Based on these optimizations, an FPGA-based pipelined accelerator is implemented, achieving low operator-level latency and power consumption with significantly reduced hardware resource usage while preserving model accuracy.
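
The RMSNorm decomposition described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the paper's design: the fixed-point format (`FRAC` fractional bits), the mantissa index width `MBITS`, the table contents, and all function names are hypothetical. The leading-one position splits the mean square into `2**e * m` with `m` in [1, 2), one table returns `m**-0.5`, a second table returns `2**(-e/2)`, and one multiplication combines them.

```python
import math

FRAC = 16    # assumed fractional bits of the fixed-point mean square
MBITS = 6    # assumed number of mantissa bits used as table index

# Mantissa table: 1/sqrt(m) sampled at the centre of each mantissa bucket.
MANT_LUT = [1.0 / math.sqrt(1.0 + (i + 0.5) / (1 << MBITS))
            for i in range(1 << MBITS)]

# Exponent table: 2**(-e/2) for every exponent the fixed-point input can take.
EXP_LUT = {e: 2.0 ** (-e / 2.0) for e in range(-FRAC, 32)}


def lod(v):
    """Leading-one detector: bit index of the most significant set bit (v > 0)."""
    return v.bit_length() - 1


def rsqrt_lod(v_fixed):
    """Approximate 1/sqrt(v) for v = v_fixed / 2**FRAC, with v_fixed > 0."""
    p = lod(v_fixed)
    e = p - FRAC                          # exponent of the real value
    below = v_fixed - (1 << p)            # bits below the leading one
    if p >= MBITS:
        idx = below >> (p - MBITS)        # keep the top MBITS mantissa bits
    else:
        idx = below << (MBITS - p)        # pad short mantissas with zeros
    # The two lookups are independent; one multiplication combines them.
    return MANT_LUT[idx] * EXP_LUT[e]


def rmsnorm_lod(xs, gamma, eps=1e-6):
    """RMSNorm using the LOD-based reciprocal square root."""
    mean_sq = sum(x * x for x in xs) / len(xs) + eps
    v_fixed = max(1, int(mean_sq * (1 << FRAC)))   # quantise to fixed point
    r = rsqrt_lod(v_fixed)
    return [g * x * r for x, g in zip(xs, gamma)]
```

With a 6-bit mantissa index the relative error of the reciprocal square root in this sketch stays below roughly 0.4%; a real design would choose the widths against its own accuracy target.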

## Figures

9 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12843610/full.md

---
Source: https://tomesphere.com/paper/PMC12843610