Efficient Softmax Approximation for Deep Neural Networks with Attention Mechanism
Ihor Vasyltsov, Wooseok Chang

TL;DR
This paper introduces two LUT-based softmax approximation methods tailored for hardware-efficient implementation in modern DNNs with attention mechanisms, maintaining high accuracy across various AI tasks.
Contribution
It proposes small LUT-based softmax approximation techniques suitable for hardware acceleration in attention-based DNNs, with minimal accuracy loss.
Findings
The LUTs are small (about 700 bytes in total) because the ranges of the softmax numerators and denominators are stable when normalization is applied to the input.
8-bit approximation achieves less than 1% accuracy loss.
Validated across multiple AI tasks and models with diverse benchmarks.
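A short worked derivation may help explain why the numerator and denominator ranges stay stable. It assumes that "normalization" means subtracting the maximum input before exponentiation, a common convention that the summary above does not spell out. The symbols x_i, m, and n are introduced here for illustration only.

```latex
% Assumption: normalization = subtracting the maximum input m.
\[
  m = \max_j x_j, \qquad
  \operatorname{softmax}(x)_i
    = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}
    = \frac{e^{x_i - m}}{\sum_{j=1}^{n} e^{x_j - m}}
\]
% Since x_i - m <= 0 for every i, each numerator is bounded:
\[
  0 < e^{x_i - m} \le 1
\]
% The largest input contributes exactly 1, so for n inputs the denominator satisfies:
\[
  1 \le \sum_{j=1}^{n} e^{x_j - m} \le n
\]
```

With both quantities confined to fixed intervals, a table only has to cover a bounded input range, which is consistent with the small table size reported above.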
Abstract
Custom hardware (HW) for accelerating the inference of deep neural networks (DNNs) has advanced rapidly. Previously, the softmax layer was not a main concern for DNN-accelerating HW, because its share of the computation is relatively small in multi-layer perceptrons and convolutional neural networks. However, as attention mechanisms are widely used in modern DNNs, a cost-efficient implementation of the softmax layer is becoming very important. In this paper, we propose two methods to approximate softmax computation, both based on lookup tables (LUTs). The required LUT size is quite small (about 700 bytes) because the ranges of the softmax numerators and denominators are stable if normalization is applied to the input. We have validated the proposed technique over different AI tasks (object detection, machine translation, sentiment analysis, and semantic…
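The general idea of a LUT-based softmax can be illustrated with a minimal Python sketch. This is not either of the two methods proposed in the paper, whose details are not given in this summary. It is a generic illustration that assumes max-subtraction normalization, a hypothetical clipping range `X_MIN`, and a hypothetical 256-entry table of 8-bit values. The final division is left exact for simplicity, whereas a hardware design would likely approximate it as well.

```python
import math

# Hypothetical parameters chosen for illustration; not taken from the paper.
EXP_BITS = 8                      # index width of the exponent table
X_MIN = -8.0                      # inputs below this (after max-subtraction) are clipped
LUT_SIZE = 1 << EXP_BITS          # 256 entries
STEP = -X_MIN / (LUT_SIZE - 1)    # spacing between table sample points

# Table of exp(x) for x in [X_MIN, 0], stored as 8-bit unsigned integers (0..255).
EXP_LUT = [round(255 * math.exp(X_MIN + i * STEP)) for i in range(LUT_SIZE)]


def lut_exp(d):
    """Return an 8-bit approximation of exp(d) for d <= 0 via table lookup."""
    d = max(d, X_MIN)                     # clip to the range the table covers
    idx = round((d - X_MIN) / STEP)       # nearest table index
    return EXP_LUT[idx]


def lut_softmax(xs):
    """Approximate softmax using the exponent table (exact final division)."""
    m = max(xs)                           # normalization: subtract the maximum
    nums = [lut_exp(x - m) for x in xs]   # each numerator is in 0..255
    den = sum(nums)                       # at least 255, since exp(0) maps to 255
    return [n / den for n in nums]


def exact_softmax(xs):
    """Reference softmax in floating point, for comparison."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    logits = [1.0, 2.0, 3.0]
    approx = lut_softmax(logits)
    exact = exact_softmax(logits)
    for a, e in zip(approx, exact):
        print(f"approx={a:.4f} exact={e:.4f} abs_err={abs(a - e):.4f}")
```

The table in this sketch occupies 256 bytes. That figure follows from the illustrative parameters above and should not be read as a breakdown of the roughly 700 bytes reported in the paper.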
Taxonomy
Topics: Advanced Neural Network Applications · Topic Modeling · Adversarial Robustness in Machine Learning
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Layer Normalization · Adam · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Residual Connection
