ITA: An Energy-Efficient Attention and Softmax Accelerator for Quantized Transformers
Gamze İslamoğlu, Moritz Scherer, Gianna Paulin, Tim Fischer, Victor J.B. Jung, Angelo Garofalo, Luca Benini

TL;DR
This paper introduces ITA, an energy-efficient hardware accelerator for quantized transformer models that targets embedded systems. Its integer-only softmax is computed on the fly in streaming mode, which reduces data movement and energy use.
Contribution
The paper presents a novel transformer accelerator architecture that exploits 8-bit quantization and a streaming integer softmax, achieving high energy and area efficiency.
Findings
Achieves 16.9 TOPS/W energy efficiency.
Outperforms existing accelerators in area efficiency.
Operates effectively on embedded systems with low voltage.
Abstract
Transformer networks have emerged as the state-of-the-art approach for natural language processing tasks and are gaining popularity in other domains such as computer vision and audio processing. However, the efficient hardware acceleration of transformer models poses new challenges due to their high arithmetic intensities, large memory requirements, and complex dataflow dependencies. In this work, we propose ITA, a novel accelerator architecture for transformers and related models that targets efficient inference on embedded systems by exploiting 8-bit quantization and an innovative softmax implementation that operates exclusively on integer values. By computing on-the-fly in streaming mode, our softmax implementation minimizes data movement and energy consumption. ITA achieves competitive energy efficiency with respect to state-of-the-art transformer accelerators with 16.9 TOPS/W,…
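The idea of an integer-only softmax computed in streaming mode can be illustrated with a small sketch. This is not ITA's actual datapath, which the abstract does not specify. It is a minimal illustration under stated assumptions: `exp()` is replaced by a power of two so that only subtracts, shifts, and adds are needed, the logits arrive in chunks, and a running maximum and running denominator are kept so the full row is never needed at once. The function name `streaming_int_softmax` and the parameters `shift` and `frac_bits` are hypothetical.

```python
def streaming_int_softmax(chunks, shift=3, frac_bits=8):
    """Integer-only softmax over int8 logits delivered chunk by chunk.

    Assumption (not taken from the paper): exp() is approximated by
    2 ** (-((max - x) >> shift)), so every step is an integer
    subtract, shift, add, multiply, or floor division.
    """
    one = 1 << frac_bits      # fixed-point representation of 1.0
    running_max = None
    denom = 0                 # fixed-point sum of the power-of-two terms

    # Pass 1: accumulate the denominator while the logits stream in.
    for chunk in chunks:
        chunk_max = max(chunk)
        if running_max is None:
            running_max = chunk_max
        elif chunk_max > running_max:
            # Rescale the partial sum to the new maximum with a shift.
            # Approximate: flooring can differ by one step from a
            # full recomputation.
            denom >>= (chunk_max - running_max) >> shift
            running_max = chunk_max
        for x in chunk:
            denom += one >> ((running_max - x) >> shift)

    # Pass 2: normalize to unsigned 8-bit probabilities (255 ~ 1.0).
    out = []
    for chunk in chunks:
        for x in chunk:
            num = one >> ((running_max - x) >> shift)
            out.append((num * 255) // denom)
    return out


# Two chunks of int8 logits; the maximum changes in the second chunk.
probs = streaming_int_softmax([[10, 2], [18, -6]])
print(probs)  # [68, 34, 136, 17]
```

The denominator is always at least `one`, because the element equal to the running maximum contributes a full `one` after any rescale, so the final division is safe. A hardware design would pick the shift and bit widths to match its quantization scheme; the values here are only for illustration.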
Taxonomy
Methods: Softmax
