Hardware-Software Co-Design for Accelerating Transformer Inference Leveraging Compute-in-Memory

Dong Eun Kim, Tanvi Sharma, and Kaushik Roy

arXiv:2502.12344 · cs.AR · February 19, 2025

Open Access

TL;DR

This paper introduces HASTILY, a hardware-software co-designed compute-in-memory (CIM) accelerator that speeds up transformer attention, and softmax in particular, while reducing on-chip memory and energy requirements.

Contribution

The paper presents a novel CIM-based architecture with unified compute and lookup modules, enabling efficient softmax acceleration and linearized attention computation in transformers.
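
To illustrate why lookup capability matters for softmax, here is a minimal sketch of a softmax whose exponentials are read from a precomputed table instead of being evaluated directly. The table size, the input range, and the names `build_exp_table` and `lut_softmax` are assumptions made for this example; they are not taken from the paper and do not describe how the UCLMs are actually organized.

```python
import math

def build_exp_table(entries=256, min_input=-8.0):
    """Precompute exp(x) at evenly spaced points in [min_input, 0].

    Both `entries` and `min_input` are illustrative choices, not values
    from the paper.
    """
    step = -min_input / (entries - 1)
    table = [math.exp(min_input + i * step) for i in range(entries)]
    return table, step

def lut_softmax(scores, table, step, min_input=-8.0):
    """Softmax that replaces each exp() call with a table lookup."""
    peak = max(scores)
    weights = []
    for s in scores:
        # After subtracting the maximum, every input is <= 0.
        # Clamp very negative inputs to the bottom of the table range.
        shifted = max(s - peak, min_input)
        index = round((shifted - min_input) / step)
        weights.append(table[index])
    total = sum(weights)
    return [w / total for w in weights]

table, step = build_exp_table()
print(lut_softmax([1.0, 2.0, 3.0, 0.5], table, step))
```

The result is an approximation: accuracy depends on the table resolution and on how far below the maximum the inputs are clamped.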

Findings

01. Achieves 4.4x-9.8x throughput improvement over an Nvidia A40 GPU.

02. Provides 16x-36x energy-efficiency gains over the GPU.

03. Reduces memory dependence from quadratic to linear with respect to sequence length.
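
One well-known way to avoid storing the full table of attention scores is to process keys one at a time and keep only a running maximum, a running softmax denominator, and a running weighted sum. The sketch below shows that generic "online softmax" idea for a single query; it is an assumption-laden illustration, not the paper's pipelining scheme, and the function names are hypothetical. For n queries of dimension d, the working state is on the order of n·d values rather than n² scores.

```python
import math

def streaming_attention_row(query, keys, values):
    """Attention output for one query without storing the score row."""
    scale = 1.0 / math.sqrt(len(query))
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * len(values[0])
    for key, value in zip(keys, values):
        score = scale * sum(q * k for q, k in zip(query, key))
        new_max = max(running_max, score)
        # Rescale earlier partial results to the new maximum.
        # On the first key this factor is exp(-inf) = 0.0.
        correction = math.exp(running_max - new_max)
        weight = math.exp(score - new_max)
        denom = denom * correction + weight
        acc = [a * correction + weight * v for a, v in zip(acc, value)]
        running_max = new_max
    return [a / denom for a in acc]

def reference_attention_row(query, keys, values):
    """Baseline that materializes every score before the softmax."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]
```

Both functions should agree up to floating-point rounding; the streaming version simply never holds more than one score at a time.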

Abstract

Transformers have become the backbone of neural network architectures for most machine learning applications. Their widespread use has prompted multiple efforts to accelerate attention, the basic building block of transformers. This paper tackles the challenges of accelerating attention through a hardware-software co-design approach that leverages compute-in-memory (CIM) architecture. In particular, our energy- and area-efficient CIM-based accelerator, named HASTILY, aims to accelerate softmax computation, an integral operation in attention, and to minimize its high on-chip memory requirements, which grow quadratically with input sequence length. Our architecture consists of novel CIM units called unified compute and lookup modules (UCLMs) that integrate both lookup and multiply-accumulate functionality within the same SRAM array, incurring minimal area overhead over…

Taxonomy

Topics: Neural Networks and Applications · Advanced Memory and Neural Computing · Fault Detection and Control Systems