A Digital SRAM-Based Compute-In-Memory Macro for Weight-Stationary Dynamic Matrix Multiplication in Transformer Attention Score Computation
Jianyi Yu, Tengxiao Wang, Yuxuan Wang, Xiang Fu, Fei Qiao, Ying Wang, Rui Yuan, Liyuan Liu, Cong Shi

TL;DR
This paper presents a digital compute-in-memory macro for Transformer attention score computation, achieving high energy and area efficiency by reconstructing dynamic matrix multiplication as static and optimizing data sparsity handling.
Contribution
It introduces a novel 2-input static matrix multiplication method with zero-value bit skipping, significantly improving energy and area efficiency in Transformer CIM implementations.
Findings
Achieves 42.27 GOPS at 1.24 mW in 65-nm process
Delivers 34.1 TOPS/W energy efficiency
Outperforms CPUs and GPUs by 25x and 13x respectively
Abstract
Compute-in-memory (CIM) techniques are widely employed in energy-efficient artificial intelligent (AI) processors. They alleviate power and latency bottlenecks caused by extensive data movements between compute and storage units. To extend these benefits to Transformer, this brief proposes a digital CIM macro to compute attention score. To eliminate dynamic matrix multiplication (MM), we reconstruct the computation as static MM using a combined QK-weight matrix, so that inputs can be directly fed to a single CIM macro to obtain the score results. However, this introduces a new challenge of 2-input static MM. The computation is further decomposed into four groups of bit-serial logical and addition operations. This allows 2-input to directly activate the word line via AND gate, thus realizing 2-input static MM with minimal overhead. A hierarchical zero-value bit skipping mechanism is…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsParallel Computing and Optimization Techniques · Ferroelectric and Negative Capacitance Devices · Low-power high-performance VLSI design
