An Analog and Digital Hybrid Attention Accelerator for Transformers with Charge-based In-memory Computing
Ashkan Moradifirouzabadi, Divya Sri Dodla, Mingu Kang

TL;DR
This paper introduces a hybrid analog-digital accelerator for Transformer attention, fabricated in 65nm CMOS. A charge-based in-memory computing core prunes low-score tokens at runtime, improving energy and area efficiency.
Contribution
The paper presents an analog computing-in-memory (CIM) core that prunes low-score tokens at runtime, paired with a digital processor that performs precise computation only on the remaining tokens, which prevents accuracy degradation.
Findings
Peak energy efficiency of 14.8 TOPS/W in analog core
Peak area efficiency of 976.6 GOPS/mm² in analog core
~75% of low-score tokens pruned on average at runtime
Abstract
The attention mechanism is a key computing kernel of Transformers, calculating pairwise correlations across the entire input sequence. The computing complexity and frequent memory access in computing self-attention put a huge burden on the system, especially as the sequence length increases. This paper presents an analog and digital hybrid processor to accelerate the attention mechanism for Transformers in 65nm CMOS technology. We propose an analog computing-in-memory (CIM) core, which prunes ~75% of low-score tokens on average during runtime at ultra-low power and delay. Additionally, a digital processor performs precise computations only for the ~25% of tokens left unpruned by the analog CIM core, preventing accuracy degradation. Measured results show peak energy efficiency of 14.8 and 1.65 TOPS/W, and peak area efficiency of 976.6 and 79.4 GOPS/mm² in the analog core and…
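The two-stage flow described in the abstract can be illustrated in software. The sketch below is a rough functional analogy, not the authors' circuit or algorithm: a low-precision dot product stands in for the analog CIM core, and a full-precision softmax over the surviving tokens stands in for the digital processor. The function names, the 4-bit quantization, and the fixed top-25% selection rule are all assumptions made for illustration; the paper reports ~75% pruning only as an average, so the actual selection criterion may differ.

```python
import math
import random


def quantize(vec, peak, bits=4):
    """Map floats to signed `bits`-bit integers using a shared scale `peak`.

    Hypothetical stand-in for the limited precision of an analog core.
    """
    levels = 2 ** (bits - 1) - 1
    return [round(x / peak * levels) for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def hybrid_attention(query, keys, values, keep_ratio=0.25, bits=4):
    """Single-query attention with coarse pruning, then precise softmax."""
    # Stage 1 (coarse, "analog" stand-in): low-precision query-key scores.
    q_peak = max(abs(x) for x in query) or 1.0
    k_peak = max(abs(x) for k in keys for x in k) or 1.0
    q_lo = quantize(query, q_peak, bits)
    coarse = [dot(q_lo, quantize(k, k_peak, bits)) for k in keys]

    # Keep the highest-scoring tokens (assumed rule: fixed fraction, top-k).
    n_keep = max(1, math.ceil(keep_ratio * len(keys)))
    ranked = sorted(range(len(keys)), key=lambda i: coarse[i], reverse=True)
    kept = sorted(ranked[:n_keep])

    # Stage 2 (precise, "digital" stand-in): scaled dot-product softmax
    # computed in full precision over the surviving tokens only.
    scale = 1.0 / math.sqrt(len(query))
    scores = [dot(query, keys[i]) * scale for i in kept]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(values[0])
    output = [sum(w * values[i][d] for w, i in zip(weights, kept))
              for d in range(dim)]
    return output, kept


if __name__ == "__main__":
    random.seed(0)
    d, n = 16, 64
    query = [random.gauss(0, 1) for _ in range(d)]
    keys = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    values = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    out, kept = hybrid_attention(query, keys, values)
    print(len(kept), "of", n, "tokens reach the precise stage")
```

With `keep_ratio=0.25`, the precise stage touches a quarter of the keys and values, which is where the reduction in computation and memory access comes from in this simplified model. Because the softmax concentrates weight on high-score tokens, dropping low-score ones changes the output only slightly when the coarse ranking is reliable.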
Taxonomy
Topics: Advanced Memory and Neural Computing · CCD and CMOS Imaging Sensors · Neural Networks and Reservoir Computing
Methods: Softmax · Attention Is All You Need
