An Analog and Digital Hybrid Attention Accelerator for Transformers with Charge-based In-memory Computing

Ashkan Moradifirouzabadi; Divya Sri Dodla; Mingu Kang

arXiv:2409.04940 · cs.AR · October 31, 2024

TL;DR

This paper introduces a hybrid analog-digital accelerator for the transformer attention mechanism in 65nm CMOS. A charge-based analog in-memory computing core prunes low-score tokens at runtime, and a digital processor computes precisely on the remaining tokens, improving energy and area efficiency.

Contribution

It presents an analog CIM core that prunes low-score tokens at runtime, paired with a digital processor that performs precise computation only on the remaining tokens to prevent accuracy degradation.

Findings

01 Peak energy efficiency of 14.8 TOPS/W in the analog core

02 Peak area efficiency of 976.6 GOPS/mm² in the analog core

03 Runtime pruning of ~75% of tokens on average

Abstract

The attention mechanism is a key computing kernel of Transformers, calculating pairwise correlations across the entire input sequence. The computing complexity and frequent memory access in computing self-attention put a huge burden on the system especially when the sequence length increases. This paper presents an analog and digital hybrid processor to accelerate the attention mechanism for transformers in 65nm CMOS technology. We propose an analog computing-in-memory (CIM) core, which prunes ~75% of low-score tokens on average during runtime at ultra-low power and delay. Additionally, a digital processor performs precise computations only for ~25% unpruned tokens selected by the analog CIM core, preventing accuracy degradation. Measured results show peak energy efficiency of 14.8 and 1.65 TOPS/W, and peak area efficiency of 976.6 and 79.4 GOPS/mm² in the analog core and…
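
The two-stage flow described in the abstract (a cheap approximate pass that discards low-score tokens, then a precise pass over the survivors) can be illustrated in software. The following is a minimal sketch under stated assumptions, not the authors' circuit or algorithm: the coarse integer quantization is only a stand-in for the low-precision analog CIM estimate, the fixed `keep_ratio` top-k selection is an assumption (the abstract reports ~75% pruning on average, and its selection rule is not given here), and all function and parameter names are hypothetical.

```python
import math


def quantize(vec, scale, bits):
    """Map floats to signed integers of `bits` bits (software stand-in
    for a low-precision analog estimate; an assumption, not the paper's method)."""
    levels = 2 ** (bits - 1) - 1
    return [round(x / scale * levels) for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def hybrid_attention(q, keys, values, keep_ratio=0.25, bits=4):
    """Single-query attention with approximate pruning, then precise softmax."""
    # Stage 1 (approximate): rank tokens by a coarse, quantized score.
    q_scale = max(abs(x) for x in q) or 1.0
    k_scale = max(abs(x) for k in keys for x in k) or 1.0
    q_lo = quantize(q, q_scale, bits)
    approx = [dot(q_lo, quantize(k, k_scale, bits)) for k in keys]

    # Keep a fixed fraction of the highest-scoring tokens (assumed rule).
    n_keep = max(1, math.ceil(keep_ratio * len(keys)))
    ranked = sorted(range(len(keys)), key=lambda i: approx[i], reverse=True)
    kept = sorted(ranked[:n_keep])

    # Stage 2 (precise): full-precision scaled softmax over kept tokens only.
    inv_sqrt_d = 1.0 / math.sqrt(len(q))
    scores = [dot(q, keys[i]) * inv_sqrt_d for i in kept]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(values[0])
    out = [sum(w * values[i][d] for w, i in zip(weights, kept)) for d in range(dim)]
    return out, kept
```

With `keep_ratio=0.25`, the precise stage evaluates the softmax and the weighted sum over one quarter of the tokens, which is the source of the savings in this sketch; setting `keep_ratio=1.0` disables pruning and recovers ordinary softmax attention for the query.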

Taxonomy

Topics: Advanced Memory and Neural Computing · CCD and CMOS Imaging Sensors · Neural Networks and Reservoir Computing

Methods: Softmax · Attention Is All You Need