Sparse Attention Acceleration with Synergistic In-Memory Pruning and On-Chip Recomputation

Amir Yazdanbakhsh; Ashkan Moradifirouzabadi; Zheng Li; Mingu Kang

arXiv:2209.00606·cs.LG·September 2, 2022·1 cites

Open Access

TL;DR

SPRINT is a specialized accelerator that combines in-memory pruning with on-chip recomputation to speed up self-attention in transformer models and reduce its energy consumption while maintaining accuracy.

Contribution

This work introduces SPRINT, a novel accelerator architecture that reduces self-attention complexity from quadratic to linear using in-memory pruning and digital recomputation.

Findings

01. 7.5x speedup on transformer models

02. 19.6x energy reduction

03. Minimal accuracy degradation (0.36%)

Abstract

As its core computation, a self-attention mechanism gauges pairwise correlations across the entire input sequence. Despite favorable performance, calculating pairwise correlations is prohibitively costly. While recent work has shown the benefits of runtime pruning of elements with low attention scores, the quadratic complexity of self-attention mechanisms and their on-chip memory capacity demands are overlooked. This work addresses these constraints by architecting an accelerator, called SPRINT, which leverages the inherent parallelism of ReRAM crossbar arrays to compute attention scores in an approximate manner. Our design prunes the low attention scores using a lightweight analog thresholding circuitry within ReRAM, enabling SPRINT to fetch only a small subset of relevant data to on-chip memory. To mitigate potential negative repercussions for model accuracy, SPRINT re-computes the…
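The prune-then-recompute flow described above can be sketched in software. This is only an illustrative sketch under stated assumptions, not the SPRINT hardware design: coarse quantization stands in for the approximate ReRAM crossbar computation, a fixed scalar threshold stands in for the analog thresholding circuitry, and, because the abstract is truncated here, the recomputation step is assumed to mean recomputing exact scores for the retained keys only. The names `quantize`, `pruned_attention`, and `threshold` are hypothetical.

```python
import math


def quantize(vec, levels=8):
    """Coarsely quantize a vector (assumption: a stand-in for low-precision analog compute)."""
    scale = max(abs(x) for x in vec) or 1.0
    return [round(x / scale * levels) / levels * scale for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pruned_attention(query, keys, values, threshold):
    """Attention for one query: approximate scoring, pruning, exact recomputation."""
    # Step 1: approximate scores from coarsely quantized operands.
    q_approx = quantize(query)
    approx_scores = [dot(q_approx, quantize(k)) for k in keys]

    # Step 2: prune keys whose approximate score falls below the threshold.
    kept = [i for i, s in enumerate(approx_scores) if s >= threshold]
    if not kept:
        # Hypothetical fallback: always keep the best-scoring key.
        kept = [max(range(len(keys)), key=lambda i: approx_scores[i])]

    # Step 3: recompute full-precision scaled scores for the survivors only.
    scale = 1.0 / math.sqrt(len(query))
    exact = [dot(query, keys[i]) * scale for i in kept]

    # Step 4: numerically stable softmax over the survivors, then weighted sum.
    peak = max(exact)
    weights = [math.exp(s - peak) for s in exact]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    output = [sum(w * values[i][d] for w, i in zip(weights, kept)) for d in range(dim)]
    return kept, output


query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [0.0, -1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [7.0, 7.0]]
kept, output = pruned_attention(query, keys, values, threshold=0.5)
print(kept, output)
```

In this sketch only the keys that survive pruning, and their values, are touched in steps 3 and 4, which mirrors the idea of fetching a small subset of relevant data to on-chip memory. How the threshold is chosen in the actual design is not stated in this excerpt.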

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Memory and Neural Computing · Ferroelectric and Negative Capacitance Devices

Methods: Pruning