Sparse Attention Acceleration with Synergistic In-Memory Pruning and On-Chip Recomputation
Amir Yazdanbakhsh, Ashkan Moradifirouzabadi, Zheng Li, Mingu Kang

TL;DR
SPRINT is a specialized accelerator that combines in-memory pruning and on-chip recomputation to significantly speed up and energy-efficiently execute self-attention mechanisms in transformer models, while maintaining accuracy.
Contribution
This work introduces SPRINT, a novel accelerator architecture that reduces self-attention complexity from quadratic to linear using in-memory pruning and digital recomputation.
Findings
7.5x speedup on transformer models
19.6x energy reduction
Minimal accuracy degradation (0.36%)
Abstract
As its core computation, a self-attention mechanism gauges pairwise correlations across the entire input sequence. Despite favorable performance, calculating pairwise correlations is prohibitively costly. While recent work has shown the benefits of runtime pruning of elements with low attention scores, the quadratic complexity of self-attention mechanisms and their on-chip memory capacity demands are overlooked. This work addresses these constraints by architecting an accelerator, called SPRINT, which leverages the inherent parallelism of ReRAM crossbar arrays to compute attention scores in an approximate manner. Our design prunes the low attention scores using a lightweight analog thresholding circuitry within ReRAM, enabling SPRINT to fetch only a small subset of relevant data to on-chip memory. To mitigate potential negative repercussions for model accuracy, SPRINT re-computes the…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Neural Network Applications · Advanced Memory and Neural Computing · Ferroelectric and Negative Capacitance Devices
MethodsPruning
