AccelTran: A Sparsity-Aware Accelerator for Dynamic Inference with Transformers
Shikhar Tuli, Niraj K. Jha

TL;DR
AccelTran introduces a sparsity-aware transformer accelerator with a dynamic pruning scheme, DynaTran, that enhances throughput and energy efficiency by reducing ineffectual computations during inference.
Contribution
This work presents DynaTran, a novel runtime activation pruning method, and AccelTran, an accelerator architecture optimized for transformer models that exploits the resulting sparsity for higher efficiency.
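To make the idea of runtime activation pruning concrete, here is a minimal sketch. It assumes pruning is a simple magnitude threshold: any activation whose absolute value falls below a threshold is set to zero. The names `prune_activations`, `sparsity`, and `tau` are hypothetical and chosen for illustration; this is not the authors' implementation.

```python
def prune_activations(matrix, tau):
    """Zero every entry whose magnitude is below tau.

    Assumption: pruning is a plain magnitude threshold; tau is a
    hypothetical hyperparameter, not a value taken from the paper.
    """
    return [[0.0 if abs(x) < tau else x for x in row] for row in matrix]


def sparsity(matrix):
    """Fraction of entries that are exactly zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for x in row if x == 0)
    return zeros / total if total else 0.0


# Toy activation matrix (made-up values, for illustration only).
activations = [
    [0.02, -0.70, 0.01, 0.40],
    [-0.03, 0.05, 0.90, -0.01],
]

pruned = prune_activations(activations, tau=0.1)
print(pruned)            # small-magnitude entries are now 0.0
print(sparsity(pruned))  # fraction of zeros after pruning
```

A per-element comparison like this is cheap, which is in line with the "low overhead" claim. In this sketch, raising `tau` trades accuracy for sparsity.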
Findings
DynaTran surpasses state-of-the-art pruning strategies in accuracy and sparsity.
AccelTran-Edge achieves 330K× higher throughput and 93K× lower energy consumption than a Raspberry Pi.
AccelTran-Server outperforms Energon with 5.73× higher throughput and 3.69× lower energy.
Abstract
Self-attention-based transformer models have achieved tremendous success in the domain of natural language processing. Despite their efficacy, accelerating the transformer is challenging due to its quadratic computational complexity and large activation sizes. Existing transformer accelerators attempt to prune its tokens to reduce memory access, albeit with high compute overheads. Moreover, previous works directly operate on large matrices involved in the attention operation, which limits hardware utilization. In order to address these challenges, this work proposes a novel dynamic inference scheme, DynaTran, which prunes activations at runtime with low overhead, substantially reducing the number of ineffectual operations. This improves the throughput of transformer inference. We further propose tiling the matrices in transformer operations along with diverse dataflows to improve data…
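The following sketch illustrates why tiling and activation sparsity combine well: once a matrix is split into tiles, a tile that is entirely zero contributes nothing and its multiply-accumulate work can be skipped. It is a software illustration under stated assumptions, not AccelTran's hardware dataflow; `tiled_matmul`, the `tile` size, and the skip rule are all hypothetical.

```python
def tiled_matmul(a, b, tile):
    """Multiply a (n x k) by b (k x m) tile by tile, skipping all-zero tiles of a.

    Returns the product and the number of tile products skipped.
    Assumption: skipping is decided per tile of `a`; real hardware may
    track sparsity at a different granularity.
    """
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    skipped = 0
    for i0 in range(0, n, tile):
        rows = range(i0, min(i0 + tile, n))
        for k0 in range(0, k, tile):
            inner = range(k0, min(k0 + tile, k))
            # An all-zero tile of `a` makes every product with it ineffectual.
            a_tile_is_zero = all(a[i][kk] == 0 for i in rows for kk in inner)
            for j0 in range(0, m, tile):
                if a_tile_is_zero:
                    skipped += 1
                    continue
                cols = range(j0, min(j0 + tile, m))
                for i in rows:
                    for kk in inner:
                        for j in cols:
                            out[i][j] += a[i][kk] * b[kk][j]
    return out, skipped


# Toy example: the right half of `a` is zero, e.g. after pruning.
a = [[1, 2, 0, 0],
     [3, 4, 0, 0]]
b = [[1, 0],
     [0, 1],
     [5, 5],
     [5, 5]]

product, skipped = tiled_matmul(a, b, tile=2)
print(product)
print(skipped)
```

With a 2×2 tile size, the zero half of `a` forms one tile, so one of the two tile products is skipped without changing the result.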
Taxonomy
Topics: Advanced Neural Network Applications · Topic Modeling · Ferroelectric and Negative Capacitance Devices
Methods: Pruning
