Multilayer Dataflow: Orchestrate Butterfly Sparsity to Accelerate Attention Computation
Haibin Wu, Wenming Li, Kai Yan, Zhihua Fan, Peiyang Wu, Yuqun Liu, Yanhuan Liu, Ziqing Qiang, Meng Wu, Kunming Liu, Xiaochun Ye, Dongrui Fan

TL;DR
This paper introduces a multilayer dataflow architecture optimized for butterfly sparsity in attention mechanisms, achieving significant speedup and energy efficiency improvements over existing accelerators.
Contribution
It proposes a hybrid butterfly-sparsity network and a scalable multilayer dataflow method to enhance attention computation efficiency on reconfigurable architectures.
Findings
Up to 14.34x speedup compared to Jetson Xavier NX.
11.14x energy efficiency improvement in attention workloads.
2.38x to 4.7x efficiency gains over state-of-the-art accelerators.
Abstract
Recent neural networks (NNs) with self-attention are competitive across AI domains, but the attention mechanism at their core brings massive computation and memory demands. To this end, various sparsity patterns have been introduced to reduce the quadratic computational complexity, among which structured butterfly sparsity has proven efficient at reducing computation while maintaining model accuracy. However, its complicated data access pattern degrades utilization and makes parallelism hard to exploit on general block-oriented architectures such as GPUs. Since reconfigurable dataflow architectures are known to offer better data reusability and architectural flexibility for general NN acceleration, we apply one to butterfly sparsity to obtain better computational efficiency for attention workloads. We first propose a hybrid butterfly-sparsity…
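To make the butterfly sparsity pattern mentioned in the abstract concrete, here is a minimal NumPy sketch of a generic butterfly factorization. It is an illustration under stated assumptions, not the paper's hybrid network or its dataflow mapping: the function names (make_butterfly, apply_butterfly, to_dense), the random weights, and the size n = 16 are all hypothetical. Each stage has two nonzeros per row, pairing index i with i XOR stride, so log2(n) stages cost 2·n·log2(n) multiplies instead of the n² of a dense matrix. The stride-dependent gather in apply_butterfly is the kind of irregular access the abstract says is hard to map onto block-oriented hardware.

```python
import numpy as np

def make_butterfly(n, seed=0):
    """Build log2(n) butterfly stages with random weights (illustrative only)."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two >= 2")
    rng = np.random.default_rng(seed)
    stages = []
    stride = 1
    while stride < n:
        diag = rng.standard_normal(n)  # weight on x[i]
        off = rng.standard_normal(n)   # weight on x[i XOR stride]
        stages.append((stride, diag, off))
        stride *= 2
    return stages

def apply_butterfly(stages, x):
    """Apply the stages in order; each stage touches two inputs per output."""
    idx = np.arange(x.shape[0])
    for stride, diag, off in stages:
        # The partner index changes with the stride at every stage.
        x = diag * x + off * x[idx ^ stride]
    return x

def to_dense(stages, n):
    """Expand the stages into the equivalent dense n x n matrix (for checking)."""
    idx = np.arange(n)
    dense = np.eye(n)
    for stride, diag, off in stages:
        factor = np.zeros((n, n))
        factor[idx, idx] = diag
        factor[idx, idx ^ stride] = off
        dense = factor @ dense  # later stages multiply on the left
    return dense

n = 16
stages = make_butterfly(n)
x = np.random.default_rng(1).standard_normal(n)

sparse_out = apply_butterfly(stages, x)
dense_out = to_dense(stages, n) @ x

nonzeros = 2 * n * len(stages)  # weights stored by the butterfly form
print(np.allclose(sparse_out, dense_out), nonzeros, n * n)
```

The sparse and dense paths compute the same product, so the saving comes purely from skipping zero weights; the gap between 2·n·log2(n) and n² widens as n grows.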
Taxonomy
Topics: Parallel Computing and Optimization Techniques
