Multilayer Dataflow: Orchestrate Butterfly Sparsity to Accelerate Attention Computation

Haibin Wu; Wenming Li; Kai Yan; Zhihua Fan; Peiyang Wu; Yuqun Liu; Yanhuan Liu; Ziqing Qiang; Meng Wu; Kunming Liu; Xiaochun Ye; Dongrui Fan

arXiv:2411.00734·cs.AR·November 26, 2024

TL;DR

This paper introduces a multilayer dataflow architecture optimized for butterfly sparsity in attention mechanisms, achieving significant speedup and energy efficiency improvements over existing accelerators.

Contribution

It proposes a hybrid butterfly-sparsity network and a scalable multilayer dataflow method to enhance attention computation efficiency on reconfigurable architectures.

Findings

1. Up to 14.34x speedup compared to Jetson Xavier NX.

2. 11.14x energy efficiency improvement in attention workloads.

3. 2.38x to 4.7x efficiency gains over state-of-the-art accelerators.

Abstract

Recent neural networks (NNs) with self-attention are competitive across different AI domains, but the essential attention mechanism brings massive computation and memory demands. To address this, various sparsity patterns have been introduced to reduce the quadratic computation complexity, among which structured butterfly sparsity has proven efficient at reducing computation while maintaining model accuracy. However, its complicated data access pattern degrades utilization and makes parallelism hard to exploit on general block-oriented architectures such as GPUs. Since reconfigurable dataflow architectures are known to offer better data reusability and architectural flexibility for general NN-based acceleration, we apply one to butterfly sparsity to obtain better computational efficiency for attention workloads. We first propose a hybrid butterfly-sparsity…
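To make the sparsity pattern concrete, here is a minimal NumPy sketch of a generic butterfly factorization: an N x N transform written as a product of log2(N) factors, each with two nonzeros per row. This illustrates the general pattern only, not the paper's hybrid butterfly-sparsity network or its dataflow mapping. The function names, the random weights, and the size N = 16 are all assumptions made for the example.

```python
import numpy as np

def butterfly_factor(n, stride, rng):
    """Build one n x n butterfly factor (hypothetical helper).

    Row i has exactly two nonzeros: column i and column i XOR stride.
    """
    factor = np.zeros((n, n))
    for i in range(n):
        # Weights are drawn from [0.5, 1.5) so they are never zero.
        factor[i, i] = rng.uniform(0.5, 1.5)
        factor[i, i ^ stride] = rng.uniform(0.5, 1.5)
    return factor

def butterfly_factors(n, seed=0):
    """Return the log2(n) factors with strides 1, 2, 4, ..., n/2."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two and at least 2")
    rng = np.random.default_rng(seed)
    stages = n.bit_length() - 1  # log2(n) for a power of two
    return [butterfly_factor(n, 1 << s, rng) for s in range(stages)]

def apply_butterfly(factors, x):
    """Apply the factors one stage at a time instead of one dense matmul."""
    for factor in factors:
        x = factor @ x
    return x

n = 16
factors = butterfly_factors(n)

# Nonzeros stored across all factors: 2 * n * log2(n), versus n * n dense.
sparse_nnz = sum(int(np.count_nonzero(f)) for f in factors)
dense_nnz = n * n

# The product of the factors is a fully dense matrix.
dense_equivalent = np.linalg.multi_dot(factors[::-1])

x = np.arange(n, dtype=float)
y = apply_butterfly(factors, x)
```

Each stage pairs element i with element i XOR stride, so the access stride doubles from one stage to the next. That changing stride is the irregular access pattern the abstract describes as hard to exploit on block-oriented hardware.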

Taxonomy

Topics: Parallel Computing and Optimization Techniques