MAS-Attention: Memory-Aware Stream Processing for Attention Acceleration on Resource-Constrained Edge Devices

Mohammadali Shakerdargah; Shan Lu; Chao Gao; Di Niu

arXiv:2411.17720 · cs.DC · May 19, 2025

Open Access

TL;DR

This paper introduces MAS-Attention, a method for accelerating exact attention inference on resource-limited edge devices by running heterogeneous compute units in parallel and optimizing workload scheduling. It reports up to 2.75x speedup and 54% energy reduction compared to FLAT, and up to 1.76x speedup on real hardware without accuracy loss.

Contribution

The paper proposes a novel multi-tiered tiling and workload scheduling scheme for exact attention acceleration on edge accelerators, addressing memory and compute constraints.
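To make the idea of tiled exact attention concrete, here is a minimal sketch in plain Python. It is not the MAS-Attention scheme: it shows only the generic technique of visiting keys and values in small tiles while keeping a running softmax, so the full score row never has to be held at once. The function names, the `tile` parameter, and the sample matrices are assumptions made for illustration; the paper's multi-tiered tiling and its scheduling across heterogeneous compute units are not modeled.

```python
import math


def naive_attention(Q, K, V):
    """Reference attention: builds the full score row for every query."""
    d = len(Q[0])
    dv = len(V[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(dv)])
    return out


def tiled_attention(Q, K, V, tile=2):
    """Same result, but K and V are read `tile` rows at a time.

    Only one tile of scores is live at any moment; a running max and a
    running sum rescale the partial output as new tiles arrive.
    """
    d = len(Q[0])
    dv = len(V[0])
    out = []
    for q in Q:
        running_max = -math.inf
        running_sum = 0.0
        acc = [0.0] * dv
        for start in range(0, len(K), tile):
            k_tile = K[start:start + tile]
            v_tile = V[start:start + tile]
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                      for k in k_tile]
            new_max = max(running_max, max(scores))
            # Rescale what was accumulated under the old max
            # (the factor is 0.0 on the first tile, when acc is still empty).
            scale = math.exp(running_max - new_max)
            running_sum *= scale
            acc = [a * scale for a in acc]
            for s, v in zip(scores, v_tile):
                w = math.exp(s - new_max)
                running_sum += w
                for j in range(dv):
                    acc[j] += w * v[j]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out


# Hypothetical sample data: 3 queries, 5 keys/values, head dimension 2.
Q = [[1.0, 0.0], [0.5, -1.0], [2.0, 1.0]]
K = [[0.2, 0.1], [1.0, -0.5], [-0.3, 0.8], [0.0, 0.0], [1.5, 1.5]]
V = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0], [0.5, 0.5], [-2.0, 4.0]]

reference = naive_attention(Q, K, V)
tiled = tiled_attention(Q, K, V, tile=2)
max_diff = max(abs(a - b)
               for r1, r2 in zip(reference, tiled)
               for a, b in zip(r1, r2))
print("max abs difference:", max_diff)
```

The two functions agree up to floating-point rounding for any tile size, which is the sense in which such schemes are "exact": tiling changes the memory footprint and the order of operations, not the mathematical result.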

Findings

01. Up to 2.75x speedup and 54% energy reduction compared to FLAT.

02. Achieves up to 1.76x speedup on real hardware without accuracy loss.

03. Effective workload scheduling and cache strategies improve edge attention processing.

Abstract

The advent of foundation models has revolutionized various fields, enabling unprecedented task accuracy and flexibility in computational linguistics, computer vision and other domains. The attention mechanism has become an essential component of foundation models, due to its superb capability of capturing correlations in a sequence. However, attention incurs quadratic complexity in memory and compute as the context length grows. Although many fusion-based exact attention acceleration algorithms have been developed for datacenter-grade GPUs and accelerators, leveraging multi-core parallelism and data locality, it remains a significant challenge to accelerate attention on resource-constrained edge neural accelerators with limited compute units and tightly constrained on-chip caches. In this paper, we propose a scheme for exact attention inference acceleration on memory-constrained edge…

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Cloud Computing and Resource Management · Advanced Data Storage Technologies

Methods: Softmax · Attention Is All You Need