Zen-Attention: A Compiler Framework for Dynamic Attention Folding on AMD NPUs

Aadesh Deshmukh; Venkata Yaswanth Raparti; Samuel Hsu

arXiv:2508.17593·cs.DC·August 26, 2025


TL;DR

Zen-Attention is a compiler framework that optimizes how transformer attention layers are deployed on AMD NPUs. By exploring layer folding, tiling, and data movement strategies, it cuts attention-layer latency by up to 4x and end-to-end network latency by up to 32%.

Contribution

The paper introduces a systematic framework for mapping dynamic attention layers onto AMD NPUs, navigating the large design space of folding, tiling, and data movement choices to improve performance and energy efficiency.

Findings

01. Up to 4x reduction in attention layer latency

02. Up to 32% improvement in end-to-end network latency

03. Enhanced mapping capabilities for varying input dimensions

Abstract

Transformer-based deep learning models are increasingly deployed on energy- and DRAM-bandwidth-constrained devices such as laptops and gaming consoles, which presents significant challenges in meeting the latency requirements of the models. The industry is turning to neural processing units (NPUs) for superior performance-per-watt (perf/watt); however, efficiently mapping dynamic attention layers to NPUs remains a challenging task. To optimize perf/watt, AMD XDNA NPUs employ software-managed caches and share system memory with the host. This requires substantial engineering effort to unlock efficient tiling, buffer allocation, and data movement and so extract the maximum efficiency from the device. This paper introduces Zen-Attention, a framework that optimizes DRAM bandwidth utilization in the attention layer of models by systematically exploring the complex design space of layer…
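
The abstract names tiling and data movement as the levers for reducing DRAM traffic in the attention layer. As a rough illustration only, the sketch below shows the generic online-softmax tiling formulation of attention, in which keys and values are consumed one tile at a time and the full score matrix is never materialized. It is not the paper's folding algorithm and uses no AMD XDNA API; the function names, the `tile` parameter, and the pure-Python list representation are all assumptions made for this example.

```python
import math


def naive_attention(q, k, v):
    """Reference version: build every score for a query row, then softmax, then weight V."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        row_max = max(scores)
        weights = [math.exp(s - row_max) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) / total
                    for c in range(len(v[0]))])
    return out


def tiled_attention(q, k, v, tile=2):
    """Hypothetical tiled version: visit K/V in tiles of `tile` rows.

    Per query row, only a running max, a running softmax denominator, and a
    running output accumulator are kept, so intermediate scores for the whole
    sequence never need to be stored at once.
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    d_v = len(v[0])
    out = []
    for qi in q:
        run_max = -math.inf
        run_sum = 0.0
        acc = [0.0] * d_v
        for start in range(0, len(k), tile):
            k_tile = k[start:start + tile]
            v_tile = v[start:start + tile]
            scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k_tile]
            new_max = max(run_max, max(scores))
            # Rescale earlier partial results to the new max (0.0 on the first tile).
            correction = math.exp(run_max - new_max)
            weights = [math.exp(s - new_max) for s in scores]
            run_sum = run_sum * correction + sum(weights)
            acc = [a * correction + sum(w * vj[c] for w, vj in zip(weights, v_tile))
                   for c, a in enumerate(acc)]
            run_max = new_max
        out.append([a / run_sum for a in acc])
    return out
```

Both functions compute the same result up to floating-point rounding; the tiled one simply bounds the working set by the tile size, which is the property that matters when intermediate buffers must fit in a small software-managed on-chip memory.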
