Towards Efficient Multi-Scale Deformable Attention on NPU

Chenghuan Huang; Zhigeng Xu; Chong Sun; Chen Li; Ziyang Ma

arXiv:2505.14022·cs.PF·May 21, 2025

TL;DR

This paper presents a hardware-aware co-design for multi-scale deformable attention (MSDA) on the Ascend NPU. By optimizing memory access and computation, it reports up to 7.3x faster end-to-end training than a grid-sample-based baseline.

Contribution

It introduces a co-design approach that rethinks memory and computation strategies for MSDA on NPUs, enabling efficient training and inference with hardware-aware optimizations.

Findings

1. Up to 5.9x speedup in the forward pass over the grid-sample-based baseline

2. Up to 8.9x speedup in the backward pass over the same baseline

3. Up to 7.3x speedup in end-to-end training over the same baseline

Abstract

Multi-scale deformable attention (MSDA) is a flexible and powerful feature extraction mechanism for visual tasks, but its random-access grid sampling strategy poses significant optimization challenges, especially on domain-specific accelerators such as NPUs. In this work, we present a co-design approach that systematically rethinks memory access and computation strategies for MSDA on the Ascend NPU architecture. With this co-design approach, our implementation supports both efficient forward and backward computation, is fully adapted for training workloads, and incorporates a suite of hardware-aware optimizations. Extensive experiments show that our solution achieves up to $5.9\times$ (forward), $8.9\times$ (backward), and $7.3\times$ (end-to-end training) speedup over the grid-sample-based baseline, and $1.9\times$, $2.4\times$, and $2.0\times$ acceleration over the latest vendor…
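To make the "random-access grid sampling" concrete, here is a minimal pure-Python sketch of what MSDA computes for one query and one scalar channel, following the general formulation popularized by Deformable DETR. It is not the paper's NPU kernel. The function names (`bilinear_sample`, `msda_single_query`), the list-of-lists data layout, and the `align_corners=False`-style coordinate mapping are all assumptions chosen for illustration.

```python
from math import floor


def bilinear_sample(feat, x, y):
    """Bilinearly sample a 2D map (list of rows) at pixel coords (x, y).

    Neighbors that fall outside the map contribute zero (zero padding).
    """
    h, w = len(feat), len(feat[0])
    x0, y0 = floor(x), floor(y)
    fx, fy = x - x0, y - y0
    out = 0.0
    for yi, wy in ((y0, 1.0 - fy), (y0 + 1, fy)):
        for xi, wx in ((x0, 1.0 - fx), (x0 + 1, fx)):
            if 0 <= xi < w and 0 <= yi < h:
                # Data-dependent gather: the address depends on (x, y).
                out += wx * wy * feat[yi][xi]
    return out


def msda_single_query(levels, locations, weights):
    """Hypothetical single-query, single-channel MSDA.

    levels:    list of 2D feature maps, one per scale
    locations: locations[l][p] = (u, v), normalized to [0, 1]
    weights:   weights[l][p], attention weights summing to 1 over all l, p
    """
    out = 0.0
    for feat, locs, ws in zip(levels, locations, weights):
        h, w = len(feat), len(feat[0])
        for (u, v), a in zip(locs, ws):
            # Assumed mapping from normalized to pixel coordinates.
            x = u * w - 0.5
            y = v * h - 0.5
            out += a * bilinear_sample(feat, x, y)
    return out


levels = [[[1.0, 2.0], [3.0, 4.0]], [[10.0]]]
locations = [[(0.5, 0.5)], [(0.5, 0.5)]]
weights = [[0.5], [0.5]]
result = msda_single_query(levels, locations, weights)
```

Each sampling point touches up to four neighbors whose addresses are only known at run time, which is the scattered access pattern the abstract identifies as hard to optimize on accelerators.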

Taxonomy

Topics: Advanced Neural Network Applications · Computer Graphics and Visualization Techniques

Methods: Softmax · Attention Is All You Need