Parallel Scan on Ascend AI Accelerators

Bartłomiej Wróblewski; Gioele Gottardo; Anastasios Zouzias

arXiv:2505.15112 · cs.DC · January 5, 2026

TL;DR

This paper develops and evaluates parallel prefix sum algorithms optimized for Ascend AI accelerators, leveraging their specialized units to significantly improve performance of AI workload operators.

Contribution

It introduces novel scan algorithms that extensively utilize matrix multiplications on Ascend accelerators, achieving substantial speedups over vector-only implementations.

Findings

1. Single-core speedups of 5x to 9.6x over vector-only implementations for sufficiently large inputs.

2. Multi-core algorithm reaches 74.9% of memory bandwidth.

3. Radix sort with matrix multiplications achieves 3.3x speedup.
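
Radix sort relies on scan because each pass is a stable split of the keys on one bit, and the destination of every key follows from an exclusive prefix sum over the bit flags. The sketch below illustrates that dependency in plain Python. It is not the authors' implementation: the function names `exclusive_scan`, `split_by_bit`, and `radix_sort` are hypothetical, the scan is a sequential loop rather than a cube-unit kernel, and keys are assumed to be non-negative integers.

```python
def exclusive_scan(values):
    """Return the exclusive prefix sum: out[i] = sum(values[:i])."""
    out, running = [], 0
    for v in values:
        out.append(running)
        running += v
    return out


def split_by_bit(keys, bit):
    """Stable split: keys with the given bit clear come first, then keys with it set."""
    flags = [(k >> bit) & 1 for k in keys]
    zeros = [1 - f for f in flags]
    zero_pos = exclusive_scan(zeros)   # destination index among the bit-clear keys
    one_pos = exclusive_scan(flags)    # destination index among the bit-set keys
    num_zeros = sum(zeros)
    out = [0] * len(keys)
    for i, k in enumerate(keys):
        dest = num_zeros + one_pos[i] if flags[i] else zero_pos[i]
        out[dest] = k
    return out


def radix_sort(keys, bits):
    """Least-significant-bit-first radix sort for non-negative ints below 2**bits."""
    for bit in range(bits):
        keys = split_by_bit(keys, bit)
    return keys
```

For example, `radix_sort([5, 3, 7, 0, 2, 6, 1, 4], 3)` runs three split passes. On an accelerator, the two scans in each pass are the part that could be mapped to a matrix-based scan kernel.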

Abstract

We design and implement parallel prefix sum (scan) algorithms using Ascend AI accelerators. Ascend accelerators feature specialized computing units: the cube units for efficient matrix multiplication and the vector units for optimized vector operations. A key feature of the proposed scan algorithms is their extensive use of matrix multiplications and accumulations enabled by the cube unit. To showcase the effectiveness of these algorithms, we also implement and evaluate several scan-based operators commonly used in AI workloads, including sorting, tensor masking, and top-$k$ / top-$p$ sampling. Our single-core results demonstrate substantial performance improvements, with speedups ranging from $5\times$ to $9.6\times$ compared to vector-only implementations for sufficiently large input lengths. Additionally, we present a multi-core scan algorithm that fully utilizes both the cube and…
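
To make the idea of a scan built from matrix multiplications concrete, here is a minimal sketch of one well-known formulation, which may differ from the algorithms in the paper. An input of length $s^2$ is viewed as an $s \times s$ row-major matrix $A$. Multiplying by an upper-triangular all-ones matrix $U$ gives the prefix sums within each row, and the product $L A \mathbf{1}$, with $L$ strictly lower-triangular all-ones, gives the total of all preceding rows. The helper names `matmul` and `scan_via_matmul` and the tile size `s` are assumptions for illustration. The pure-Python `matmul` stands in for a hardware matrix unit.

```python
def matmul(a, b):
    """Plain matrix product of two lists of lists (stand-in for a matrix engine)."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]


def scan_via_matmul(x, s):
    """Inclusive prefix sum of x, where len(x) == s * s, using only matrix products."""
    assert len(x) == s * s
    # View x as an s x s row-major matrix A.
    a = [x[i * s:(i + 1) * s] for i in range(s)]
    # U is upper-triangular all-ones: (A @ U)[i][j] = a[i][0] + ... + a[i][j].
    upper = [[1 if k <= j else 0 for j in range(s)] for k in range(s)]
    # L is strictly lower-triangular all-ones: (L @ A)[i][j] sums column j over rows < i.
    strict_lower = [[1 if k < i else 0 for k in range(s)] for i in range(s)]
    ones = [[1] * s for _ in range(s)]
    row_scans = matmul(a, upper)
    # Every entry in row i of (L @ A @ 1) equals the sum of all elements in rows < i.
    offsets = matmul(matmul(strict_lower, a), ones)
    return [row_scans[i][j] + offsets[i][j] for i in range(s) for j in range(s)]
```

Calling `scan_via_matmul(list(range(1, 17)), 4)` returns the running totals of 1 through 16, ending in 136. The point of this formulation is that all arithmetic is expressed as dense matrix products with constant 0/1 matrices, which is the kind of work a matrix unit is built for.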

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Evolutionary Algorithms and Applications