Parallel Scan on Ascend AI Accelerators
Bartłomiej Wróblewski, Gioele Gottardo, Anastasios Zouzias

TL;DR
This paper develops and evaluates parallel prefix sum algorithms optimized for Ascend AI accelerators, leveraging their specialized units to significantly improve performance of AI workload operators.
Contribution
It introduces novel scan algorithms that extensively utilize matrix multiplications on Ascend accelerators, achieving substantial speedups over vector-only implementations.
Findings
Single-core speedups of 5x to 9.6x for large inputs.
Multi-core algorithm reaches 74.9% of memory bandwidth.
Radix sort with matrix multiplications achieves 3.3x speedup.
Abstract
We design and implement parallel prefix sum (scan) algorithms using Ascend AI accelerators. Ascend accelerators feature specialized computing units: the cube units for efficient matrix multiplication and the vector units for optimized vector operations. A key feature of the proposed scan algorithms is their extensive use of matrix multiplications and accumulations enabled by the cube unit. To showcase the effectiveness of these algorithms, we also implement and evaluate several scan-based operators commonly used in AI workloads, including sorting, tensor masking, and top-k / top-p sampling. Our single-core results demonstrate substantial performance improvements, with speedups ranging from 5x to 9.6x compared to vector-only implementations for sufficiently large input lengths. Additionally, we present a multi-core scan algorithm that fully utilizes both the cube and…
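To illustrate how a scan can be phrased as a matrix multiplication, here is a minimal sketch in plain Python. It is not the paper's algorithm and does not use any Ascend API. It relies on the standard identity that multiplying a row vector by an upper-triangular all-ones matrix yields that row's inclusive prefix sum. The names matmul and matmul_scan and the tile width s are hypothetical choices made for this illustration, and the per-row offset loop stands in for whatever propagation step a real implementation would use.

```python
def matmul(a, b):
    """Plain dense matrix product of two lists of lists (stand-in for a matrix engine)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def matmul_scan(x, s=4):
    """Inclusive prefix sum of x, computed tile by tile via a matrix product.

    s is an assumed tile width chosen for illustration only.
    """
    n = len(x)
    # Upper-triangular all-ones matrix: upper[k][j] == 1 exactly when k <= j.
    upper = [[1 if k <= j else 0 for j in range(s)] for k in range(s)]
    # Zero-pad the input so it fills whole rows of length s.
    rows = -(-n // s)  # ceiling division
    padded = list(x) + [0] * (rows * s - n)
    tile = [padded[r * s:(r + 1) * s] for r in range(rows)]
    # Row r of (tile @ upper) is the inclusive scan of row r on its own.
    local = matmul(tile, upper)
    # Add the running total of all earlier rows to each local scan.
    out, offset = [], 0
    for row in local:
        out.extend(v + offset for v in row)
        offset += row[-1]  # last entry of a local scan is that row's sum
    return out[:n]


print(matmul_scan([3, 1, 4, 1, 5, 9, 2, 6], s=4))
# prints [3, 4, 8, 9, 14, 23, 25, 31]
```

In this sketch the matrix product does all the per-tile work, leaving only one addition per element for the offsets, which is the general reason a matrix engine can help with scans.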
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Evolutionary Algorithms and Applications
