# RandMScan: accelerating parallel scan via matrix computation and random-jump strategy

**Authors:** Shujun Peng, Xinhan Lin, Yu Zhang, Yuheng Xiao, Yang Hu

PMC · DOI: 10.1038/s41598-025-32283-5 · Scientific Reports · 2025-12-24

## TL;DR

RandMScan is a new parallel scan framework optimized for AI accelerators, offering faster processing and reduced communication overhead for long sequences.

## Contribution

RandMScan introduces a two-stage framework combining matrix computation and a random-jump strategy for efficient parallel scan on AI accelerators.

## Key findings

- RandMScan achieves over 80% speedup compared to existing matrix-based implementations for long sequences.
- The method reduces end-to-end latency by 15%–26% in downstream applications.
- The framework effectively mitigates communication overhead in AI accelerators.

## Abstract

Parallel scan is a fundamental primitive widely used in a broad range of workloads, including parallel sorting, graph algorithms, and sampling in large language model inference. Although GPU-optimized parallel scan algorithms have been extensively studied, their reliance on vector units makes them inefficient on modern AI accelerators, which typically lack such units but incorporate abundant processing element (PE) arrays tailored for matrix operations. Existing matrix-based approaches, however, suffer from excessive bandwidth consumption due to auxiliary matrix loading as well as costly inter-block communication, thereby limiting their scalability in practical deployments. In this paper, we propose RandMScan, a two-stage parallel scan framework specifically designed for AI accelerators equipped with PE arrays, to address the aforementioned challenges. The first stage employs an efficient matrix-based local chunk scan algorithm that fully exploits fine-grained parallelism within PE arrays, while the second stage introduces a lightweight Random-Jump strategy to coordinate global aggregation with reduced synchronization overhead. Together, these techniques enable scalable execution over long sequences while effectively mitigating the substantial communication overhead that typically arises in prior solutions. Extensive evaluations on state-of-the-art AI accelerators demonstrate that our method achieves more than 80% speedup compared to existing matrix-based implementations for long input sequences, and can reduce the end-to-end latency by 15%–26% in representative downstream applications of scan.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12820164/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12820164/full.md

## References

23 references — full list in the complete paper: https://tomesphere.com/paper/PMC12820164/full.md

---
Source: https://tomesphere.com/paper/PMC12820164