A Parallel Scan Algorithm in the Tensor Core Unit Model
Anastasios Zouzias, William F. McColl

TL;DR
This paper introduces a parallel scan algorithm tailored for the Tensor Core Unit model, leveraging matrix multiplication as a fundamental operation to optimize prefix sum computations.
Contribution
It proposes a novel parallel scan algorithm within the TCU model, analyzing its depth and runtime based on matrix multiplication operations.
Findings
For inputs of size n, the algorithm has depth at most 2*log_s(n)
Runs in O(n(1 + l/s^2)/p + (s^2 + l) log_s(n)) time with p tensor core units
Performs O(n/s^2) multiplications of square matrices of size s
Abstract
We present a parallel scan (prefix sum) algorithm in the Tensor Core Unit (TCU) model of computation. The TCU model assumes that multiplication between two square matrices of constant size s is a basic operation. In the (s^2, l)-TCU model, we show that for inputs of size n, the algorithm has depth at most 2*log_s(n) and runs in O(n(1 + l/s^2)/p + (s^2 + l) log_s(n)) time assuming p tensor core units. Equivalently, the algorithm performs O(n/s^2) multiplications of square matrices of size s.
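To make the idea of scanning with matrix products concrete, here is a minimal sketch of a prefix sum over a single block of s*s numbers, computed only with s x s matrix multiplications. It uses the well-known triangular-matrix formulation and is not claimed to be the paper's algorithm, which handles inputs of arbitrary size n. The names `matmul` and `block_scan` are hypothetical, and the pure-Python `matmul` merely stands in for one tensor core call.

```python
def matmul(a, b):
    """Product of two square matrices; a stand-in for one tensor core call."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def block_scan(x, s):
    """Inclusive prefix sum of exactly s*s numbers via s x s matrix products."""
    assert len(x) == s * s
    # View the input as a row-major s x s matrix A.
    a = [x[i * s:(i + 1) * s] for i in range(s)]
    # U has ones on and above the diagonal, so (A U)[i][j] = sum of a[i][0..j].
    upper = [[1 if i <= j else 0 for j in range(s)] for i in range(s)]
    # Strictly lower-triangular ones: sums quantities over all earlier rows.
    strict_lower = [[1 if j < i else 0 for j in range(s)] for i in range(s)]
    ones = [[1] * s for _ in range(s)]

    row_scans = matmul(a, upper)            # prefix sums within each row
    row_totals = matmul(a, ones)            # every entry of row i = total of row i
    offsets = matmul(strict_lower, row_totals)  # total of all rows before row i
    return [row_scans[i][j] + offsets[i][j]
            for i in range(s) for j in range(s)]


print(block_scan(list(range(1, 10)), 3))  # [1, 3, 6, 10, 15, 21, 28, 36, 45]
```

This sketch spends three matrix products on one block of s^2 elements. Larger inputs would need the block totals combined across blocks, which is where a log_s(n)-type depth arises.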
