A Parallel Scan Algorithm in the Tensor Core Unit Model

Anastasios Zouzias; William F. McColl

arXiv:2411.17887·cs.DC·November 28, 2024

TL;DR

This paper introduces a parallel scan algorithm tailored for the Tensor Core Unit model, leveraging matrix multiplication as a fundamental operation to optimize prefix sum computations.

Contribution

The paper proposes a parallel scan algorithm in the TCU model and bounds its depth, running time, and number of matrix multiplications.

Findings

01

Algorithm has depth at most 2⌊log_s(n)⌋ for inputs of size n

02

Runs in O(n(1 + ℓ/s^2)/p + (s^2 + ℓ) log_s(n)) time with p tensor core units

03

Performs O(n/s^2) multiplications of square matrices of size s

Abstract

We present a parallel scan (prefix sum) algorithm in the Tensor Core Unit (TCU) model of computation. The TCU model assumes that multiplication between two square matrices of constant size $s$ is a basic operation. In the $(s^{2}, \ell)$-TCU model, we show that for inputs of size $n$, the algorithm has depth at most $2 \lfloor \log_{s}(n) \rfloor$ and runs in $O(n (1 + \ell / s^{2}) / p + (s^{2} + \ell) \log_{s}(n))$ time assuming $p$ tensor core units. Equivalently, the algorithm performs $O(n / s^{2})$ multiplications of square matrices of size $s$.
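To make the idea of a scan built on matrix multiplication concrete, here is a minimal sequential sketch in Python. It is an illustration under stated assumptions, not the algorithm as specified in the paper: the names `matmul` and `tcu_scan` are hypothetical, `matmul` merely stands in for a single tensor core call on $s \times s$ matrices, and the block offsets are added with ordinary scalar additions. The sketch relies on the fact that multiplying an $s \times s$ tile on the right by the upper-triangular all-ones matrix yields the prefix sums of each row of the tile.

```python
from itertools import accumulate


def matmul(a, b):
    """Hypothetical stand-in for one tensor core call: product of two s x s matrices."""
    s = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(s)) for j in range(s)]
            for i in range(s)]


def tcu_scan(x, s):
    """Inclusive prefix sum of x, with local scans done as s x s matrix products."""
    assert s >= 2, "the recursion only shrinks the input when s >= 2"
    n = len(x)
    if n == 0:
        return []
    # upper[k][j] = 1 when k <= j, so (A @ upper)[i][j] = A[i][0] + ... + A[i][j].
    upper = [[1 if k <= j else 0 for j in range(s)] for k in range(s)]
    if n <= s:
        # Base case: place x in the first row of a zero-padded tile.
        tile = [list(x) + [0] * (s - n)] + [[0] * s for _ in range(s - 1)]
        return matmul(tile, upper)[0][:n]
    # Pad to a multiple of s and cut the input into rows of length s.
    padded = list(x) + [0] * (-n % s)
    rows = [padded[i:i + s] for i in range(0, len(padded), s)]
    # Local scans: every group of s rows forms one tile and costs one matmul.
    local = []
    for t in range(0, len(rows), s):
        tile = rows[t:t + s]
        real = len(tile)
        tile += [[0] * s for _ in range(s - real)]
        local.extend(matmul(tile, upper)[:real])
    # Recursively scan the row totals, then shift each row by the sum before it.
    totals = [row[-1] for row in local]
    offsets = [0] + tcu_scan(totals, s)[:-1]
    out = [v + off for row, off in zip(local, offsets) for v in row]
    return out[:n]


if __name__ == "__main__":
    data = list(range(1, 11))
    assert tcu_scan(data, 2) == list(accumulate(data))
    print(tcu_scan(data, 2))
```

In this sketch the top level issues about $n / s^{2}$ matrix products and each recursive level shrinks the input by a factor of $s$, so the total number of products is a geometric sum dominated by its first term, and the recursion has roughly $\log_{s}(n)$ levels.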
