Stream-K Optimization and Exploration
Nick Rackley, Bryan Gonzalez, Casey Morrison

TL;DR
This paper investigates optimization strategies for the Stream-K matrix multiplication algorithm, focusing on padding effects, block size tuning, and runtime prediction improvements, with some performance gains but unresolved issues.
Contribution
It introduces specific optimization techniques for Stream-K, examines padding impacts, and explores Block2Time for better runtime prediction and load balancing.
Findings
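Setting padding to zero for the M, N, and K dimensions improves performance by 0.6% on average
Adjusting the block size and parameters caused the process to hang, so further tuning is needed
Block2Time shows promise for improving runtime prediction and load balancing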
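To make the padding finding more concrete, here is a minimal sketch of how padding can inflate GEMM work and how throughput follows from runtime. It assumes that padding means rounding each dimension up to a multiple of the block size, which the summary does not confirm. The helper names (`round_up`, `padding_overhead`, `gemm_tflops`) are hypothetical and do not come from the paper.

```python
def round_up(x, multiple):
    """Round x up to the nearest multiple (assumed padding rule)."""
    return -(-x // multiple) * multiple


def padding_overhead(m, n, k, blk_m, blk_n, blk_k):
    """Fraction of extra multiply-accumulate work introduced by padding.

    Assumption: each of M, N, K is padded up to a multiple of its block size.
    Returns 0.0 when every dimension is already a multiple of its block size.
    """
    padded_work = round_up(m, blk_m) * round_up(n, blk_n) * round_up(k, blk_k)
    return padded_work / (m * n * k) - 1.0


def gemm_tflops(m, n, k, seconds):
    """Throughput of an M x N x K GEMM, counting 2*M*N*K floating-point ops."""
    return 2 * m * n * k / seconds / 1e12
```

When the problem dimensions are already multiples of the block sizes, `padding_overhead` returns zero, which matches the intuition that padding can then be dropped without changing the result.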
Abstract
We explore optimization options for the Stream-K algorithm, a work-centric parallelization of matrix multiplication (GEMM). In our study, we investigated differences between the theoretical and practical implementations, particularly noting the impact of padding. Our debugging efforts revealed a persistent bug related to block mapping, which we could not fully resolve, but we managed to implement some optimizations. Setting the padding to zero for the M, N, and K dimensions resulted in an average 0.6% improvement in performance, achieving 1.44 ms, 89.37 TFlops, and 66.91 GB/s. However, adjusting the block size and parameters led to the process getting stuck, indicating a need for further tuning. Additionally, exploring the potential of Block2Time highlighted its promise in enhancing runtime predictions and optimizing load balancing.
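The work-centric idea behind Stream-K can be sketched as follows: rather than assigning whole output tiles to workers, the total number of multiply-accumulate iterations is split as evenly as possible, so one worker's range may cross tile boundaries. This is a simplified host-side illustration, not the authors' kernel. The function names and the flat iteration numbering are assumptions made for the example.

```python
from math import ceil


def stream_k_partition(m, n, k, blk_m, blk_n, blk_k, num_workers):
    """Split all MAC iterations of a tiled GEMM evenly across workers.

    Returns (ranges, iters_per_tile), where ranges[w] is the half-open
    interval [start, end) of flat iteration indices owned by worker w.
    """
    tiles = ceil(m / blk_m) * ceil(n / blk_n)
    iters_per_tile = ceil(k / blk_k)
    total_iters = tiles * iters_per_tile

    # The first `extra` workers take one additional iteration.
    base, extra = divmod(total_iters, num_workers)
    ranges = []
    start = 0
    for w in range(num_workers):
        count = base + (1 if w < extra else 0)
        ranges.append((start, start + count))
        start += count
    return ranges, iters_per_tile


def tiles_touched(iter_range, iters_per_tile):
    """Output tiles that a worker's iteration range contributes to."""
    start, end = iter_range
    if start == end:
        return []
    first_tile = start // iters_per_tile
    last_tile = (end - 1) // iters_per_tile
    return list(range(first_tile, last_tile + 1))
```

For a 384 x 384 x 128 problem with 128 x 128 x 32 blocks and 4 workers, each worker receives 9 of the 36 iterations, and neighbouring workers share a tile at their boundary. Those shared tiles are where partial results have to be combined, which is why the mapping from blocks to iterations is a delicate part of any implementation.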
