Can Asymmetric Tile Buffering Be Beneficial?
Chengyue Wang, Wesley Pang, Xinrui Wu, Gregory Jun, Luis Romero, Endri Taka, Diana Marculescu, Tony Nowatzki, Pranathi Vasireddy, Joseph Melber, Deming Chen, Jason Cong

TL;DR
This paper introduces asymmetric tile buffering (ATB) for GEMM operations, decoupling input and output tile sizes, leading to significant performance improvements in AI workloads, exemplified by a 4.54x speedup on AMD's XDNA2 AIE.
Contribution
The paper presents the novel concept of ATB, demonstrating its practicality and benefits for GEMM performance optimization in AI hardware.
Findings
ATB achieves up to 4.54x speedup on AMD XDNA2 AIE.
A performance model guides the selection of effective ATB tiling factors.
ATB sets a new performance record for XDNA2 AIE GEMM.
Abstract
General matrix multiplication (GEMM) is the computational backbone of modern AI workloads, and its efficiency is critically dependent on effective tiling strategies. Conventional approaches employ symmetric tile buffering, where the buffered tile size of the input A along the dimension M matches the output tile size of C. In this paper, we introduce asymmetric tile buffering (ATB), a simple but powerful technique that decouples the buffered tile dimensions of the input and output operands. We show, for the first time, that ATB is both practical and highly beneficial. To explain this effect, we develop a performance model that incorporates both the benefits of ATB (higher arithmetic intensity) and its overheads (higher kernel switching costs), providing insight into how to select effective ATB tiling factors. As a case study, we apply ATB to AMD's latest XDNA2 AI Engine (AIE),…
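The trade-off described in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's performance model. It assumes a simple local-memory footprint (one A tile, one B tile, one C tile, counted in elements), a traffic count per output tile, and a kernel-call count as a stand-in for switching overhead. All function names and tile sizes are hypothetical.

```python
# Toy illustration only: NOT the paper's model. All names and sizes are assumptions.

def tile_footprint(m_a, m_c, n, k):
    """Elements held in local memory: A tile (m_a x k), B tile (k x n), C tile (m_c x n)."""
    return m_a * k + k * n + m_c * n


def arithmetic_intensity(m_c, n, K):
    """FLOPs per element moved while producing one m_c x n output tile over the full K."""
    flops = 2 * m_c * n * K                   # one multiply and one add per MAC
    traffic = m_c * K + n * K + m_c * n       # read A rows, read B columns, write C
    return flops / traffic


def kernel_calls(m_a, m_c, k, K):
    """Kernel invocations per output tile (proxy for switching overhead)."""
    return (m_c // m_a) * (K // k)


K = 1024
# Symmetric buffering: the A tile height equals the C tile height.
sym = dict(m_a=32, m_c=32, n=32, k=32)
# Asymmetric buffering: a shorter A tile frees space for a taller C tile.
asym = dict(m_a=16, m_c=48, n=32, k=32)

for name, t in (("symmetric", sym), ("asymmetric", asym)):
    print(
        name,
        "footprint:", tile_footprint(**t),
        "intensity:", round(arithmetic_intensity(t["m_c"], t["n"], K), 2),
        "kernel calls:", kernel_calls(t["m_a"], t["m_c"], t["k"], K),
    )
```

Under these assumed sizes both configurations occupy the same buffer space, while the asymmetric one reaches a higher arithmetic intensity at the cost of more kernel invocations per output tile. This is the qualitative tension a tiling-factor search would have to balance.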
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Interconnection Networks and Systems · Stochastic Gradient Optimization Techniques
