Design in Tiles: Automating GEMM Deployment on Tile-Based Many-PE Accelerators
Aofeng Shen, Chi Zhang, Yakup Budanaz, Alexandru Calotoiu, Torsten Hoefler, Luca Benini

TL;DR
This paper introduces 'Design in Tiles', an automated framework that simplifies deploying GEMM on tile-based many-PE accelerators, achieving higher utilization and speedup compared to expert-tuned libraries.
Contribution
It presents a novel automated deployment framework for tile-based accelerators, bridging hardware design and software mapping for efficient GEMM execution.
Findings
Achieves 1.2-2.0x speedup over NVIDIA GH200 GEMM libraries.
Higher PE utilization than expert-tuned libraries.
Supports large acceleration configurations (e.g., 32x32 tiles, 1979 TFLOPS@FP8).
Abstract
Tile-based many-Processing Element (PE) accelerators can achieve competitive performance on General Matrix Multiplication (GEMM), but they are extremely hard to program: their optimal software mapping is deeply coupled with the hardware design, which makes manual deployment unwieldy. We propose "Design in Tiles (DiT)", an automated framework that connects a deployment toolchain with a configurable executable model for these accelerators. For evaluation, we apply our framework to GEMM targeting a large acceleration configuration (e.g., 32x32 tiles, 1979 TFLOPS@FP8, 4 TB/s bandwidth) comparable to an NVIDIA GH200. We achieve higher PE utilization than the GH200 with its expert-tuned GEMM libraries, yielding a 1.2-2.0x speedup across diverse matrix shapes.
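To put the quoted configuration in perspective, the following is a minimal back-of-the-envelope sketch, not the paper's method. It takes the three figures from the abstract (32x32 tiles, 1979 TFLOPS at FP8, 4 TB/s) and applies a simple roofline bound to a GEMM. The function name `gemm_roofline` and the memory-traffic model (each of A, B, and C crosses main memory exactly once, one byte per element) are assumptions made for illustration only.

```python
# Figures quoted in the abstract.
TILES_X, TILES_Y = 32, 32
PEAK_FLOPS = 1979e12   # FP8 peak throughput, FLOP/s
MEM_BW = 4e12          # main-memory bandwidth, bytes/s


def gemm_roofline(m, n, k, bytes_per_elem=1):
    """Ideal roofline bound for C[m, n] = A[m, k] @ B[k, n].

    Assumption (not from the paper): A, B, and C are each moved to or
    from main memory exactly once, at bytes_per_elem bytes per element.
    Returns (arithmetic intensity in FLOP/byte, attainable FLOP/s).
    """
    flops = 2 * m * n * k                              # one multiply + one add per MAC
    traffic = bytes_per_elem * (m * k + k * n + m * n)  # bytes moved under the assumption
    intensity = flops / traffic
    attainable = min(PEAK_FLOPS, intensity * MEM_BW)
    return intensity, attainable


# Average peak throughput per tile if the total is split evenly.
per_tile_peak = PEAK_FLOPS / (TILES_X * TILES_Y)

# Ridge point: the intensity above which a kernel is compute-bound.
ridge = PEAK_FLOPS / MEM_BW

square = gemm_roofline(4096, 4096, 4096)   # large square GEMM
skinny = gemm_roofline(4096, 4096, 64)     # small inner dimension

print(f"per-tile peak : {per_tile_peak / 1e12:.2f} TFLOPS")
print(f"ridge point   : {ridge:.2f} FLOP/byte")
print(f"square GEMM   : {square[0]:.1f} FLOP/byte, bound {square[1] / 1e12:.0f} TFLOPS")
print(f"skinny GEMM   : {skinny[0]:.1f} FLOP/byte, bound {skinny[1] / 1e12:.0f} TFLOPS")
```

Under these assumptions, a GEMM needs roughly 495 FLOP per byte of main-memory traffic before it can reach the compute peak, so shapes with a small inner dimension are bandwidth-limited even in the ideal case. Real mappings also depend on on-chip data movement between tiles, which this sketch ignores.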
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Interconnection Networks and Systems · Network Packet Processing and Optimization
