Design in Tiles: Automating GEMM Deployment on Tile-Based Many-PE Accelerators

Aofeng Shen; Chi Zhang; Yakup Budanaz; Alexandru Calotoiu; Torsten Hoefler; Luca Benini

arXiv:2512.13638·cs.DC·December 16, 2025

TL;DR

This paper introduces "Design in Tiles" (DiT), an automated framework for deploying GEMM on tile-based many-PE accelerators. On a configuration comparable to an NVIDIA GH200, it achieves higher PE utilization and a 1.2-2.0x speedup over the GH200's expert-tuned GEMM libraries.

Contribution

It presents a novel automated deployment framework for tile-based accelerators, bridging hardware design and software mapping for efficient GEMM execution.

Findings

01. Achieves a 1.2-2.0x speedup over NVIDIA GH200 GEMM libraries across diverse matrix shapes.

02. Achieves higher PE utilization than the GH200 with its expert-tuned GEMM libraries.

03. Supports large acceleration configurations (e.g., 32x32 tiles, 1979 TFLOPS@FP8, 4 TB/s bandwidth).

Abstract

Tile-based many-Processing Element (PE) accelerators can achieve competitive performance on General Matrix Multiplication (GEMM), but they are extremely hard to program: their optimal software mapping is deeply coupled with the hardware design, which makes manual deployment unwieldy. We propose "Design in Tiles (DiT)", an automated framework connecting a deployment toolchain with a configurable executable model for these accelerators. For evaluation, we apply our framework to GEMM targeting a large acceleration configuration (e.g., 32x32 tiles, 1979 TFLOPS@FP8, 4 TB/s bandwidth) comparable to an NVIDIA GH200. We achieve higher PE utilization than the GH200 with its expert-tuned GEMM libraries, yielding a 1.2-2.0x speedup across diverse matrix shapes.
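To put the quoted configuration in perspective, the following is a rough back-of-the-envelope sketch, not anything taken from the paper. It uses only the three figures in the abstract (32x32 tiles, 1979 TFLOPS at FP8, 4 TB/s) and applies a simple roofline-style bound. The function names are hypothetical, and the traffic model is an assumption: every matrix element, including the output, is taken to be 1 byte and to cross the memory interface exactly once.

```python
# Figures quoted in the abstract.
PEAK_FLOPS = 1979e12   # 1979 TFLOPS at FP8
MEM_BW = 4e12          # 4 TB/s memory bandwidth
TILES = 32 * 32        # 32x32 tile grid


def per_tile_peak_tflops():
    """Peak throughput per tile, assuming the peak is split evenly across tiles."""
    return PEAK_FLOPS / TILES / 1e12


def machine_balance():
    """FLOPs per byte of memory traffic needed to keep the PEs busy."""
    return PEAK_FLOPS / MEM_BW


def gemm_intensity(m, n, k, bytes_per_elem=1):
    """Arithmetic intensity of C[m,n] = A[m,k] @ B[k,n].

    Assumption (not from the paper): A and B are read once, C is written
    once, and every element occupies bytes_per_elem bytes.
    """
    flops = 2 * m * n * k
    traffic = bytes_per_elem * (m * k + k * n + m * n)
    return flops / traffic


def roofline_bound_tflops(m, n, k):
    """Upper bound on throughput: the lower of the compute and memory roofs."""
    return min(PEAK_FLOPS, gemm_intensity(m, n, k) * MEM_BW) / 1e12


if __name__ == "__main__":
    print(f"per-tile peak: {per_tile_peak_tflops():.2f} TFLOPS")
    print(f"machine balance: {machine_balance():.2f} FLOP/byte")
    for shape in [(4096, 4096, 4096), (16, 4096, 4096)]:
        print(shape, f"bound: {roofline_bound_tflops(*shape):.0f} TFLOPS")
```

Under these assumptions a large square GEMM sits well above the machine balance and is limited by compute, while a skinny one (small m) is limited by memory bandwidth. This is one plausible reason why results are reported across diverse matrix shapes rather than for a single size.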

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Interconnection Networks and Systems · Network Packet Processing and Optimization