FalconGEMM: Surpassing Hardware Peaks with Lower-Complexity Matrix Multiplication
Honglin Zhu, Jiaping Cao, Jiang Shao, Siyuan Feng, Qian Qiu, Peng Chen, Xu Zhang, Yixian Zhou, Man Lung Yiu, Guang Ji, Minwen Deng, Wenxi Zhu, Jintao Meng

TL;DR
FalconGEMM is a cross-platform framework that automates the deployment and optimization of lower-complexity matrix multiplication algorithms, outperforming traditional GEMM libraries across diverse hardware on deep learning workloads.
Contribution
It introduces a comprehensive framework with deployment, execution, and decision modules to enable practical, peak-breaking matrix multiplication across heterogeneous hardware environments.
Findings
FalconGEMM outperforms GEMM libraries such as cuBLAS and CUTLASS by up to 17.85%.
It outperforms LCMA competitors such as AlphaTensor by up to 55.61%.
It delivers peak-breaking performance on both GPU (H20, A100) and CPU (ARM, x86) architectures.
Abstract
Peak-breaking matrix multiplication is a promising technique for improving deep learning (DL) performance, especially in LLM training and inference. We present FalconGEMM, a cross-platform framework that automates the deployment, optimization, and selection of Lower-Complexity Matrix Multiplication Algorithms (LCMAs) across diverse hardware. It has three key innovations: (1) a Deployment Module that enables portable execution across various hardware and input configurations through code generation; (2) an Execution Module with Group-Parallel Optimizations that maximizes on-chip data reuse, utilizes parallel resources, and reduces bandwidth overhead; and (3) a Decision Module featuring a lightweight analytical performance model that selects the optimal strategy based on matrix shapes and hardware profiles. Extensive evaluation is conducted on LLM workloads across GPU (H20, A100) and CPU (ARM, x86)…
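To make the term "lower-complexity matrix multiplication" concrete, here is a minimal sketch of one level of Strassen's algorithm, the classic LCMA, in plain Python. It is an illustration only and not FalconGEMM's implementation: the helper names (`matmul`, `split`, `strassen_one_level`) are hypothetical, and the sketch assumes square matrices with an even dimension. One level replaces the 8 block products of the standard method with 7, at the cost of extra block additions.

```python
def matmul(a, b):
    """Standard O(n^3) product of two list-of-lists matrices."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(r, s)] for r, s in zip(a, b)]


def sub(a, b):
    """Element-wise difference of two equally shaped matrices."""
    return [[x - y for x, y in zip(r, s)] for r, s in zip(a, b)]


def split(m):
    """Split a square matrix of even dimension into four quadrants."""
    h = len(m) // 2
    return ([r[:h] for r in m[:h]], [r[h:] for r in m[:h]],
            [r[:h] for r in m[h:]], [r[h:] for r in m[h:]])


def strassen_one_level(a, b):
    """One Strassen step: 7 block products instead of 8 (assumes even n)."""
    a11, a12, a21, a22 = split(a)
    b11, b12, b21, b22 = split(b)

    # The seven Strassen products; each uses the standard kernel on half-size blocks.
    m1 = matmul(add(a11, a22), add(b11, b22))
    m2 = matmul(add(a21, a22), b11)
    m3 = matmul(a11, sub(b12, b22))
    m4 = matmul(a22, sub(b21, b11))
    m5 = matmul(add(a11, a12), b22)
    m6 = matmul(sub(a21, a11), add(b11, b12))
    m7 = matmul(sub(a12, a22), add(b21, b22))

    # Recombine the products into the four output quadrants.
    c11 = add(sub(add(m1, m4), m5), m7)
    c12 = add(m3, m5)
    c21 = add(m2, m4)
    c22 = add(add(sub(m1, m2), m3), m6)

    # Stitch the quadrants back into a single matrix.
    top = [left + right for left, right in zip(c11, c12)]
    bottom = [left + right for left, right in zip(c21, c22)]
    return top + bottom


if __name__ == "__main__":
    a = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    b = [[2, 0, 1, 3], [1, 1, 0, 2], [4, 5, 6, 7], [0, 3, 2, 1]]
    print(strassen_one_level(a, b) == matmul(a, b))
```

The saved block product is paid for with 18 block additions and subtractions, which are memory-bound. Whether the trade is worthwhile therefore depends on matrix shape and hardware, which is the kind of choice the abstract assigns to the Decision Module.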
