FalconGEMM: Surpassing Hardware Peaks with Lower-Complexity Matrix Multiplication

Honglin Zhu; Jiaping Cao; Jiang Shao; Siyuan Feng; Qian Qiu; Peng Chen; Xu Zhang; Yixian Zhou; Man Lung Yiu; Guang Ji; Minwen Deng; Wenxi Zhu; Jintao Meng

arXiv:2605.06057·cs.DC·May 13, 2026

TL;DR

FalconGEMM is a cross-platform framework that automates the deployment, optimization, and selection of lower-complexity matrix multiplication algorithms, outperforming conventional GEMM libraries across diverse hardware on deep learning workloads.

Contribution

The paper introduces a framework built from deployment, execution, and decision modules that makes peak-breaking matrix multiplication practical across heterogeneous hardware.

Findings

01. Outperforms GEMM libraries such as cuBLAS and CUTLASS by up to 17.85%.

02. Outperforms competing lower-complexity matrix multiplication algorithm (LCMA) implementations such as AlphaTensor by up to 55.61%.

03. Delivers peak-breaking performance on a range of hardware architectures.

Abstract

Peak-breaking matrix multiplication is a promising technique for improving the performance of deep learning (DL), especially LLM training and inference. We present FalconGEMM, a cross-platform framework that automates the deployment, optimization, and selection of Lower-Complexity Matrix Multiplication Algorithms (LCMAs) across diverse hardware. It makes three key innovations: (1) a Deployment Module that uses code generation to enable portable execution across various hardware and input configurations; (2) an Execution Module with Group-Parallel Optimizations that maximizes on-chip data reuse, utilizes parallel resources, and reduces bandwidth overhead; and (3) a Decision Module featuring a lightweight analytical performance model that selects the optimal strategy based on matrix shapes and hardware profiles. Extensive evaluation is conducted on LLM workloads across GPUs (H20, A100) and CPUs (ARM, x86)…
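
To make the term "lower-complexity matrix multiplication algorithm" concrete, the sketch below applies one level of Strassen's algorithm, the classic example of an LCMA. It is an illustration only and is not taken from FalconGEMM: the function names (`matmul`, `strassen_one_level`, and the helpers) are hypothetical, it assumes square matrices of even size stored as lists of rows, and it uses plain Python rather than generated, hardware-tuned kernels.

```python
def matmul(a, b):
    """Reference product of two matrices given as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(inner)) for j in range(cols)]
            for row in a]


def add(x, y):
    """Element-wise sum of two equally sized matrices."""
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


def sub(x, y):
    """Element-wise difference of two equally sized matrices."""
    return [[p - q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


def split(m):
    """Split an even-sized square matrix into its four quadrants."""
    h = len(m) // 2
    return ([r[:h] for r in m[:h]], [r[h:] for r in m[:h]],
            [r[:h] for r in m[h:]], [r[h:] for r in m[h:]])


def strassen_one_level(a, b):
    """One level of Strassen: 7 block products instead of the usual 8."""
    a11, a12, a21, a22 = split(a)
    b11, b12, b21, b22 = split(b)

    # Seven block multiplications (each could itself call a tuned GEMM).
    m1 = matmul(add(a11, a22), add(b11, b22))
    m2 = matmul(add(a21, a22), b11)
    m3 = matmul(a11, sub(b12, b22))
    m4 = matmul(a22, sub(b21, b11))
    m5 = matmul(add(a11, a12), b22)
    m6 = matmul(sub(a21, a11), add(b11, b12))
    m7 = matmul(sub(a12, a22), add(b21, b22))

    # Recombine the products into the four quadrants of the result.
    c11 = add(sub(add(m1, m4), m5), m7)
    c12 = add(m3, m5)
    c21 = add(m2, m4)
    c22 = add(add(sub(m1, m2), m3), m6)

    top = [left + right for left, right in zip(c11, c12)]
    bottom = [left + right for left, right in zip(c21, c22)]
    return top + bottom
```

One level trades 8 block multiplications for 7, at the cost of 18 block additions and subtractions. Those extra additions add memory traffic, so whether the trade pays off depends on the matrix shape and on the ratio of compute throughput to bandwidth on the target hardware.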
