FlashFuser: Expanding the Scale of Kernel Fusion for Compute-Intensive Operators via Inter-Core Connection
Ziyu Huang, Yangjie Zhou, Zihan Liu, Xinhao Luo, Yijia Diao, Minyi Guo, Jidong Zhai, Yu Feng, Chen Zhang, Anbang Wu, Jingwen Leng

TL;DR
FlashFuser is a novel compiler framework that leverages inter-core GPU connections to expand kernel fusion capabilities, significantly reducing memory access and boosting performance for compute-intensive deep learning workloads.
Contribution
It introduces a DSM-based communication abstraction, a dataflow analyzer for distributed memory, and a unified search engine for optimal kernel fusion planning on modern GPUs.
Findings
Reduces memory access by 58% on NVIDIA H100
Achieves 3.3x speedup over tuned libraries
Attains 4.1x speedup over state-of-the-art compilers
Abstract
The scaling of computation throughput continues to outpace improvements in memory bandwidth, making many deep learning workloads memory-bound. Kernel fusion is a key technique to alleviate this problem, but the fusion strategies of existing compilers and frameworks are limited to using local scratchpad memory. When the intermediate results exceed the limited capacity (such as FFN), the fusion fails. Although modern GPUs (like the NVIDIA H100) now incorporate an inter-core connection mechanism known as Distributed Shared Memory(DSM)--providing a larger, high-bandwidth, and low-latency on-chip memory pool--this hardware potential has yet to be exploited by software frameworks. To bridge this gap, we present FlashFuser, the first compiler framework to utilize inter-core connection for kernel fusion on modern GPUs. FlashFuser extends established fusion techniques to the DSM domain through…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsParallel Computing and Optimization Techniques · Cloud Computing and Resource Management · Advanced Data Storage Technologies
