MCFuser: High-Performance and Rapid Fusion of Memory-Bound Compute-Intensive Operators
Zheng Zhang, Donglin Yang, Xiaobo Zhou, Dazhao Cheng

TL;DR
MCFuser is a framework that efficiently generates high-performance fused kernels for memory-bound compute-intensive operators, significantly improving GPU performance and reducing tuning time.
Contribution
MCFuser combines high-level tiling expressions, DAG analysis, and analytical modeling to optimize the fusion of compute-intensive operators, addressing the limited search spaces, redundant memory access, and prolonged tuning time that hinder existing approaches.
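To make the motivation for fusion concrete, the following is a minimal sketch of how much global-memory traffic fusing a two-operator chain could save in the ideal case. It is not MCFuser's analytical model. The chain shape (`C = A @ B`, then `E = C @ D`), the function name `chain_traffic_bytes`, and the assumption that each tensor is moved exactly once and that the intermediate `C` stays entirely on chip when fused are all illustrative.

```python
def chain_traffic_bytes(m, k, n, l, bytes_per_elem=2):
    """Idealized global-memory traffic for C = A @ B followed by E = C @ D.

    Shapes (hypothetical): A is m x k, B is k x n, C is m x n,
    D is n x l, E is m x l. Every tensor is assumed to be read or
    written exactly once; real kernels may reload tiles.
    Returns (unfused_bytes, fused_bytes).
    """
    a, b, c, d, e = m * k, k * n, m * n, n * l, m * l
    # Unfused: kernel 1 reads A, B and writes C; kernel 2 reads C, D and writes E.
    unfused = (a + b + c) + (c + d + e)
    # Fused (ideal): C never leaves on-chip memory, so only A, B, D, E move.
    fused = a + b + d + e
    return unfused * bytes_per_elem, fused * bytes_per_elem


if __name__ == "__main__":
    unfused, fused = chain_traffic_bytes(m=1024, k=64, n=1024, l=64)
    print(f"unfused: {unfused} bytes, fused: {fused} bytes, "
          f"ratio: {unfused / fused:.1f}x")
```

When the intermediate tensor dominates the footprint, as it does for the shapes used here, avoiding its round trip to global memory removes most of the traffic. A real fused kernel would likely save less, since tiling can force some operands to be loaded more than once.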
Findings
Achieves up to 5.9x speedup over leading compilers.
Reduces tuning time by more than 70x.
Demonstrates superior performance on NVIDIA GPUs.
Abstract
Operator fusion, a key technique to improve data locality and alleviate GPU memory bandwidth pressure, often fails to extend to the fusion of multiple compute-intensive operators due to saturated computation throughput. However, the dynamicity of tensor dimension sizes could potentially lead to these operators becoming memory-bound, necessitating the generation of fused kernels, a task hindered by limited search spaces for fusion strategies, redundant memory access, and prolonged tuning time, leading to sub-optimal performance and inefficient deployment. We introduce MCFuser, a pioneering framework designed to overcome these obstacles by generating high-performance fused kernels for what we define as memory-bound compute-intensive (MBCI) operator chains. Leveraging high-level tiling expressions to delineate a comprehensive search space, coupled with Directed Acyclic Graph (DAG)…
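The claim that tensor dimension sizes can turn a compute-intensive operator into a memory-bound one can be illustrated with a simple roofline-style check. This is a sketch under stated assumptions, not the paper's method: the helper names are hypothetical, each matrix is assumed to cross the memory bus exactly once, and the peak throughput and bandwidth figures are illustrative values in the range of a recent data-center GPU, not numbers taken from the paper.

```python
def gemm_arithmetic_intensity(m, n, k, bytes_per_elem=2):
    """FLOPs per byte for an (m x k) @ (k x n) matrix multiplication.

    Assumes each of the three matrices moves through global memory once.
    """
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def is_memory_bound(m, n, k, peak_flops, peak_bandwidth, bytes_per_elem=2):
    """True when the GEMM's intensity falls below the roofline ridge point."""
    ridge = peak_flops / peak_bandwidth  # FLOPs per byte needed to saturate compute
    return gemm_arithmetic_intensity(m, n, k, bytes_per_elem) < ridge


# Illustrative hardware figures (assumptions, not from the paper).
PEAK_FLOPS = 312e12      # FP16 FLOP/s
PEAK_BANDWIDTH = 1.555e12  # bytes/s

if __name__ == "__main__":
    for shape in [(4096, 4096, 4096), (4096, 4096, 64)]:
        bound = "memory" if is_memory_bound(*shape, PEAK_FLOPS, PEAK_BANDWIDTH) else "compute"
        print(shape, f"{gemm_arithmetic_intensity(*shape):.1f} FLOP/byte ->", f"{bound}-bound")
```

Under these assumptions, the large square GEMM sits well above the ridge point, while the same operator with a small reduction dimension falls below it. That second case is the regime the abstract calls memory-bound compute-intensive.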
