RedFuser: An Automatic Operator Fusion Framework for Cascaded Reductions on AI Accelerators
Xinsheng Tang, Yangcheng Li, Nan Wang, Zhiyi Shu, Xingyu Ling, Junna Xing, Peng Zhou, Qiang Liu

TL;DR
RedFuser is a framework that automatically fuses cascaded reduction operations in AI models, significantly improving execution efficiency and surpassing existing compiler performance.
Contribution
It introduces a formal analysis methodology and an automated framework for general fusion of cascaded reductions in AI compilers.
Findings
Achieves 2x to 5x speedup over state-of-the-art compilers.
Successfully fuses diverse cascaded reduction workloads.
Matches the performance of hand-optimized kernels.
Abstract
Operator fusion, as a key performance optimization technique in the deployment of AI models, significantly improves execution efficiency and has been widely adopted in modern AI compilers. However, for cascaded reduction operations involving multiple loops with inter-loop data dependencies, such as the safe softmax followed by GEMM within attention mechanisms, existing compilers lack effective automated fusion and kernel generation capabilities. Although some works have addressed specific instances through hand-crafted fusion strategies, their solutions are limited in generality and difficult to extend to other similar structures. Given the prevalence of such computational patterns in deep learning models, there remains significant untapped potential in achieving general and automated fusion optimization. In this paper, we present a formal theoretical methodology for analyzing…
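To make the "cascaded reduction" pattern from the abstract concrete, the sketch below contrasts an unfused safe softmax (a max reduction, then a sum reduction that depends on that max) with a single-loop version that rescales partial results as the running max changes. This is the widely known online-softmax rescaling trick, shown only as an illustration of the kind of dependency involved; it is not RedFuser's algorithm or generated code. The function names are hypothetical, inputs are assumed to be finite floats, and a dot product with a value vector stands in for the GEMM.

```python
import math

def safe_softmax_unfused(xs):
    """Baseline: separate loops, each depending on the previous loop's result."""
    m = max(xs)                                # reduction 1: max
    d = sum(math.exp(x - m) for x in xs)       # reduction 2: sum, needs m
    return [math.exp(x - m) / d for x in xs]   # elementwise, needs m and d

def safe_softmax_fused(xs):
    """Max and sum computed in one loop by rescaling the partial sum."""
    m = float("-inf")
    d = 0.0
    for x in xs:
        m_new = max(m, x)
        # Re-express the partial sum relative to the new max, then add the new term.
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / d for x in xs]

def softmax_dot_fused(xs, vs):
    """Softmax followed by a dot product with vs, all in a single loop.

    The dot product stands in for the GEMM that follows softmax in attention.
    """
    m = float("-inf")
    d = 0.0    # running softmax denominator
    acc = 0.0  # running unnormalized weighted sum of vs
    for x, v in zip(xs, vs):
        m_new = max(m, x)
        scale = math.exp(m - m_new)  # correction factor for earlier partial results
        p = math.exp(x - m_new)
        d = d * scale + p
        acc = acc * scale + p * v
        m = m_new
    return acc / d

if __name__ == "__main__":
    scores = [1.0, 3.0, -2.0, 0.5]
    values = [10.0, 20.0, 30.0, 40.0]
    print(safe_softmax_unfused(scores))
    print(safe_softmax_fused(scores))
    print(softmax_dot_fused(scores, values))
```

The fused versions read each input once for the reductions, which is the property that makes fusing such loops into one kernel attractive. The price is the extra rescaling multiply on every step.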
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Logic, programming, and type systems · Embedded Systems Design Techniques
