Automatic Horizontal Fusion for GPU Kernels
Ao Li, Bojian Zheng, Gennady Pekhimenko, and Fan Long

TL;DR
This paper introduces automatic horizontal fusion, a GPU kernel optimization technique that increases thread-level parallelism to hide instruction latencies. A tool called HFuse implements it, and the reported speedups range from 2.5% to 60.8%.
Contribution
The paper proposes horizontal fusion, a method that complements standard kernel fusion techniques by increasing thread-level parallelism, and implements it in HFuse, a source-to-source CUDA compiler.
Findings
Horizontal fusion speeds up GPU kernels by 2.5%-60.8%.
Horizontal fusion is especially beneficial when the fused kernels use different kinds of GPU resources (e.g., a memory-intensive kernel and a compute-intensive kernel).
HFuse effectively automates the horizontal fusion process.
Abstract
We present automatic horizontal fusion, a novel optimization technique that complements the standard kernel fusion techniques for GPU programs. Unlike standard fusion, whose goal is to eliminate intermediate data round trips, our horizontal fusion technique aims to increase thread-level parallelism to hide instruction latencies. We also present HFuse, a new source-to-source CUDA compiler that implements automatic horizontal fusion. Our experimental results show that horizontal fusion can speed up the running time by 2.5%-60.8%. Our results reveal that horizontal fusion is especially beneficial for fusing kernels with instructions that require different kinds of GPU resources (e.g., a memory-intensive kernel and a compute-intensive kernel).
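To make the idea in the abstract concrete, here is a minimal sketch of the thread-partitioning step of horizontal fusion, emulated on the CPU in Python. It is not HFuse output and not CUDA. The names copy_kernel, poly_kernel, fused_kernel, and launch are hypothetical stand-ins chosen for illustration, and the emulated threads run one after another. The sketch therefore shows only how thread indices could be split between two independent kernels inside one fused launch, not the latency hiding that real GPU hardware would provide. It also assumes that neither kernel uses block-level barriers.

```python
def copy_kernel(tid, num_threads, src, dst):
    """Hypothetical stand-in for a memory-intensive kernel: strided element copy."""
    for i in range(tid, len(src), num_threads):
        dst[i] = src[i]


def poly_kernel(tid, num_threads, data):
    """Hypothetical stand-in for a compute-intensive kernel: repeated arithmetic."""
    for i in range(tid, len(data), num_threads):
        x = data[i]
        for _ in range(16):
            x = (x * x + 1) % 9973
        data[i] = x


def fused_kernel(tid, d1, d2, src, dst, data):
    """Fused body: threads [0, d1) act as copy_kernel, threads [d1, d1 + d2) as poly_kernel."""
    if tid < d1:
        copy_kernel(tid, d1, src, dst)
    else:
        # Remap the index so the second kernel still sees thread ids 0..d2-1.
        poly_kernel(tid - d1, d2, data)


def launch(block_dim, body):
    """Emulate one thread block sequentially by calling body(tid) for every thread id."""
    for tid in range(block_dim):
        body(tid)


def run_separately(src, data, d1, d2):
    """Baseline: two launches, one per kernel."""
    dst, out = [0] * len(src), list(data)
    launch(d1, lambda tid: copy_kernel(tid, d1, src, dst))
    launch(d2, lambda tid: poly_kernel(tid, d2, out))
    return dst, out


def run_fused(src, data, d1, d2):
    """Horizontal fusion: a single launch with d1 + d2 threads."""
    dst, out = [0] * len(src), list(data)
    launch(d1 + d2, lambda tid: fused_kernel(tid, d1, d2, src, dst, out))
    return dst, out
```

Calling run_fused(src, data, 8, 4) and run_separately(src, data, 8, 4) on the same inputs should give identical results, because each thread in the fused launch does exactly the work of one thread from one of the original kernels. On a GPU, the potential benefit is that warps from both kernels are resident at the same time, so the scheduler can overlap the memory stalls of one with the arithmetic of the other.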
Taxonomy
Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Advanced Image and Video Retrieval Techniques
