FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion
Li-Wen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun Zhang, Zuquan Song, Chengji Yao, Ziheng Jiang, Haibin Lin, Xin Jin, Xin Liu

TL;DR
Flux is a GPU kernel fusion technique that significantly overlaps communication and computation in large-scale deep learning, achieving notable speedups in training and inference by hiding communication latencies.
Contribution
Flux introduces a novel kernel fusion method that over-decomposes and fuses communication and computation to effectively hide communication latency on GPUs.
Findings
Achieves up to 1.24x training speedup on Megatron-LM
Overlaps up to 96% of communication with computation
Provides up to 1.66x inference speedup on vLLM
Abstract
Large deep learning models have demonstrated a strong ability to solve many tasks across a wide range of applications. These large models typically require training and inference to be distributed. Tensor parallelism is a common technique that partitions the computation of an operation or layer across devices to overcome the memory capacity limitation of a single processor and/or to accelerate computation to meet a certain latency requirement. However, this kind of parallelism introduces additional communication that might contribute a significant portion of overall runtime. This limits the scalability of the technique within a group of devices with high-speed interconnects, such as GPUs with NVLinks in a node. This paper proposes a novel method, Flux, to significantly hide communication latencies with dependent computations for GPUs. Flux over-decomposes communication and computation operations…
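The idea of over-decomposition can be illustrated with a small sketch. It is not Flux's kernel and does not reflect its actual tiling or API. It is a pure-Python simulation, under the assumption that a tensor-parallel GEMM is sharded along the inner dimension and that the output is split into row tiles. All names here (`split_k`, `coarse_grained`, `over_decomposed`, `tile_rows`) are hypothetical. The sketch only shows that reducing each tile as soon as it is computed gives the same result as one large reduction after the full GEMM. On real hardware the reduction of one tile could run while the next tile is being computed, whereas here the steps run one after another.

```python
# Illustrative simulation only: "devices" are plain Python lists in one process.

def matmul(a, b):
    """Dense product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def split_k(a, b, num_devices):
    """Shard A by columns and B by rows, so device d holds the pair (A_d, B_d).

    Assumes the inner dimension is divisible by num_devices.
    """
    step = len(b) // num_devices
    shards = []
    for d in range(num_devices):
        lo, hi = d * step, (d + 1) * step
        shards.append(([row[lo:hi] for row in a], b[lo:hi]))
    return shards

def coarse_grained(shards):
    """Baseline: each device finishes its whole partial GEMM, then one reduction runs."""
    partials = [matmul(a_d, b_d) for a_d, b_d in shards]
    rows, cols = len(partials[0]), len(partials[0][0])
    return [[sum(p[i][j] for p in partials) for j in range(cols)] for i in range(rows)]

def over_decomposed(shards, tile_rows):
    """Tile-wise variant: every row tile is reduced right after it is computed."""
    rows = len(shards[0][0])
    cols = len(shards[0][1][0])
    out = [[0] * cols for _ in range(rows)]
    events = []  # records the order of compute and reduce steps
    for start in range(0, rows, tile_rows):
        stop = min(start + tile_rows, rows)
        for d, (a_d, b_d) in enumerate(shards):
            tile = matmul(a_d[start:stop], b_d)  # compute one small tile
            events.append(("compute", d, start))
            for i, tile_row in enumerate(tile):  # "communicate": add tile into output
                for j, value in enumerate(tile_row):
                    out[start + i][j] += value
            events.append(("reduce", d, start))
    return out, events

A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
B = [[1, 0], [0, 1], [2, 3], [4, 5]]
shards = split_k(A, B, num_devices=2)
tiled, events = over_decomposed(shards, tile_rows=2)
print(tiled == coarse_grained(shards) == matmul(A, B))
```

In the baseline, no communication can start until the full partial product exists. In the tile-wise variant, the `events` list alternates compute and reduce steps, which is the interleaving that a fused kernel could exploit to hide communication behind the remaining computation.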
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Embedded Systems Design Techniques · Brain Tumor Detection and Classification
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
