FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion

Li-Wen Chang; Wenlei Bao; Qi Hou; Chengquan Jiang; Ningxin Zheng; Yinmin Zhong; Xuanrun Zhang; Zuquan Song; Chengji Yao; Ziheng Jiang; Haibin Lin; Xin Jin; Xin Liu

arXiv:2406.06858 · cs.LG · October 25, 2024 · 3 cites

TL;DR

Flux is a GPU kernel fusion technique that significantly overlaps communication and computation in large-scale deep learning, achieving notable speedups in training and inference by hiding communication latencies.

Contribution

Flux introduces a novel kernel fusion method that over-decomposes and fuses communication and computation to effectively hide communication latency on GPUs.

Findings

01. Achieves up to 1.24x training speedup on Megatron-LM

02. Overlaps up to 96% of communication with computation

03. Provides up to 1.66x inference speedup on vLLM

Abstract

Large deep learning models have demonstrated a strong ability to solve many tasks across a wide range of applications. These large models typically require training and inference to be distributed. Tensor parallelism is a common technique that partitions the computation of an operation or layer across devices, either to overcome the memory capacity limit of a single processor or to accelerate computation to meet a latency requirement. However, this kind of parallelism introduces additional communication that can account for a significant portion of overall runtime. This limits the scalability of the technique even within a group of devices with high-speed interconnects, such as GPUs connected by NVLink within a node. This paper proposes a novel method, Flux, to significantly hide communication latencies with dependent computations for GPUs. Flux over-decomposes communication and computation operations…
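To make the idea of over-decomposition concrete, here is a minimal pure-Python simulation, not the Flux kernels themselves. It assumes a tensor-parallel layer whose inner dimension is split across two simulated devices, so each device produces a partial output that must be summed with the others. The function names (`coarse_grained`, `over_decomposed`) and the event log are hypothetical and exist only for illustration. The coarse-grained version finishes all computation before any reduction starts. The over-decomposed version hands each output tile to the reduction as soon as it is computed, which is the dependency structure that lets communication hide behind the remaining computation.

```python
def matmul(a, b):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_into(dst, src, row_offset):
    """Accumulate src into dst starting at row_offset (stand-in for a reduction)."""
    for i, row in enumerate(src):
        for j, value in enumerate(row):
            dst[row_offset + i][j] += value


def coarse_grained(a_shards, b_shards):
    """Every device finishes its whole partial GEMM, then the reduction runs."""
    m, n = len(a_shards[0]), len(b_shards[0][0])
    out = [[0] * n for _ in range(m)]
    partials = [matmul(a, b) for a, b in zip(a_shards, b_shards)]  # all compute
    for partial in partials:                                       # then all comm
        add_into(out, partial, 0)
    return out


def over_decomposed(a_shards, b_shards, tile_rows):
    """Split each partial GEMM into row tiles and reduce each tile when ready."""
    m, n = len(a_shards[0]), len(b_shards[0][0])
    out = [[0] * n for _ in range(m)]
    events = []  # hypothetical log of the order in which work is issued
    for dev, (a, b) in enumerate(zip(a_shards, b_shards)):
        for start in range(0, m, tile_rows):
            tile = matmul(a[start:start + tile_rows], b)
            events.append(("compute", dev, start))
            # On real hardware this transfer could proceed while the next
            # tile is being computed; here it is simply applied in order.
            add_into(out, tile, start)
            events.append(("send", dev, start))
    return out, events


# Assumed toy problem: a 4x4 by 4x2 product with the inner dimension
# split across two simulated devices.
A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
B = [[1, 0], [0, 1], [1, 1], [2, 0]]
a_shards = [[row[:2] for row in A], [row[2:] for row in A]]
b_shards = [B[:2], B[2:]]

fused, events = over_decomposed(a_shards, b_shards, tile_rows=2)
assert fused == coarse_grained(a_shards, b_shards) == matmul(A, B)
```

Both schedules produce the same result; only the granularity at which communication becomes available differs. The simulation runs everything sequentially, so it shows the ordering of work rather than any actual speedup.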

Code & Models

Repositories

bytedance/flux
pytorch

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Embedded Systems Design Techniques · Brain Tumor Detection and Classification

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings