FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression
Zhenheng Tang, Xueze Kang, Yiming Yin, Xinglin Pan, Yuxin Wang, Xin He, Qiang Wang, Rongfei Zeng, Kaiyong Zhao, Shaohuai Shi, Amelie Chi Zhou, Bo Li, Bingsheng He, Xiaowen Chu

TL;DR
FusionLLM introduces a decentralized system for training large language models on geo-distributed GPUs, utilizing a DAG-based model representation and adaptive compression to improve efficiency and scalability across heterogeneous hardware and networks.
Contribution
The paper presents a novel decentralized training system with a DAG-based model representation, adaptive compression, and optimized scheduling for geo-distributed GPU clusters, addressing key efficiency challenges.
Findings
Achieves 1.45-9.39x speedup over baseline methods.
Supports flexible model definitions and heterogeneous hardware.
Ensures convergence while improving training throughput.
Abstract
To alleviate hardware scarcity in training large deep neural networks (DNNs), particularly large language models (LLMs), we present FusionLLM, a decentralized training system designed and implemented for training DNNs using geo-distributed GPUs across different computing clusters or individual devices. Decentralized training faces significant challenges regarding system design and efficiency, including: 1) the need for remote automatic differentiation (RAD), 2) support for flexible model definitions and heterogeneous software, 3) heterogeneous hardware leading to low resource utilization or the straggler problem, and 4) slow network communication. To address these challenges, in the system design, we represent the model as a directed acyclic graph of operators (OP-DAG). Each node in the DAG represents an operator in the DNN, while each edge represents the data dependency between…
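The OP-DAG idea described in the abstract can be illustrated with a toy example. The sketch below is not FusionLLM's implementation: the operator names (`embed`, `attn`, `mlp`, `add`), the list-based "tensors", and the `run_op_dag` helper are all hypothetical, chosen only to show operators as nodes, data dependencies as edges, and execution in a dependency-respecting order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical toy operators acting on plain lists of floats.
ops = {
    "embed": lambda x: [v * 2.0 for v in x],
    "attn": lambda h: [v + 1.0 for v in h],
    "mlp": lambda h: [v * v for v in h],
    "add": lambda a, b: [p + q for p, q in zip(a, b)],
}

# Edges as data dependencies: each operator maps to the operators
# whose outputs it consumes. An empty list marks a graph input node.
deps = {
    "embed": [],
    "attn": ["embed"],
    "mlp": ["embed"],
    "add": ["attn", "mlp"],
}


def run_op_dag(ops, deps, model_input):
    """Run every operator once, in an order that respects the edges."""
    order = list(TopologicalSorter(deps).static_order())
    outputs = {}
    for name in order:
        # Operators without predecessors read the model input directly.
        inputs = [outputs[d] for d in deps[name]] or [model_input]
        outputs[name] = ops[name](*inputs)
    return order, outputs


order, outputs = run_op_dag(ops, deps, [1.0, 2.0])
print(order)
print(outputs["add"])
```

In a decentralized setting, one could imagine assigning groups of nodes to different devices and sending edge data over the network; that partitioning step is deliberately left out here, since the excerpt above does not describe it.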
Taxonomy
Topics: Natural Language Processing Techniques · Robotics and Automated Systems · Mobile Agent-Based Network Management
Methods: Attention Is All You Need · Cosine Annealing · Linear Layer · Multi-Head Attention · Dropout · Layer Normalization · Linear Warmup With Cosine Annealing · Adam · Attention Dropout
