Improving training time and GPU utilization in geo-distributed language model training
Palak (Microsoft Research India), Tella Rajashekhar Reddy (Microsoft Research India), Bhaskar Kataria (Cornell University, USA), Rohan Gandhi (Microsoft Research India), Karan Tandon (Microsoft Research India), Debopam Bhattacherjee (Microsoft Research India)

TL;DR
This paper introduces Atlas and BubbleTea, two systems for training large language models across multiple data centers connected over a WAN. Atlas reduces training time, and BubbleTea raises GPU utilization by running inference prefill during idle GPU cycles.
Contribution
The paper presents novel workload-aware temporal bandwidth sharing (Atlas) and a prefill-as-a-service approach (BubbleTea), which together enable faster training and higher GPU utilization in geo-distributed LM training.
Findings
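Atlas and BubbleTea together achieve up to 17x faster training than state-of-the-art designs
Atlas and BubbleTea together achieve up to 94% GPU utilization
BubbleTea fills idle GPU cycles (bubbles) with prefill work without any impact on training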
Abstract
The widespread adoption of language models (LMs) has caused a huge surge in demand for GPUs. Training large LMs requires tens of thousands of GPUs, and housing them in the same datacenter (DC) is a challenge due to many constraints, including the availability of peak power. We focus on training such models across multiple DCs connected via the wide-area network (WAN). We built Atlas, which speeds up training using novel workload-aware temporal bandwidth sharing and other design choices. While Atlas improves the training time, it does not completely eliminate the bubbles (idle GPU cycles). We built BubbleTea, which runs prefill-as-a-service (part of LM inference) during the bubbles, thus improving GPU utilization without any impact on training. Compared to state-of-the-art designs, Atlas and BubbleTea together achieve up to 17x faster training and up to 94% GPU utilization. The code…
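The relationship between bubbles and GPU utilization described in the abstract can be illustrated with a small calculation. This is only a sketch under simple assumptions, not the paper's model: the function name `gpu_utilization`, its parameters, and all numbers are hypothetical, and it assumes that work placed in a bubble does not lengthen the training iteration.

```python
def gpu_utilization(busy_ms: float, bubble_ms: float, filled_ms: float = 0.0) -> float:
    """Fraction of one training iteration during which the GPU does useful work.

    busy_ms   : time spent on training compute
    bubble_ms : idle time (bubbles) within the iteration
    filled_ms : part of the bubble time given to other work, e.g. prefill
    """
    if busy_ms < 0 or bubble_ms < 0:
        raise ValueError("durations must be non-negative")
    if not 0.0 <= filled_ms <= bubble_ms:
        raise ValueError("filled_ms must lie between 0 and bubble_ms")
    total_ms = busy_ms + bubble_ms  # assumption: filling bubbles leaves iteration length unchanged
    if total_ms == 0:
        raise ValueError("iteration length must be positive")
    return (busy_ms + filled_ms) / total_ms


# Hypothetical numbers, not taken from the paper.
training_only = gpu_utilization(busy_ms=600.0, bubble_ms=400.0)
with_prefill = gpu_utilization(busy_ms=600.0, bubble_ms=400.0, filled_ms=300.0)
print(f"training only: {training_only:.2f}, with prefill in bubbles: {with_prefill:.2f}")
```

In this toy setting, shortening the bubbles and filling the remaining bubble time both raise the ratio, which loosely mirrors the two complementary roles the abstract assigns to Atlas and BubbleTea.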
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Focus
