Improving training time and GPU utilization in geo-distributed language model training

Palak (Microsoft Research India); Tella Rajashekhar Reddy (Microsoft Research India); Bhaskar Kataria (Cornell University, USA); Rohan Gandhi (Microsoft Research India); Karan Tandon (Microsoft Research India); Debopam Bhattacherjee (Microsoft Research India); Venkata N. Padmanabhan (Microsoft Research India)

arXiv:2411.14458 · cs.DC · October 21, 2025 · 2 cites

TL;DR

This paper introduces Atlas and BubbleTea, two systems for training large language models across multiple data centers connected over the WAN. Atlas shortens training time with workload-aware temporal bandwidth sharing, and BubbleTea raises GPU utilization by running inference prefill during the idle GPU cycles that remain.

Contribution

The paper presents workload-aware temporal bandwidth sharing (Atlas) and a prefill-as-a-service approach (BubbleTea), which together enable faster training and higher GPU utilization in geo-distributed LM training.

Findings

01 Up to 17x faster training compared to state-of-the-art designs

02 Up to 94% GPU utilization

03 Idle GPU cycles (bubbles) during training are filled with inference prefill work, without any impact on training

Abstract

The widespread adoption of language models (LMs) has caused a huge surge in demand for GPUs. Training large LMs requires tens of thousands of GPUs, and housing them in the same datacenter (DC) is a challenge due to many constraints, including the availability of peak power. We focus on training such models across multiple DCs connected via the wide-area network (WAN). We built Atlas, which speeds up training using novel workload-aware temporal bandwidth sharing and other design choices. While Atlas improves the training time, it does not completely eliminate the bubbles (idle GPU cycles). We built BubbleTea, which runs prefill-as-a-service (part of LM inference) during the bubbles, thus improving GPU utilization without any impact on training. Compared to state-of-the-art designs, Atlas and BubbleTea together achieve up to 17x faster training and up to 94% GPU utilization. The code…
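To make the notion of bubbles concrete, the following is a minimal sketch of how GPU utilization changes when idle gaps in a training timeline are filled with short inference jobs. It is not the BubbleTea scheduler: the names `Interval`, `find_bubbles`, and `fill_bubbles`, the greedy largest-first placement, and all numbers are assumptions chosen for illustration. The only property taken from the abstract is that inference work runs inside the bubbles and never delays training, which the sketch models by placing a job only if it fits entirely inside a gap.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    start: float
    end: float

    @property
    def length(self) -> float:
        return self.end - self.start


def find_bubbles(busy, horizon):
    """Return the idle gaps left by the busy intervals within [0, horizon]."""
    bubbles = []
    cursor = 0.0
    for iv in sorted(busy, key=lambda iv: iv.start):
        if iv.start > cursor:
            bubbles.append(Interval(cursor, iv.start))
        cursor = max(cursor, iv.end)
    if cursor < horizon:
        bubbles.append(Interval(cursor, horizon))
    return bubbles


def fill_bubbles(bubbles, prefill_durations):
    """Greedily place prefill jobs (longest first) into bubbles.

    A job is placed only if it fits completely inside the remaining
    space of a bubble, so the training intervals are never shifted.
    Returns the total time covered by placed jobs.
    """
    pending = sorted(prefill_durations, reverse=True)
    filled = 0.0
    for bubble in bubbles:
        remaining = bubble.length
        still_pending = []
        for duration in pending:
            if duration <= remaining:
                remaining -= duration
                filled += duration
            else:
                still_pending.append(duration)
        pending = still_pending
    return filled


# Hypothetical timeline for one GPU over 10 time units (illustrative only).
horizon = 10.0
training = [Interval(0, 2), Interval(5, 7), Interval(9, 10)]
prefill_jobs = [2.0, 1.0, 4.0]  # the 4.0 job fits no bubble and is skipped

training_time = sum(iv.length for iv in training)
bubbles = find_bubbles(training, horizon)
filled_time = fill_bubbles(bubbles, prefill_jobs)

print("utilization, training only:", training_time / horizon)
print("utilization, with prefill: ", (training_time + filled_time) / horizon)
```

In this toy timeline the GPU is busy with training for half of the horizon, and filling the gaps with the jobs that fit raises the busy fraction without moving any training interval. Prefill is a plausible fit for this role because its duration is largely determined by the prompt length, so a scheduler can check ahead of time whether a job fits a given gap; that reasoning is an assumption of this sketch rather than a claim taken from the abstract.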

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Focus