Data movement limits to frontier model training
Ege Erdil, David Schneider-Joseph

TL;DR
This paper presents a theoretical analysis of data movement bottlenecks in large-scale distributed training and argues that, at recent rates of growth, training runs could reach fundamental scaling limits in about three years.
Contribution
It introduces a model to analyze how data movement constraints impact the scalability of dense and sparse training runs, identifying key thresholds and potential strategies for larger-scale training.
Findings
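Under baseline assumptions and a three-month training duration, data movement bottlenecks begin to significantly reduce hardware utilization for training runs beyond about 10^28 FLOP.
Training runs exceeding about 10^31 FLOP are infeasible even at low utilization.
More aggressive batch size scaling and/or shorter, fatter model shapes, if achievable, could permit much larger training runs.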
Abstract
We present a theoretical model of distributed training, and use it to analyze how far dense and sparse training runs can be scaled. Under our baseline assumptions, given a three month training duration, data movement bottlenecks begin to significantly lower hardware utilization for training runs exceeding about 10^28 FLOP, two orders of magnitude above the largest training run to date, suggesting the arrival of fundamental barriers to scaling in three years given recent rates of growth. A training run exceeding about 10^31 FLOP is infeasible even at low utilization. However, more aggressive batch size scaling and/or shorter and fatter model shapes, if achievable, have the potential to permit much larger training runs.
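The timeline in the abstract can be checked with simple compounding arithmetic. The sketch below is an illustration, not the paper's model: it assumes the largest run to date is exactly two orders of magnitude below the 10^28 FLOP threshold and that training compute grows at a constant annual factor. The function names and the constant-growth extrapolation are assumptions introduced here.

```python
import math


def implied_growth_per_year(current_flop, target_flop, years):
    """Constant annual growth factor that takes current_flop to target_flop in `years`."""
    return (target_flop / current_flop) ** (1.0 / years)


def years_to_reach(current_flop, target_flop, growth_per_year):
    """Years of compounding at growth_per_year needed to go from current_flop to target_flop."""
    return math.log(target_flop / current_flop) / math.log(growth_per_year)


# Figures taken from the abstract.
UTILIZATION_THRESHOLD_FLOP = 1e28   # utilization starts to drop significantly
INFEASIBLE_THRESHOLD_FLOP = 1e31    # infeasible even at low utilization

# Assumption: "two orders of magnitude above the largest training run to date"
# is read as an exact factor of 100.
largest_run_flop = UTILIZATION_THRESHOLD_FLOP / 100

# Growth rate implied by closing a 100x gap in three years.
growth = implied_growth_per_year(largest_run_flop, UTILIZATION_THRESHOLD_FLOP, 3.0)

# Hypothetical extrapolation: time to the infeasibility threshold if the same
# growth rate simply continued (the paper makes no such claim).
years_to_infeasible = years_to_reach(largest_run_flop, INFEASIBLE_THRESHOLD_FLOP, growth)

print(f"implied growth: {growth:.2f}x per year")
print(f"years to 1e31 FLOP at that rate: {years_to_infeasible:.1f}")
```

Under these assumptions, closing a 100x gap in three years corresponds to compute growing by roughly 4.6x per year, which is the sense in which the abstract ties the threshold to recent growth rates.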
Taxonomy
Topics: Machine Learning and Data Classification
