Hardware Scaling Trends and Diminishing Returns in Large-Scale Distributed Training
Jared Fernandez, Luca Wehrstedt, Leonid Shamis, Mostafa Elhoushi, Kalyan Saladi, Yonatan Bisk, Emma Strubell, Jacob Kahn

TL;DR
This paper empirically investigates large-scale distributed training of neural networks, revealing that beyond a certain point, communication overhead and diminishing returns limit the efficiency of scaling models and hardware resources.
Contribution
It provides a comprehensive empirical analysis of hardware configurations and parallelization strategies, highlighting their impact on large-scale model training efficiency.
Findings
Communication overhead affects parallelization strategy effectiveness.
Diminishing returns occur when scaling hardware resources.
Proper optimization does not eliminate diminishing returns.
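To make the findings concrete, here is a minimal sketch of a toy cost model for data-parallel training. It is not the paper's methodology or measurements. Every name and constant in it (`step_time`, `scaling_efficiency`, the 8-second compute time, the gradient size, the bandwidth and latency figures) is a hypothetical value chosen for illustration. The model assumes a fixed global batch, a ring all-reduce of the gradients, and no overlap between computation and communication.

```python
def step_time(n_devices, compute_s=8.0, grad_bytes=2e9,
              bandwidth_Bps=25e9, latency_s=5e-6):
    """Toy estimate of one training step's wall-clock time in seconds.

    All default values are illustrative assumptions, not measured numbers.
    """
    # Compute is split evenly across devices (fixed global batch).
    compute = compute_s / n_devices
    if n_devices == 1:
        return compute  # no gradient synchronization needed
    # Ring all-reduce: each device transfers 2*(n-1)/n of the gradient
    # bytes, spread over 2*(n-1) latency-bound communication rounds.
    transfer = 2 * (n_devices - 1) / n_devices * grad_bytes / bandwidth_Bps
    rounds = 2 * (n_devices - 1) * latency_s
    return compute + transfer + rounds


def scaling_efficiency(n_devices):
    """Speedup over one device divided by the number of devices."""
    return step_time(1) / (n_devices * step_time(n_devices))


if __name__ == "__main__":
    for n in (1, 8, 64, 512, 1024):
        speedup = step_time(1) / step_time(n)
        print(f"{n:5d} devices: speedup {speedup:7.2f}, "
              f"efficiency {scaling_efficiency(n):.3f}")
```

In this toy model the communication term does not shrink as devices are added, so per-device efficiency falls steadily even while total speedup still grows. That is the qualitative pattern the findings describe. Real systems overlap communication with computation and combine several parallelization strategies, so the actual curves depend on the hardware and configuration.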
Abstract
Dramatic increases in the capabilities of neural network models in recent years are driven by scaling model size, training data, and corresponding computational resources. To develop the exceedingly large networks required in modern applications, such as large language models (LLMs), model training is distributed across tens of thousands of hardware accelerators (e.g. GPUs), requiring orchestration of computation and communication across large computing clusters. In this work, we demonstrate that careful consideration of hardware configuration and parallelization strategy is critical for effective (i.e. compute- and cost-efficient) scaling of model size, training data, and total computation. We conduct an extensive empirical study of the performance of large-scale LLM training workloads across model size, hardware configurations, and distributed parallelization strategies. We…
Taxonomy
Topics: Distributed and Parallel Computing Systems · Simulation Techniques and Applications
