ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale
William Won, Taekyung Heo, Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, Tushar Krishna

TL;DR
This paper extends the ASTRA-sim simulation infrastructure to model hierarchical networks and disaggregated systems, enabling scalable analysis of large-model distributed training architectures.
Contribution
It introduces new modeling capabilities for arbitrary model parallelization, multi-dimensional topologies, and advanced memory systems within ASTRA-sim.
Findings
Supports arbitrary model parallelization strategies.
Enables simulation of large-scale, heterogeneous distributed systems.
Provides accurate performance estimates for emerging training platforms.
Abstract
As deep learning models and input data are scaling at an unprecedented rate, it is inevitable to move towards distributed training platforms to fit the model and increase training throughput. State-of-the-art approaches and techniques, such as wafer-scale nodes, multi-dimensional network topologies, disaggregated memory systems, and parallelization strategies, have been actively adopted by emerging distributed training systems. This results in a complex SW/HW co-design stack of distributed training, necessitating a modeling/simulation infrastructure for design-space exploration. In this paper, we extend the open-source ASTRA-sim infrastructure and endow it with the capabilities to model state-of-the-art and emerging distributed training models and platforms. More specifically, (i) we enable ASTRA-sim to support arbitrary model parallelization strategies via a graph-based training-loop…
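The graph-based view of a training loop mentioned in the abstract can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not ASTRA-sim's input format or scheduler: the function `simulate`, the node names, and the durations are all hypothetical, and the model assumes every node starts as soon as its predecessors finish, with no resource contention.

```python
from collections import deque

def simulate(nodes, deps):
    """Return the finish time of every node in a dependency graph.

    nodes: {name: duration}            (hypothetical compute/communication ops)
    deps:  {name: [predecessor names]} (edges of the training-loop graph)
    Assumption: a node starts once all predecessors finish; no contention.
    """
    indegree = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for node, preds in deps.items():
        for p in preds:
            indegree[node] += 1
            children[p].append(node)

    ready = deque(n for n, d in indegree.items() if d == 0)
    start = {n: 0.0 for n in nodes}
    finish = {}
    while ready:
        n = ready.popleft()
        finish[n] = start[n] + nodes[n]
        for c in children[n]:
            # A child cannot start before its latest predecessor finishes.
            start[c] = max(start[c], finish[n])
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)

    if len(finish) != len(nodes):
        raise ValueError("dependency graph contains a cycle")
    return finish

# Hypothetical two-layer training step with made-up durations.
nodes = {
    "fwd0": 2.0, "fwd1": 2.0,
    "bwd1": 4.0, "allreduce1": 3.0,
    "bwd0": 4.0, "allreduce0": 3.0,
}
deps = {
    "fwd1": ["fwd0"],
    "bwd1": ["fwd1"],
    "allreduce1": ["bwd1"],   # gradient communication for layer 1
    "bwd0": ["bwd1"],         # can overlap with allreduce1
    "allreduce0": ["bwd0"],
}
finish = simulate(nodes, deps)
step_time = max(finish.values())
```

Because the parallelization strategy is encoded only in the nodes and edges, changing the strategy in this sketch means changing the graph, not the traversal code. Here `allreduce1` overlaps with `bwd0`, so the step takes less time than the sum of all node durations.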
Taxonomy
Topics: Graph Theory and Algorithms · Advanced Data Storage Technologies · Stochastic Gradient Optimization Techniques
