ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale

William Won; Taekyung Heo; Saeed Rashidi; Srinivas Sridharan; Sudarshan Srinivasan; Tushar Krishna

arXiv:2303.14006 · cs.DC · April 15, 2025 · 1 citation

TL;DR

This paper extends the ASTRA-sim simulation infrastructure to model hierarchical networks and disaggregated systems, enabling scalable analysis of large-model distributed training architectures.

Contribution

It introduces new modeling capabilities for arbitrary model parallelization, multi-dimensional topologies, and advanced memory systems within ASTRA-sim.

Findings

01. Supports arbitrary model parallelization strategies.

02. Enables simulation of large-scale, heterogeneous distributed systems.

03. Provides accurate performance estimates for emerging training platforms.

Abstract

As deep learning models and input data are scaling at an unprecedented rate, it is inevitable to move towards distributed training platforms to fit the model and increase training throughput. State-of-the-art approaches and techniques, such as wafer-scale nodes, multi-dimensional network topologies, disaggregated memory systems, and parallelization strategies, have been actively adopted by emerging distributed training systems. This results in a complex SW/HW co-design stack of distributed training, necessitating a modeling/simulation infrastructure for design-space exploration. In this paper, we extend the open-source ASTRA-sim infrastructure and endow it with the capabilities to model state-of-the-art and emerging distributed training models and platforms. More specifically, (i) we enable ASTRA-sim to support arbitrary model parallelization strategies via a graph-based training-loop…

Taxonomy

Topics: Graph Theory and Algorithms · Advanced Data Storage Technologies · Stochastic Gradient Optimization Techniques