LIBRA: Enabling Workload-aware Multi-dimensional Network Topology Optimization for Distributed Training of Large AI Models
William Won, Saeed Rashidi, Sudarshan Srinivasan, Tushar Krishna

TL;DR
LIBRA is a framework that optimizes multi-dimensional network topologies to reduce communication bottlenecks in distributed training of large AI models, improving resource utilization and training efficiency.
Contribution
The paper introduces LIBRA, a novel framework for optimizing multi-dimensional network fabrics tailored for distributed AI training, addressing bandwidth allocation and architecture design.
Findings
LIBRA effectively enhances network bandwidth utilization.
Optimized fabrics reduce training communication overhead.
Framework enables co-optimization of network architecture and training performance.
Abstract
As model sizes in machine learning continue to scale, distributed training is necessary to accommodate model weights within each device and to reduce training time. However, this comes at the expense of increased communication overhead from the exchange of gradients and activations, which becomes the critical bottleneck of the end-to-end training process. In this work, we motivate the design of multi-dimensional networks within machine learning systems as a cost-efficient mechanism to enhance overall network bandwidth. We also identify that optimal bandwidth allocation is pivotal for multi-dimensional networks to ensure efficient resource utilization. We introduce LIBRA, a framework specifically focused on optimizing multi-dimensional fabric architectures. Through case studies, we demonstrate the value of LIBRA, both in architecting optimized fabrics under diverse constraints and in…
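To illustrate why bandwidth allocation matters in a multi-dimensional network, here is a minimal Python sketch of a toy model. It is not LIBRA's formulation. It assumes each network dimension carries a fixed traffic volume, that all dimensions transfer concurrently so the slowest one sets the communication time, and that a fixed total bandwidth budget is split across dimensions. The function names, traffic volumes, and budget are hypothetical values chosen for illustration.

```python
# Toy model (an assumption, not the paper's method): dimension i moves
# volumes[i] units of data at bandwidths[i] units per second, and all
# dimensions run concurrently, so the slowest dimension is the bottleneck.

def bottleneck_time(volumes, bandwidths):
    """Return the transfer time of the slowest dimension."""
    return max(v / b for v, b in zip(volumes, bandwidths))


def equal_allocation(volumes, total_bw):
    """Split the bandwidth budget evenly, ignoring the workload."""
    n = len(volumes)
    return [total_bw / n] * n


def proportional_allocation(volumes, total_bw):
    """Split the budget in proportion to each dimension's traffic.

    Under this toy model, this equalizes per-dimension transfer time,
    which minimizes the bottleneck for a fixed total budget.
    """
    total_volume = sum(volumes)
    return [total_bw * v / total_volume for v in volumes]


# Hypothetical 4-dimensional network: traffic per dimension and budget.
volumes = [8.0, 2.0, 1.0, 1.0]   # e.g. GB exchanged on each dimension
total_bw = 600.0                 # e.g. GB/s available in total

equal_bw = equal_allocation(volumes, total_bw)
aware_bw = proportional_allocation(volumes, total_bw)

print("equal split:         ", bottleneck_time(volumes, equal_bw))
print("workload-aware split:", bottleneck_time(volumes, aware_bw))
```

In this toy setting the even split leaves the busiest dimension starved while the others sit idle, whereas the workload-aware split finishes every dimension at the same time. A real design problem would also have to account for per-dimension cost, collective algorithms, and other constraints that this sketch ignores.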
Taxonomy
Topics: Neural Networks and Applications · Machine Learning and Data Classification · Speech Recognition and Synthesis
