Learning In Chaos: Efficient Autoscaling and Self-Healing for Multi-Party Distributed Training
Wenjiao Feng, Rongxing Xiao, Zonghang Li, Hongfang Yu, Gang Sun, Long Luo, Mohsen Guizani, Qirong Ho, Steve Liu

TL;DR
Chaos is a self-healing, autoscaling system for multi-party distributed training that efficiently handles churn and WAN heterogeneity, outperforming existing solutions in speed and resource utilization.
Contribution
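The paper introduces Chaos, a multi-party distributed training system that formalizes model sharding and assignment as an optimization problem capturing WAN heterogeneity and solves it with a polynomial-time greedy algorithm, enabling robust, elastic training in decentralized multi-party environments.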
Findings
Lower scale-out delay compared to Pollux, Elan, and Autoscaling
Handles churn events within 20 ms
Achieves superior resource utilization and scalability
Abstract
Node and link churn in multi-party, cross-region clusters over wide-area networks (WANs) often disrupts distributed training. However, checkpoint-based recovery and cloud-centric autoscaling react slowly and assume centralized control, which is misaligned with the self-governed setup where institutions can freely join and leave. This paper proposes Chaos, a multi-party distributed training system with self-healing and autoscaling, enabling robust and elastic training under churn. It speeds up autoscaling via multi-neighbor state replication and model sharding. We formalize the sharding and assignment as a MINLP that captures WAN heterogeneity, and reduce it to a tractable MILP by analyzing its monotonicity on a divisibility chain. By establishing an equivalence, we derive a greedy algorithm that follows optimality rules and yields the optimal solution in polynomial time. Chaos uses a…
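To make the idea of pulling sharded state from several neighbors over heterogeneous WAN links more concrete, here is a minimal sketch. It is not the paper's MINLP/MILP formulation or its greedy algorithm. It assumes a simplified setting invented for illustration: the model state is cut into equal-size shards, each neighbor has one fixed bandwidth to the joining node, and transfers run in parallel. The function name `assign_shards` and its parameters are hypothetical.

```python
import heapq


def assign_shards(num_shards, shard_mb, bandwidth_mb_s):
    """Illustrative only: split equal-size shards across neighbors.

    bandwidth_mb_s maps a (hypothetical) neighbor name to its link
    bandwidth in MB/s. Returns (shards per neighbor, time in seconds
    until the slowest neighbor finishes sending).
    """
    counts = {name: 0 for name in bandwidth_mb_s}
    # Heap entry: (finish time if this neighbor sends one more shard, name).
    heap = [(shard_mb / bw, name) for name, bw in bandwidth_mb_s.items()]
    heapq.heapify(heap)

    for _ in range(num_shards):
        # Give the next shard to the neighbor that would finish it earliest.
        _, name = heapq.heappop(heap)
        counts[name] += 1
        next_finish = (counts[name] + 1) * shard_mb / bandwidth_mb_s[name]
        heapq.heappush(heap, (next_finish, name))

    slowest = max(counts[n] * shard_mb / bandwidth_mb_s[n] for n in counts)
    return counts, slowest


if __name__ == "__main__":
    # Assumed example: three neighbors, one link twice as fast as the others.
    links = {"a": 100.0, "b": 50.0, "c": 50.0}
    counts, seconds = assign_shards(num_shards=8, shard_mb=100.0,
                                    bandwidth_mb_s=links)
    print(counts, seconds)
```

In this toy setting, fetching all eight shards from the fastest neighbor alone would take 8 seconds, while spreading them in proportion to bandwidth brings the slowest transfer down to 4 seconds. The real system additionally has to account for the constraints the abstract mentions, which this sketch leaves out.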
Taxonomy
Topics: Neural Networks and Applications
