Learning In Chaos: Efficient Autoscaling and Self-Healing for Multi-Party Distributed Training

Wenjiao Feng; Rongxing Xiao; Zonghang Li; Hongfang Yu; Gang Sun; Long Luo; Mohsen Guizani; Qirong Ho; Steve Liu

arXiv:2505.12815·cs.DC·September 16, 2025

TL;DR

Chaos is a self-healing, autoscaling system for multi-party distributed training that efficiently handles churn and WAN heterogeneity, outperforming existing solutions in speed and resource utilization.

Contribution

The paper introduces Chaos, a novel system with formalized sharding and assignment algorithms, enabling robust, elastic training in decentralized multi-party environments.

Findings

01. Lower scale-out delay than Pollux, Elan, and Autoscaling

02. Handles churn events within 20 ms

03. Achieves superior resource utilization and scalability

Abstract

Node and link churn in multi-party, cross-region clusters over wide-area networks (WANs) often disrupts distributed training. However, checkpoint-based recovery and cloud-centric autoscaling react slowly and assume centralized control, which is misaligned with the self-governed setup where institutions can freely join and leave. This paper proposes Chaos, a multi-party distributed training system with self-healing and autoscaling, enabling robust and elastic training under churn. It speeds up autoscaling via multi-neighbor state replication and model sharding. We formalize the sharding and assignment as a MINLP that captures WAN heterogeneity, and reduce it to a tractable MILP by analyzing its monotonicity on a divisibility chain. By establishing an equivalence, we derive a greedy algorithm that follows optimality rules and yields the optimal solution in polynomial time. Chaos uses a…
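To give some intuition for why pulling sharded state from several neighbors can shorten scale-out, here is a minimal sketch. It is not the paper's algorithm: the abstract describes a MINLP/MILP formulation and a greedy solver that are not reproduced here. The sketch assumes a simplified model in which a joining node fetches shards from all neighbors in parallel, shards can be split continuously, and each link is characterized only by a fixed bandwidth. The function names (`split_state`, `transfer_time`) and the numbers are hypothetical.

```python
# Illustrative only: a simplified bandwidth-proportional split, NOT the
# sharding and assignment algorithm from the paper.

def split_state(total_bytes, bandwidths):
    """Split total_bytes across neighbors in proportion to link bandwidth.

    bandwidths maps a neighbor name to its bandwidth in bytes per second.
    Under the continuous-split assumption, a proportional split makes every
    neighbor finish at the same time, which minimizes the slowest transfer.
    """
    total_bw = sum(bandwidths.values())
    return {n: total_bytes * bw / total_bw for n, bw in bandwidths.items()}


def transfer_time(shards, bandwidths):
    """Parallel transfer finishes when the slowest neighbor finishes."""
    return max(shards[n] / bandwidths[n] for n in shards)


# Hypothetical numbers: 1200 bytes of state, three heterogeneous WAN links.
bandwidths = {"a": 100.0, "b": 200.0, "c": 300.0}
shards = split_state(1200.0, bandwidths)

multi_neighbor = transfer_time(shards, bandwidths)
single_best = 1200.0 / max(bandwidths.values())  # fetch everything from "c"
```

In this toy setting the multi-neighbor pull takes half as long as fetching the full state from the single fastest neighbor. Real shards are discrete and links vary over time, which is where an optimization formulation such as the one the abstract mentions becomes necessary.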

Taxonomy

Topics: Neural Networks and Applications