SWARM+: Scalable and Resilient Multi-Agent Consensus for Fully-Decentralized Data-Aware Workload Management
Komal Thareja, Krishnan Raghavan, Anirban Mandal, Ewa Deelman

TL;DR
SWARM+ introduces a scalable, resilient, and data-aware decentralized consensus system for managing distributed scientific workloads across heterogeneous resources, improving efficiency and fault tolerance in large-scale environments.
Contribution
The paper presents novel algorithms that enhance scalability, resilience, and data-awareness in multi-agent workload management, validated through extensive experiments on a distributed testbed.
Findings
Scales to 1000 agents with balanced workload distribution
Maintains >99% job completion under single agent failure
Achieves 97-98% improvement in selection time and scheduling latency
Abstract
Distributed scientific workflows increasingly span heterogeneous compute clusters, edge resources, and geo-distributed data repositories. In these environments, a centralized orchestrator is an architectural bottleneck -- introducing a single point of failure, limiting scalability, and constraining adaptability to changing resource availability or failures. Decentralized multi-agent coordination offers a compelling alternative: autonomous agents representing distributed resources collaboratively negotiate workload assignment (e.g., job selection) through peer-to-peer consensus, making decisions based on local compute capacity, data locality, and network conditions. However, scaling such systems for production workloads requires addressing challenges in coordination, resilience, and data-aware optimization. This work presents SWARM+, which builds on our prior work that demonstrated the…
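To make the idea of coordinator-free job selection concrete, here is a minimal sketch. It is not the SWARM+ algorithm, whose details are not given in this excerpt. It assumes every agent has already exchanged the same set of bids with its peers, and that each agent scores itself from the three factors the abstract names (local compute capacity, data locality, network conditions). The `Agent` fields, the weights, and the function names `cost` and `select_agent` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    """Hypothetical bid an agent shares with its peers for one job."""
    agent_id: int
    load: float        # fraction of local compute in use, 0.0 to 1.0
    has_data: bool     # True if the job's input data is already local
    latency_ms: float  # network latency to the job's data source


def cost(agent: Agent, w_load: float = 1.0, w_data: float = 0.5,
         w_net: float = 0.01) -> float:
    """Lower is better. The weights are illustrative assumptions."""
    data_penalty = 0.0 if agent.has_data else 1.0
    return w_load * agent.load + w_data * data_penalty + w_net * agent.latency_ms


def select_agent(bids: list[Agent]) -> int:
    """Return the id of the agent that should take the job.

    Every agent applies this same deterministic rule to the same bids,
    so all agents arrive at the same winner without a central
    orchestrator. Ties are broken by the lowest agent id.
    """
    return min(bids, key=lambda a: (cost(a), a.agent_id)).agent_id


bids = [
    Agent(agent_id=1, load=0.9, has_data=True, latency_ms=5.0),
    Agent(agent_id=2, load=0.2, has_data=False, latency_ms=20.0),
    Agent(agent_id=3, load=0.2, has_data=True, latency_ms=5.0),
]
winner = select_agent(bids)
```

Because the rule depends only on the bids and not on the order in which they arrive, each agent can run it locally. A real system would also have to handle lost messages, stale bids, and agent failures, which this sketch ignores.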
Taxonomy
Topics: Distributed and Parallel Computing Systems · Scientific Computing and Data Management · Cloud Computing and Resource Management
