AGMARL-DKS: An Adaptive Graph-Enhanced Multi-Agent Reinforcement Learning for Dynamic Kubernetes Scheduling
Hamed Hamzeh

TL;DR
This paper introduces AGMARL-DKS, a scalable, stress-aware multi-agent reinforcement learning scheduler for Kubernetes that improves resource utilization, fault tolerance, and cost efficiency in dynamic cloud environments.
Contribution
It presents a novel multi-agent RL approach with graph neural networks and lexicographical ordering for adaptive, stress-aware scheduling in Kubernetes clusters.
Findings
Outperforms the default scheduler on Google Kubernetes Engine (GKE) in fault tolerance, resource utilization, and cost
Uses a multi-agent system with a GNN for global context awareness
Employs stress-aware lexicographical ordering for multi-objective trade-offs
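To make the last finding concrete, here is a minimal Python sketch of what a stress-aware lexicographic comparison between candidate nodes could look like. It is not the paper's implementation. The objective names, the stress threshold of 0.7, the tolerance bucketing, and the rule that stability takes priority under stress are all illustrative assumptions. Unlike a static weighted sum, a lexicographic ordering only consults a lower-priority objective when the higher-priority ones are (nearly) tied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeScore:
    """Hypothetical per-node scores; names and scales are assumptions."""
    name: str
    stability: float    # higher is better
    utilization: float  # higher is better
    cost: float         # lower is better


def objective_order(cluster_stress: float, threshold: float = 0.7):
    """Assumed rule: put stability first when the cluster is stressed."""
    if cluster_stress >= threshold:
        return ("stability", "utilization", "cost")
    return ("cost", "utilization", "stability")


def lexicographic_key(score: NodeScore, order, tolerance: float = 0.05):
    """Build a tuple that Python compares element by element.

    Values are bucketed by `tolerance` so that near-ties on one
    objective fall through to the next objective in the order.
    """
    key = []
    for objective in order:
        value = getattr(score, objective)
        if objective == "cost":
            value = -value  # negate so that larger always means better
        key.append(round(value / tolerance))
    return tuple(key)


def pick_node(scores, cluster_stress: float) -> str:
    """Return the name of the best node under the current priority order."""
    order = objective_order(cluster_stress)
    return max(scores, key=lambda s: lexicographic_key(s, order)).name


candidates = [
    NodeScore("a", stability=0.9, utilization=0.5, cost=0.8),
    NodeScore("b", stability=0.6, utilization=0.7, cost=0.3),
    NodeScore("c", stability=0.9, utilization=0.6, cost=0.9),
]

print(pick_node(candidates, cluster_stress=0.9))  # stability decides first
print(pick_node(candidates, cluster_stress=0.2))  # cost decides first
```

With these made-up scores, nodes "a" and "c" tie on stability under high stress, so utilization breaks the tie. Under low stress the cheapest node wins outright and the remaining objectives are never consulted.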
Abstract
State-of-the-art cloud-native applications require intelligent schedulers that can effectively balance system stability, resource utilisation, and associated costs. While Kubernetes provides feasibility-based placement by default, recent research efforts have explored the use of reinforcement learning (RL) for more intelligent scheduling decisions. However, current RL-based schedulers have three major limitations. First, most of these schedulers use monolithic centralised agents, which are non-scalable for large heterogeneous clusters. Second, the ones that use multi-objective reward functions assume simple, static, linear combinations of the objectives. Third, no previous work has produced a stress-aware scheduler that can react adaptively to dynamic conditions. To address these gaps in current research, we propose the Adaptive Graph-enhanced Multi-Agent Reinforcement Learning Dynamic…
