Streamlining Resilient Kubernetes Autoscaling with Multi-Agent Systems via an Automated Online Design Framework
Julien Soulé, Jean-Paul Jamont, Michel Occello, Louis-Marie Traonouez, Paul Théron

TL;DR
This paper presents an automated online framework for designing multi-agent systems to improve Kubernetes autoscaling resilience, especially under adversarial conditions like DDoS attacks.
Contribution
It introduces a novel four-phase framework for creating multi-agent autoscaling systems that decompose resilience goals and transfer learned policies from simulation to real clusters.
Findings
HPA MAS outperforms existing systems in resilience tests
The framework enables explainable agent behaviors
Policies transfer effectively from simulation to real environments
Abstract
In cloud-native systems, Kubernetes clusters with interdependent services often face challenges to their operational resilience due to workload management issues such as resource blocking, bottlenecks, or continuous pod crashes. These vulnerabilities are further amplified in adversarial scenarios, such as Distributed Denial-of-Service (DDoS) attacks. Conventional Horizontal Pod Autoscaling (HPA) approaches struggle to address such dynamic conditions, while reinforcement learning-based methods, though more adaptable, typically optimize single goals like latency or resource usage, neglecting broader failure scenarios. We propose decomposing the overarching goal of maintaining operational resilience into failure-specific sub-goals delegated to collaborative agents, collectively forming an HPA Multi-Agent System (MAS). We introduce an automated, four-phase online framework for HPA MAS…
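To make the contrast in the abstract concrete, here is a minimal Python sketch. The first function follows the replica rule documented for the standard Kubernetes HPA (desired = ceil(current × metric / target), with no change inside a tolerance band that defaults to 0.1). Everything after it is a hypothetical illustration of failure-specific sub-goals: the agent names, observation keys, thresholds, and the "take the largest proposal" combination rule are assumptions made for this example and are not taken from the paper, whose agents are learned rather than hand-written.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Observation = Dict[str, float]


def hpa_desired_replicas(current: int, metric: float, target: float,
                         tolerance: float = 0.1) -> int:
    """Single-metric HPA rule: ceil(current * metric / target).

    No scaling happens while the metric/target ratio stays within
    `tolerance` of 1.0.
    """
    ratio = metric / target
    if abs(ratio - 1.0) <= tolerance:
        return current
    return math.ceil(current * ratio)


@dataclass
class SubGoalAgent:
    """Hypothetical agent responsible for one failure mode."""
    name: str
    propose: Callable[[Observation, int], int]


def bottleneck_policy(obs: Observation, current: int) -> int:
    # Assumed threshold: add one replica when the request queue is long.
    return current + 1 if obs["queue_length"] > 100 else current


def crash_policy(obs: Observation, current: int) -> int:
    # Assumed threshold: add two replicas when pods restart frequently.
    return current + 2 if obs["restarts_per_min"] > 3 else current


def combine(agents: List[SubGoalAgent], obs: Observation, current: int,
            max_replicas: int) -> Tuple[int, Dict[str, int]]:
    """Illustrative coordination rule: largest proposal, capped at max_replicas."""
    proposals = {a.name: a.propose(obs, current) for a in agents}
    return min(max(proposals.values()), max_replicas), proposals


agents = [
    SubGoalAgent("bottleneck", bottleneck_policy),
    SubGoalAgent("crash", crash_policy),
]
obs = {"queue_length": 250.0, "restarts_per_min": 5.0}
decision, proposals = combine(agents, obs, current=4, max_replicas=10)
```

The per-agent `proposals` dictionary records which sub-goal drove the final replica count, which is one simple way a decomposed design can be inspected; a single-metric rule such as `hpa_desired_replicas` reacts only to the one metric it is given.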
Taxonomy
Topics: Software-Defined Networks and 5G · Software System Performance and Reliability · Network Security and Intrusion Detection
Methods: Mixing Adam and SGD
