DRPC: Distributed Reinforcement Learning Approach for Scalable Resource Provisioning in Container-based Clusters

Haoyu Bai; Minxian Xu; Kejiang Ye; Rajkumar Buyya; Chengzhong Xu

arXiv:2407.10169 · cs.DC · July 16, 2024 · 2 cites

TL;DR

This paper introduces DRPC, a distributed reinforcement learning method for scalable resource provisioning in container-based microservice clusters, improving response times and reducing failures.

Contribution

It presents a novel decentralized autoscaling approach using deep reinforcement learning to enhance scalability and efficiency in microservice environments.

Findings

01. Reduces average response time by 15%
02. Decreases failed requests by 24%
03. Demonstrates scalability improvements in large-scale clusters

Abstract

Microservices have transformed monolithic applications into lightweight, self-contained, and isolated application components, establishing themselves as a dominant paradigm for application development and deployment in public clouds such as Google and Alibaba. Autoscaling emerges as an efficient strategy for managing resources allocated to microservices' replicas. However, the dynamic and intricate dependencies within microservice chains present challenges to the effective management of scaled microservices. Additionally, the centralized autoscaling approach can encounter scalability issues, especially in the management of large-scale microservice-based clusters. To address these challenges and enhance scalability, we propose an innovative distributed resource provisioning approach for microservices based on the Twin Delayed Deep Deterministic Policy Gradient algorithm. This approach…
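
The abstract names the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm as the basis of the approach. As a rough illustration of what sets TD3 apart, the sketch below shows two of its standard ingredients: the clipped double-Q target and target policy smoothing. This is the generic algorithm, not DRPC's implementation. The function names, the default hyperparameters, and the reading of the action as a normalized resource allocation in [0, 1] are all assumptions made for the example.

```python
import random


def td3_target(reward, done, next_q1, next_q2, gamma=0.99):
    """Clipped double-Q target used by generic TD3.

    Taking the smaller of the two target-critic estimates counters the
    overestimation bias of a single critic. `done` is 1.0 for a terminal
    transition and 0.0 otherwise.
    """
    return reward + gamma * (1.0 - done) * min(next_q1, next_q2)


def smoothed_target_action(action, low, high, noise_std=0.2, noise_clip=0.5, rng=random):
    """Target policy smoothing: add clipped Gaussian noise to the target
    action, then clip the result to the valid action range.

    Hypothetical interpretation: `action` is a normalized resource
    allocation for one microservice, bounded by `low` and `high`.
    """
    noise = max(-noise_clip, min(noise_clip, rng.gauss(0.0, noise_std)))
    return max(low, min(high, action + noise))


if __name__ == "__main__":
    # Hypothetical transition: reward 1.0, non-terminal, critics disagree.
    print(td3_target(reward=1.0, done=0.0, next_q1=2.0, next_q2=3.0, gamma=0.5))
    print(smoothed_target_action(0.9, low=0.0, high=1.0))
```

The third TD3 ingredient, updating the actor less often than the critics, is a property of the training loop and is left out here for brevity.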

Taxonomy

Topics: Distributed and Parallel Computing Systems · Cloud Computing and Resource Management · Distributed systems and fault tolerance