A Deep Reinforcement Learning Framework for Optimizing Congestion Control in Data Centers

Shiva Ketabi; Hongkai Chen; Haiwei Dong; Yashar Ganjali

arXiv:2301.12558·cs.NI·March 27, 2024

TL;DR

This paper introduces a multiagent reinforcement learning framework to dynamically optimize congestion control parameters in data centers, improving throughput and latency over static configurations.

Contribution

It presents a novel multiagent reinforcement learning system for real-time congestion control parameter tuning in data centers, addressing the limitations of existing online learning methods.

Findings

01. The system adapts congestion control parameters in real time.

02. Experimental results show improved throughput and latency compared with static configurations.

03. The approach mitigates the limitations of static parameter settings.

Abstract

Various congestion control protocols have been designed to achieve high performance in different network environments. Modern online learning solutions that delegate the congestion control actions to a machine cannot properly converge within the stringent time scales of data centers. We leverage multiagent reinforcement learning to design a system for dynamic tuning of congestion control parameters at end-hosts in a data center. The system includes agents at the end-hosts that monitor and report the network and traffic states, and agents that run the reinforcement learning algorithm given those states. Based on the state of the environment, the system generates congestion control parameters that optimize network performance metrics such as throughput and latency. As a case study, we examine BBR, a prominent, recently developed congestion control protocol. Our experiments demonstrate…
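
The monitor-then-tune loop described in the abstract can be illustrated with a small sketch. This is not the authors' system: it swaps in a single tabular, epsilon-greedy learner (closer to a contextual bandit than to multiagent deep RL), and every name in it is hypothetical, including `HostState`, `TuningAgent`, `CANDIDATE_GAINS`, the reward weights, and the toy `simulate` environment. It only shows the general shape of the idea: observe a state, pick a parameter value, measure the result, and update the estimate.

```python
import random
from dataclasses import dataclass

# Hypothetical candidate values for one tunable BBR-style parameter (illustrative only).
CANDIDATE_GAINS = (1.0, 1.25, 1.5)


@dataclass(frozen=True)
class HostState:
    """Coarse state a monitoring agent might report (assumed fields)."""
    throughput_mbps: float
    rtt_ms: float


def discretize(state: HostState) -> tuple:
    """Bucket raw measurements so they can index a small lookup table."""
    return (int(state.throughput_mbps // 1000), int(state.rtt_ms // 1))


def reward(state: HostState, latency_weight: float = 100.0) -> float:
    """Assumed reward: favour throughput, penalise latency."""
    return state.throughput_mbps - latency_weight * state.rtt_ms


def simulate(gain: float) -> HostState:
    """Toy, deterministic stand-in for a real network measurement."""
    throughput = 8000.0 * min(gain, 1.25) / 1.25   # saturates at gain 1.25
    rtt = 0.1 + 4.0 * max(0.0, gain - 1.25)        # queueing delay above 1.25
    return HostState(throughput, rtt)


class TuningAgent:
    """Tabular epsilon-greedy learner standing in for the paper's RL agent."""

    def __init__(self, actions, epsilon=0.1, alpha=0.5, seed=0):
        self.actions = tuple(actions)
        self.epsilon = epsilon      # probability of trying a random action
        self.alpha = alpha          # step size for the value estimate
        self.rng = random.Random(seed)
        self.q = {}                 # (state key, action) -> estimated reward

    def choose(self, key):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.q.get((key, a), 0.0))

    def update(self, key, action, r):
        old = self.q.get((key, action), 0.0)
        self.q[(key, action)] = old + self.alpha * (r - old)


def run(agent, measure, steps=200, initial_gain=1.0):
    """Observe state, pick a parameter, measure the outcome, update the estimate."""
    state = measure(initial_gain)
    for _ in range(steps):
        key = discretize(state)
        gain = agent.choose(key)
        state = measure(gain)       # monitoring agent reports the new state
        agent.update(key, gain, reward(state))
    return agent
```

Calling `run(TuningAgent(CANDIDATE_GAINS), simulate)` exercises the loop against the toy environment. In a real deployment the `measure` callback would be replaced by actual end-host measurements, and the learner by whatever algorithm the paper specifies.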

Taxonomy

Topics: Cloud Computing and Resource Management · Software-Defined Networks and 5G · Neural Networks and Reservoir Computing