Prediction-Guided Control in Data Center Networks
Kevin Zhao, Chenning Li, Anton A. Zabreyko, Arash Nasr-Esfahany, Anna Goncharenko, David Dai, Sidharth Lakshmanan, Claire Li, Mohammad Alizadeh, Thomas E. Anderson

TL;DR
Polyphony is a system that lets data center network operators predict and control network quality of service. It reduces the frequency of poor tail latency events on a time scale of minutes by dynamically adjusting network configurations based on workload predictions.
Contribution
This paper introduces Polyphony, a system that combines workload monitoring, counterfactual prediction, and closed-loop control to adapt network configurations on a time scale of minutes, outperforming prior static and model-free methods.
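As a rough illustration of how monitoring, counterfactual prediction, and closed-loop control might fit together, the sketch below performs one control step: given the currently observed workload, it asks a predictor what the per-class tail latency would be under each candidate configuration and picks the candidate with the smallest total SLO violation. This is not Polyphony's actual algorithm. The names `choose_config`, `toy_predictor`, the dictionary-based `Config` and `Workload` types, and the load-over-weight latency model are all hypothetical and exist only to make the loop concrete.

```python
from typing import Callable, Dict, Sequence

Config = Dict[str, float]    # hypothetical: a scheduling weight per traffic class
Workload = Dict[str, float]  # hypothetical: offered load per traffic class


def choose_config(
    workload: Workload,
    candidates: Sequence[Config],
    predict_p99: Callable[[Workload, Config], Dict[str, float]],
    slo_p99: Dict[str, float],
) -> Config:
    """Return the candidate with the smallest predicted total SLO violation.

    predict_p99 answers the counterfactual question "what would each class's
    tail latency be if this configuration were deployed under this workload?".
    """
    def violation(cfg: Config) -> float:
        predicted = predict_p99(workload, cfg)
        # Sum, over traffic classes, how far the prediction exceeds the SLO.
        return sum(max(0.0, predicted[c] - slo_p99[c]) for c in slo_p99)

    # On ties, min() keeps the earliest candidate in the sequence.
    return min(candidates, key=violation)


def toy_predictor(workload: Workload, cfg: Config) -> Dict[str, float]:
    """Toy stand-in model (an assumption): latency rises with load, falls with weight."""
    return {c: workload[c] / cfg[c] for c in workload}


candidates = [
    {"interactive": 0.5, "batch": 0.5},
    {"interactive": 0.8, "batch": 0.2},
    {"interactive": 0.2, "batch": 0.8},
]
slo = {"interactive": 0.5, "batch": 2.0}

# One control step per monitoring interval; re-running it after a workload
# shift is what closes the loop.
before_shift = choose_config({"interactive": 0.2, "batch": 0.6}, candidates, toy_predictor, slo)
after_shift = choose_config({"interactive": 0.6, "batch": 0.2}, candidates, toy_predictor, slo)
```

In this toy setup the chosen configuration changes once the interactive class becomes the heavier one, which is the kind of adaptation a static configuration cannot provide. A real predictor would have to be far richer than a load-over-weight ratio.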
Findings
Polyphony converges to network SLOs within ten minutes.
It re-stabilizes after workload shifts within fifteen minutes.
Polyphony outperforms prior state-of-the-art methods in dynamic adaptation.
Abstract
In this paper, we design, implement, and evaluate Polyphony, a system that gives network operators a new way to control and reduce the frequency of poor tail latency events in multi-class data center networks, on the time scale of minutes. Polyphony is designed to complement other adaptive mechanisms such as congestion control and traffic engineering, but it targets aspects of network operation that have previously been considered static. In contrast, prior model-free optimization methods work best when there are only a few relevant degrees of freedom and when workloads and measurements are stable, assumptions that do not hold in modern data center networks. Polyphony develops novel methods for measuring, predicting, and controlling network quality of service metrics for a dynamically changing workload. First, we monitor and aggregate workloads on a network-wide…
Taxonomy
Topics: Cloud Computing and Resource Management · Software-Defined Networks and 5G · Software System Performance and Reliability
