Scalable Tail Latency Estimation for Data Center Networks

Kevin Zhao, Prateesh Goyal, Mohammad Alizadeh, and Thomas E. Anderson

arXiv:2205.01234·cs.NI·October 3, 2022·5 cites

Open Access

TL;DR

This paper introduces a fast, scalable method for estimating tail latency in large data center networks by decomposing the problem into parallel link simulations, achieving high accuracy without machine learning training delays.

Contribution

The authors develop a novel decomposition technique that enables rapid, accurate tail latency estimation for large networks without relying on machine learning training.
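The decomposition idea summarized above can be illustrated with a minimal sketch. This is not the authors' implementation: the per-link delay model (exponential delays with mean load / (1 - load)), the independent resampling of per-link delays along a path, and every function name here (`simulate_link`, `estimate_path_delays`, `estimate_p99`) are assumptions made for illustration only. The sketch shows the general shape of the approach: simulate each link on its own, run those simulations in parallel, then combine per-link results into an end-to-end tail estimate.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def simulate_link(link_id, load, n_samples=2000, seed=0):
    """Toy stand-in for one independent link simulation (hypothetical model).

    Returns sampled delays in arbitrary units, drawn from an exponential
    distribution whose mean grows as load / (1 - load).
    """
    rng = random.Random(seed * 7919 + link_id)  # distinct stream per link
    mean_delay = load / (1.0 - load)
    return [rng.expovariate(1.0 / mean_delay) for _ in range(n_samples)]


def estimate_path_delays(path, link_samples, n_samples=2000, seed=1):
    """Combine per-link results: sum one sampled delay per link on the path.

    Assumes per-link delays are independent, which is a simplification.
    """
    rng = random.Random(seed)
    return [
        sum(rng.choice(link_samples[link]) for link in path)
        for _ in range(n_samples)
    ]


def percentile(values, q):
    """Empirical percentile by rank; q is a fraction such as 0.99."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[idx]


def estimate_p99(link_loads, path, workers=4):
    """Run the per-link simulations concurrently, then estimate path p99."""
    links = sorted(link_loads)
    # Threads keep the sketch simple; CPU-bound simulations would more
    # likely use separate processes or machines.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda l: simulate_link(l, link_loads[l]), links))
    link_samples = dict(zip(links, results))
    return percentile(estimate_path_delays(path, link_samples), 0.99)


if __name__ == "__main__":
    loads = {0: 0.3, 1: 0.5, 2: 0.7}  # hypothetical link utilizations
    print(estimate_p99(loads, path=[0, 1, 2]))
```

Because each link simulation depends only on its own inputs, the work is embarrassingly parallel, which is the property that makes this kind of decomposition attractive for large topologies.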

Findings

01. Estimates run in 1-2 minutes, compared to hours for traditional simulators.

02. Estimates of 99th-percentile flow completion time are accurate to within 9%.

03. Applicable to general traffic matrices and topologies.

Abstract

In this paper, we consider how to provide fast estimates of flow-level tail latency performance for very large-scale data center networks. Network tail latency is often a crucial metric for cloud application performance that can be affected by a wide variety of factors, including network load, inter-rack traffic skew, traffic burstiness, flow size distributions, oversubscription, and topology asymmetry. Network simulators such as ns-3 and OMNeT++ can provide accurate answers, but are very hard to parallelize, taking hours or days to answer what-if questions for a single configuration at even moderate scale. Recent work with MimicNet has shown how to use machine learning to improve simulation performance, but at the cost of a long training step per configuration, and with assumptions about workload and topology uniformity that typically do not hold in practice. We address this…

Taxonomy

Topics: Cloud Computing and Resource Management · Software-Defined Networks and 5G · Age of Information Optimization