On the Burstiness of Distributed Machine Learning Traffic
Natchanon Luangsomboon, Fahimeh Fazel, Jörg Liebeherr, Ashkan Sobhani, Shichao Guan, Xingjun Chu

TL;DR
This paper investigates the high short-term burstiness of distributed machine learning traffic in data centers, revealing significant challenges for congestion control and flow management.
Contribution
It introduces metrics to quantify burstiness and provides empirical analysis of ResNet-50 training traffic, highlighting its impact on network performance.
Findings
Distributed ML traffic shows peak-to-mean ratios over 60:1 at 5 ms intervals.
Training software manages transmissions to prevent congestion despite high burstiness.
Extrapolation indicates significant challenges for congestion control in multi-application scenarios.
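The peak-to-mean ratio in the first finding can be illustrated with a small sketch. This is not the paper's own metric definition, which this summary does not reproduce. It assumes a simple reading: split a packet trace into fixed-length windows, sum the bytes per window, and divide the largest window volume by the average window volume. The function names, the microsecond timestamps, and the synthetic on-off trace are all hypothetical and chosen only for illustration.

```python
def peak_to_mean_ratio(packets, window_us, duration_us):
    """Peak-to-mean ratio of traffic volume over fixed windows.

    packets:     iterable of (timestamp_us, size_bytes) pairs (assumed format)
    window_us:   window length in microseconds, e.g. 5_000 for 5 ms
    duration_us: total trace duration in microseconds
    """
    if window_us <= 0 or duration_us <= 0:
        raise ValueError("window_us and duration_us must be positive")
    # Ceiling division; a partial last window is counted as a full window.
    n_windows = -(-duration_us // window_us)
    volume = [0] * n_windows
    for t_us, size in packets:
        if not 0 <= t_us < duration_us:
            raise ValueError("timestamp outside the trace duration")
        volume[t_us // window_us] += size
    total = sum(volume)
    if total == 0:
        return 0.0  # an empty trace has no meaningful ratio
    return max(volume) / (total / n_windows)


def synthetic_on_off_trace():
    """Hypothetical 1 s trace: every 100 ms, 100 packets of 1500 bytes
    are sent back to back within 1 ms, followed by silence."""
    packets = []
    for burst in range(10):
        start_us = burst * 100_000
        for i in range(100):
            packets.append((start_us + i * 10, 1500))
    return packets


if __name__ == "__main__":
    trace = synthetic_on_off_trace()
    for window_us in (1_000, 5_000, 100_000):
        ratio = peak_to_mean_ratio(trace, window_us, 1_000_000)
        print(f"window {window_us / 1000:g} ms: peak-to-mean {ratio:g}:1")
```

On this synthetic trace the ratio is 100:1 at 1 ms windows, 20:1 at 5 ms, and 1:1 at 100 ms. The same traffic looks smooth or extremely bursty depending on the time scale, which is why a burstiness figure is only meaningful together with its interval length.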
Abstract
Traffic from distributed training of machine learning (ML) models makes up a large and growing fraction of the traffic mix in enterprise data centers. While work on distributed ML abounds, the network traffic generated by distributed ML has received little attention. Using measurements on a testbed network, we investigate the traffic characteristics generated by the training of the ResNet-50 neural network with an emphasis on studying its short-term burstiness. For the latter we propose metrics that quantify traffic burstiness at different time scales. Our analysis reveals that distributed ML traffic exhibits a very high degree of burstiness on short time scales, exceeding a 60:1 peak-to-mean ratio on time intervals as long as 5 ms. We observe that training software orchestrates transmissions in such a way that burst transmissions from different sources within the same application do…
Taxonomy
Topics: Advanced Memory and Neural Computing · Advancements in Semiconductor Devices and Circuit Design · Neural Networks and Applications
