FlowTracer: A Tool for Uncovering Network Path Usage Imbalance in AI Training Clusters
Hasibul Jamil, Abdul Alim, Laurent Schares, Pavlos Maniotis, Liran Schour, Ali Sydney, Abdullah Kayi, Tevfik Kosar, Bengi Karacali

TL;DR
FlowTracer is a diagnostic tool that analyzes network path usage imbalance in AI training clusters, helping to optimize routing and reduce congestion in distributed systems.
Contribution
FlowTracer provides detailed flow-level visibility into network utilization, enabling identification and mitigation of hash collision-induced imbalances in AI training clusters.
Findings
FlowTracer detects network flow imbalances effectively.
Using FlowTracer, a 30% reduction in imbalance was achieved.
The tool aids in optimizing routing strategies for distributed AI workloads.
Abstract
The increasing complexity of AI workloads, especially distributed Large Language Model (LLM) training, places significant strain on the networking infrastructure of parallel data centers and supercomputing systems. While Equal-Cost Multi-Path (ECMP) routing distributes traffic over parallel paths, hash collisions often lead to imbalanced network resource utilization and performance bottlenecks. This paper presents FlowTracer, a tool designed to analyze network path utilization and evaluate different routing strategies. FlowTracer aids in debugging network inefficiencies by providing detailed visibility into traffic distribution and helping to identify the root causes of performance degradation, such as issues caused by hash collisions. By offering flow-level insights, FlowTracer enables system operators to optimize routing, reduce congestion, and improve the performance of distributed…
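To make the hash-collision problem in the abstract concrete, the following is a minimal sketch of how ECMP-style hashing can place several flows on the same path while leaving others idle. It is not FlowTracer's implementation: the function names (`ecmp_path`, `path_usage`, `imbalance`), the use of CRC32 as a stand-in for a switch's hash function, the example flows, and the max-over-mean imbalance ratio are all assumptions chosen for illustration.

```python
import zlib
from collections import Counter

def ecmp_path(flow, num_paths):
    """Pick a path index by hashing the flow's 5-tuple.

    CRC32 is only a stand-in here; real switches use vendor-specific hashes.
    """
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % num_paths

def path_usage(flows, num_paths):
    """Count how many flows land on each path, including unused paths."""
    counts = Counter(ecmp_path(flow, num_paths) for flow in flows)
    return [counts.get(path, 0) for path in range(num_paths)]

def imbalance(usage):
    """Ratio of the busiest path's load to the mean load (1.0 = balanced).

    This metric is an assumption for illustration, not the paper's definition.
    """
    mean = sum(usage) / len(usage)
    return max(usage) / mean if mean else 0.0

# Hypothetical flows: (src_ip, dst_ip, src_port, dst_port, protocol)
flows = [("10.0.0.1", "10.0.1.1", 50000 + i, 4791, "UDP") for i in range(8)]
usage = path_usage(flows, num_paths=4)
print("flows per path:", usage)
print("imbalance (max/mean):", imbalance(usage))
```

Because each flow is pinned to a single path by its hash, a small number of long-lived, high-bandwidth flows (typical of collective communication in training jobs) can easily collide, which is the kind of per-path skew a flow-level view is meant to expose.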
Taxonomy
Topics: Software System Performance and Reliability · Online Learning and Analytics · Traffic Prediction and Management Techniques
