HyGra: Accelerating Network-State Simulation for LLM Training in DCNs via Adaptive Packet-Flow Granularity
Wenyi Wang, Zheng Wu, Yanmeng Wang, Haolin Mao, Lei Han, Gaogang Xie, Fu Xiao

TL;DR
HyGra is a hybrid-granularity network simulator that adaptively switches between packet-level and flow-level simulation to efficiently and accurately emulate data center network dynamics during large language model training.
Contribution
It introduces HyGra, a novel adaptive simulation approach that significantly accelerates network-state simulation for LLM training without sacrificing fidelity.
Findings
Achieves up to 15.4x speedup in simulations.
Maintains high accuracy during LLM workload emulation.
Supports existing simulators without specialized hardware.
Abstract
In recent years, large language models (LLMs) have driven substantial intelligent transformation across diverse industries. Commercial LLM training is typically performed over data center networks (DCNs) comprising hundreds to thousands of GPUs, with multiple devices collocated per node. As network scale expands, inter-node communication becomes a primary bottleneck to training efficiency. Network-state simulators therefore play a crucial role by enabling cost-effective evaluation of network configurations and parallelization strategies through faithful emulation of DCN dynamics during LLM training. However, existing simulators are constrained by a efficiency-fidelity tradeoff, as packet-level simulators (PLSs) incur prohibitive runtime overhead, whereas flow-level simulators (FLSs) compromise essential modeling accuracy. In this paper, we develop \texttt{HyGra}, a hybrid-granularity…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSoftware-Defined Networks and 5G · Cloud Computing and Resource Management · Parallel Computing and Optimization Techniques
