HyGra: Accelerating Network-State Simulation for LLM Training in DCNs via Adaptive Packet-Flow Granularity

Wenyi Wang; Zheng Wu; Yanmeng Wang; Haolin Mao; Lei Han; Gaogang Xie; Fu Xiao

arXiv:2603.12671 · cs.NI · March 20, 2026

Open Access

TL;DR

HyGra is a hybrid-granularity network simulator that adaptively switches between packet-level and flow-level simulation to efficiently and accurately emulate data center network dynamics during large language model training.

Contribution

The paper introduces HyGra, an adaptive simulation approach that significantly accelerates network-state simulation for LLM training without sacrificing fidelity.

Findings

01. Achieves up to 15.4x speedup in simulations.

02. Maintains high accuracy during LLM workload emulation.

03. Supports existing simulators without specialized hardware.

Abstract

In recent years, large language models (LLMs) have driven substantial intelligent transformation across diverse industries. Commercial LLM training is typically performed over data center networks (DCNs) comprising hundreds to thousands of GPUs, with multiple devices collocated per node. As network scale expands, inter-node communication becomes a primary bottleneck to training efficiency. Network-state simulators therefore play a crucial role by enabling cost-effective evaluation of network configurations and parallelization strategies through faithful emulation of DCN dynamics during LLM training. However, existing simulators are constrained by an efficiency-fidelity tradeoff: packet-level simulators (PLSs) incur prohibitive runtime overhead, whereas flow-level simulators (FLSs) compromise essential modeling accuracy. In this paper, we develop HyGra, a hybrid-granularity…

Taxonomy

Topics: Software-Defined Networks and 5G · Cloud Computing and Resource Management · Parallel Computing and Optimization Techniques