SMART: A Surrogate Model for Predicting Application Runtime in Dragonfly Systems
Xin Wang, Pietro Lodi Rizzini, Sourav Medya, Zhiling Lan

TL;DR
This paper introduces SMART, a surrogate model that combines GNNs and LLMs to accurately predict application runtime in Dragonfly networks, enabling efficient hybrid simulation and analysis of workload interference.
Contribution
The paper presents SMART, a surrogate model that integrates graph neural networks and large language models to predict application runtime in Dragonfly systems from port-level router data, outperforming existing statistical and machine learning baselines.
Findings
SMART outperforms existing statistical and machine learning baselines in runtime prediction accuracy.
It enables efficient hybrid simulation of Dragonfly networks.
It supports real-time workload interference analysis.
Abstract
The Dragonfly network, with its high-radix and low-diameter structure, is a leading interconnect in high-performance computing. A major challenge is workload interference on shared network links. Parallel discrete event simulation (PDES) is commonly used to analyze workload interference. However, high-fidelity PDES is computationally expensive, making it impractical for large-scale or real-time scenarios. Hybrid simulation that incorporates data-driven surrogate models offers a promising alternative, especially for forecasting application runtime, a task complicated by the dynamic behavior of network traffic. We present SMART, a surrogate model that combines graph neural networks (GNNs) and large language models (LLMs) to capture both spatial and temporal patterns from port-level router data. SMART outperforms existing statistical and machine learning baselines, enabling…
Taxonomy
Topics: Software-Defined Networks and 5G · Interconnection Networks and Systems · Software System Performance and Reliability
