RAPID-LLM: Resilience-Aware Performance analysis of Infrastructure for Distributed LLM Training and Inference
George Karfakis, Faraz Tahmasebi, Binglu Chen, Lime Yao, Saptarshi Mitra, Tianyue Pan, Hyoukjun Kwon, Puneet Gupta

TL;DR
RAPID-LLM is a comprehensive framework that models the performance of large language model training and inference on GPU clusters, accounting for hardware details, network congestion, and faults, enabling efficient system analysis.
Contribution
It introduces a unified, detailed performance modeling framework combining operator-level latency estimation with network simulation, supporting fault analysis and design exploration for distributed LLM workloads.
Findings
Predicts Llama inference latency within 10.4",
matches ns-3 results within 8",
enables fast configuration sweeps and fault sensitivity analysis.
Abstract
RAPID-LLM is a unified performance modeling framework for large language model (LLM) training and inference on GPU clusters. It couples a DeepFlow-based frontend that generates hardware-aware, operator-level Chakra execution traces from an abstract LLM specification (model shape, batch/sequence settings, training vs. inference, and hybrid parallelism choices) with an extended Astra-Sim backend that executes those traces on explicit multi-dimensional network topologies with congestion-aware routing and support for degraded and faulty links. The frontend assigns per-operator latency using a tile-based model that accounts for SM under-utilization and multi-level memory traffic (SRAM/ L2/ HBM), and prunes memory-infeasible configurations using an activation-liveness traversal under recomputation, parallelism and ZeRO/FDSP sharding policies. Across A100-based validation cases, RAPID-LLM…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsParallel Computing and Optimization Techniques · Software-Defined Networks and 5G · Software System Performance and Reliability
