RAPID-LLM: Resilience-Aware Performance analysis of Infrastructure for Distributed LLM Training and Inference

George Karfakis; Faraz Tahmasebi; Binglu Chen; Lime Yao; Saptarshi Mitra; Tianyue Pan; Hyoukjun Kwon; Puneet Gupta

arXiv:2512.19606 · cs.PF · December 23, 2025

TL;DR

RAPID-LLM is a framework that models the performance of large language model training and inference on GPU clusters. It accounts for hardware details, network congestion, and link faults, which makes configuration sweeps and fault-sensitivity analysis fast.

Contribution

It introduces a unified, detailed performance modeling framework combining operator-level latency estimation with network simulation, supporting fault analysis and design exploration for distributed LLM workloads.

Findings

01. Predicts Llama inference latency within 10.4%.

02. Matches ns-3 results within 8%.

03. Enables fast configuration sweeps and fault sensitivity analysis.

Abstract

RAPID-LLM is a unified performance modeling framework for large language model (LLM) training and inference on GPU clusters. It couples a DeepFlow-based frontend that generates hardware-aware, operator-level Chakra execution traces from an abstract LLM specification (model shape, batch/sequence settings, training vs. inference, and hybrid parallelism choices) with an extended Astra-Sim backend that executes those traces on explicit multi-dimensional network topologies with congestion-aware routing and support for degraded and faulty links. The frontend assigns per-operator latency using a tile-based model that accounts for SM under-utilization and multi-level memory traffic (SRAM/L2/HBM), and prunes memory-infeasible configurations using an activation-liveness traversal under recomputation, parallelism, and ZeRO/FSDP sharding policies. Across A100-based validation cases, RAPID-LLM…
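To make the idea of a tile-based latency model with SM under-utilization concrete, here is a minimal Python sketch. It is not RAPID-LLM's actual model: the `Gpu` class, the function names, the fixed 128x128 output tile, the single-level HBM traffic estimate, and the roughly A100-like hardware numbers are all illustrative assumptions. It shows only the generic wave-quantization effect: output tiles are scheduled onto SMs in waves, so a partially filled last wave costs as much as a full one.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    """Hypothetical device description (illustrative values, not from the paper)."""
    num_sms: int        # number of streaming multiprocessors
    peak_flops: float   # device-wide peak FLOP/s
    hbm_bw: float       # HBM bandwidth in bytes/s


def tile_count(m: int, n: int, tile_m: int = 128, tile_n: int = 128) -> int:
    """Number of output tiles needed to cover an m x n result."""
    return math.ceil(m / tile_m) * math.ceil(n / tile_n)


def sm_utilization(m: int, n: int, gpu: Gpu,
                   tile_m: int = 128, tile_n: int = 128) -> float:
    """Fraction of SM slots doing useful work across all waves."""
    tiles = tile_count(m, n, tile_m, tile_n)
    waves = math.ceil(tiles / gpu.num_sms)
    return tiles / (waves * gpu.num_sms)


def gemm_latency(m: int, n: int, k: int, gpu: Gpu,
                 tile_m: int = 128, tile_n: int = 128,
                 bytes_per_elem: int = 2) -> float:
    """Estimated GEMM time in seconds: max of wave-quantized compute and HBM time."""
    tiles = tile_count(m, n, tile_m, tile_n)
    waves = math.ceil(tiles / gpu.num_sms)

    # Each SM computes one full tile per wave at its share of peak throughput.
    flops_per_tile = 2 * tile_m * tile_n * k
    per_sm_flops = gpu.peak_flops / gpu.num_sms
    compute_time = waves * flops_per_tile / per_sm_flops

    # Simplification: A and B are read once and C is written once from HBM.
    # A multi-level model would also track SRAM and L2 traffic.
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    memory_time = bytes_moved / gpu.hbm_bw

    return max(compute_time, memory_time)


# Roughly A100-like numbers, used here only for illustration.
a100_like = Gpu(num_sms=108, peak_flops=312e12, hbm_bw=1.555e12)

full_wave = gemm_latency(1536, 1152, 4096, a100_like)   # 12 x 9 = 108 tiles
spill_wave = gemm_latency(1536, 1153, 4096, a100_like)  # 12 x 10 = 120 tiles
```

With these assumed numbers, widening the output by a single column raises the tile count from 108 to 120, which needs a second wave on 108 SMs. The modeled compute time therefore doubles while SM utilization falls from 1.0 to 120/216. A peak-FLOPs roofline estimate would miss this kind of shape sensitivity.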

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Software-Defined Networks and 5G · Software System Performance and Reliability