AIReSim: A Discrete Event Simulator for Large-scale AI Cluster Reliability Modeling
Karthik Pattabiraman, Mihir Patel, Fred Lin

TL;DR
AIReSim is a discrete event simulation tool designed to evaluate and optimize reliability mechanisms in large-scale AI clusters, aiding system design and capacity planning.
Contribution
The paper introduces AIReSim, a novel simulator that enables systematic evaluation of failure management parameters in AI clusters, facilitating better reliability and resource planning.
Findings
AIReSim can identify which knobs and parameters matter for end-to-end system reliability.
The simulator helps tune those knobs for different reliability tradeoffs.
A case study demonstrates capacity planning with AIReSim.
Abstract
Failures in clusters running large-scale AI workloads can result in decreased utilization. Because the cost of a failure in such AI workloads is high (it requires restarting the entire job from a previous checkpoint), many mechanisms are in place to mitigate failures and minimize their impact. However, these mechanisms have many knobs and parameters, all of which must be carefully tuned to the characteristics of the system and cluster. We built AIReSim, a discrete event simulator, to evaluate the different design choices in the failure, recovery, scheduling, and repair processes of a cluster running a large-scale AI workload. AIReSim allows the system designer to systematically evaluate the effects of the different knobs and parameters on the overall end-to-end reliability of the system. Further, AIReSim can be used to identify which…
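The restart-from-checkpoint cost described in the abstract can be illustrated with a toy next-event simulation. This is a minimal sketch and not AIReSim's code or API: the function name `simulate_job`, all of its parameters, and the modeling choices (exponentially distributed failures, a fixed recovery time, a checkpoint written at the end of every interval, no failures during recovery) are assumptions made for illustration only.

```python
import random


def simulate_job(work_hours, num_servers, mtbf_hours, checkpoint_interval,
                 recovery_hours, checkpoint_cost=0.0, seed=0):
    """Toy model of one job; returns (wall_clock_hours, num_failures).

    Assumptions (hypothetical, not taken from the paper):
    - any single server failure stops the whole job;
    - per-server failures are exponential with mean mtbf_hours;
    - a failure discards all work since the last completed checkpoint;
    - no failures occur while the job is recovering.
    """
    rng = random.Random(seed)
    failure_rate = num_servers / mtbf_hours  # cluster-wide failures per hour
    now = 0.0        # simulated wall-clock time
    saved = 0.0      # useful work persisted by the last checkpoint
    failures = 0
    next_failure = rng.expovariate(failure_rate)

    while saved < work_hours:
        # Next "checkpoint done" event if no failure intervenes.
        segment = min(checkpoint_interval, work_hours - saved)
        segment_end = now + segment + checkpoint_cost
        if next_failure < segment_end:
            # Failure event: unsaved progress is lost, then the job recovers.
            failures += 1
            now = next_failure + recovery_hours
            next_failure = now + rng.expovariate(failure_rate)
        else:
            # Checkpoint event: the segment's work is now durable.
            now = segment_end
            saved += segment
    return now, failures


# Sweep one knob (checkpoint interval) and report utilization.
for interval in (0.25, 1.0, 4.0):
    wall, failures = simulate_job(200.0, 1000, 10000.0, interval, 0.5,
                                  checkpoint_cost=0.05, seed=42)
    print(f"interval={interval}h failures={failures} "
          f"utilization={200.0 / wall:.3f}")
```

Sweeping a single parameter this way shows the kind of tradeoff a simulator can expose: short intervals pay more checkpoint overhead, while long intervals lose more work per failure. A single seeded run is noisy, so a real study would average over many runs.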
