Enhancing reliability in AI inference services: An empirical study on real production incidents
Bhala Ranganathan, Mickey Zhang, Kai Wu

TL;DR
This paper presents an empirical analysis of large language model inference incidents in production, identifying failure modes and mitigation strategies to improve reliability and automation in cloud-based AI services.
Contribution
It introduces a taxonomy and methodology for analyzing inference incidents, validated on real data, and provides practical strategies and a checklist for practitioners to enhance system robustness.
Findings
~60% of incidents in the dataset were inference engine failures
~40% of those inference engine failures were timeouts
~74% of incidents were auto-detected; ~28% required a hotfix
Abstract
Hyperscale large language model (LLM) inference places extraordinary demands on cloud systems, where even brief failures can translate into significant user and business impact. To better understand and mitigate these risks, we present one of the first provider-internal, practice-based analyses of LLM inference incidents. We developed a taxonomy and methodology grounded in a year of operational experience, validated it on 156 high-severity incidents, and conducted a focused quantitative study of Apr-Jun 2025 to ensure recency and relevance. Our approach achieves high labeling consistency (Cohen's κ ~0.89), identifies dominant failure modes (in our dataset ~60% inference engine failures, within that category ~40% timeouts), and surfaces mitigation levers (~74% auto-detected; ~28% required hotfix). Beyond hotfixes, many incidents were mitigated via traffic routing, node rebalancing, or…
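The labeling-consistency figure in the abstract is a Cohen's kappa, which corrects the raw agreement between two labelers for the agreement expected by chance. The sketch below shows the standard calculation only. The function name, the category names, and the six example labels are hypothetical and are not taken from the paper's dataset or taxonomy.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: sum over categories of the product of each rater's
    # marginal frequency for that category.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    categories = count_a.keys() | count_b.keys()
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical category; agreement is trivial.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical incident labels from two independent reviewers.
rater_a = ["engine", "engine", "infra", "config", "engine", "infra"]
rater_b = ["engine", "engine", "infra", "engine", "engine", "infra"]

kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.2f}")
```

In this toy example the reviewers agree on 5 of 6 items, but kappa is lower than the raw agreement rate because part of that agreement would be expected by chance.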
Taxonomy
Topics: Software System Performance and Reliability · Cloud Computing and Resource Management · Machine Learning in Materials Science
