Enhancing reliability in AI inference services: An empirical study on real production incidents

Bhala Ranganathan; Mickey Zhang; Kai Wu

arXiv:2511.07424·cs.DC·November 12, 2025

Open Access

TL;DR

This paper presents an empirical analysis of large language model inference incidents in production, identifying failure modes and mitigation strategies to improve reliability and automation in cloud-based AI services.

Contribution

It introduces a taxonomy and methodology for analyzing inference incidents, validated on real data, and provides practical strategies and a checklist for practitioners to enhance system robustness.

Findings

01. About 60% of incidents were inference engine failures

02. About 40% of inference engine failures were timeouts

03. About 74% of incidents were auto-detected; about 28% required a hotfix

Abstract

Hyperscale large language model (LLM) inference places extraordinary demands on cloud systems, where even brief failures can translate into significant user and business impact. To better understand and mitigate these risks, we present one of the first provider-internal, practice-based analyses of LLM inference incidents. We developed a taxonomy and methodology grounded in a year of operational experience, validated it on 156 high-severity incidents, and conducted a focused quantitative study of April to June 2025 to ensure recency and relevance. Our approach achieves high labeling consistency (Cohen's κ ~0.89), identifies dominant failure modes (in our dataset, ~60% inference engine failures and, within that category, ~40% timeouts), and surfaces mitigation levers (~74% auto-detected; ~28% required hotfix). Beyond hotfixes, many incidents were mitigated via traffic routing, node rebalancing, or…
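
The abstract reports labeling consistency as Cohen's kappa, which measures agreement between two raters after discounting the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate and p_e is the chance agreement rate implied by each rater's label frequencies. The following is a minimal sketch of that calculation, not the authors' tooling. The function name, the category names, and the sample labels are all hypothetical and do not come from the paper's dataset.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)

    # p_o: fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # p_e: agreement expected by chance, from each rater's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical incident labels from two raters (illustrative only).
rater_1 = ["engine", "engine", "infra", "config", "engine", "infra"]
rater_2 = ["engine", "engine", "infra", "engine", "engine", "infra"]
print(f"kappa = {cohens_kappa(rater_1, rater_2):.2f}")
```

Values near 1 indicate agreement well beyond chance, which is how the reported ~0.89 should be read. For production use, an established implementation such as scikit-learn's `cohen_kappa_score` is likely a better choice than hand-rolled code.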

Taxonomy

Topics: Software System Performance and Reliability · Cloud Computing and Resource Management · Machine Learning in Materials Science