An Empirical Characterization of Outages and Incidents in Public Services for Large Language Models

Xiaoyu Chu; Sacheendra Talluri; Qingxian Lu; Alexandru Iosup

arXiv:2501.12469·cs.PF·March 18, 2025

TL;DR

This paper empirically analyzes outages and failure-recovery patterns in public large language model (LLM) services, comparing providers and identifying weekly and monthly periodicity in failures, to inform system design and usage.

Contribution

It provides the first comprehensive empirical characterization of outages and failure-recovery in major public LLM services, with detailed statistical analysis and publicly available datasets.

Findings

01 Failures in OpenAI's ChatGPT occur less frequently but take longer to resolve than those in Anthropic's Claude.

02 Service failures show strong weekly and monthly periodicity.

03 OpenAI services have better failure isolation than Anthropic services.

Abstract

People and businesses increasingly rely on public LLM services, such as ChatGPT, DALL·E, and Claude. Understanding their outages, and particularly measuring their failure-recovery processes, is becoming a pressing problem. However, only limited studies exist in this emerging area. Addressing this problem, in this work we conduct an empirical characterization of outages and failure-recovery in public LLM services. We collect and prepare datasets for 8 commonly used LLM services across 3 major LLM providers, including market leaders OpenAI and Anthropic. We conduct a detailed analysis of the statistical properties of failure recovery, temporal patterns, co-occurrence, and the impact range of outage-causing incidents. We make over 10 observations, among which: (1) Failures in OpenAI's ChatGPT take longer to resolve but occur less frequently than those in Anthropic's Claude; (2) OpenAI and Anthropic…
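To make the two quantities in observation (1) concrete, the following is a minimal sketch of how time-to-resolve and time-between-failures could be computed from incident records. It is an illustration only, not the authors' method: the `incidents` list, its timestamps, and both function names are hypothetical, and the time between failures is assumed here to be measured from one incident's start to the next incident's start.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records as (start, resolved) pairs.
# These values are made up for illustration and are not from the paper's datasets.
incidents = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 11, 0)),
    (datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 10, 30)),
    (datetime(2024, 1, 8, 10, 0), datetime(2024, 1, 8, 12, 0)),
]


def mean_time_to_resolve(records):
    """Mean incident duration in hours (resolution time minus start time)."""
    return mean((end - start).total_seconds() / 3600 for start, end in records)


def mean_time_between_failures(records):
    """Mean gap in hours between consecutive incident start times.

    Assumption: gaps are measured start-to-start; other definitions
    (for example, previous resolution to next start) are also in use.
    """
    starts = sorted(start for start, _ in records)
    gaps = [(b - a).total_seconds() / 3600 for a, b in zip(starts, starts[1:])]
    return mean(gaps)


if __name__ == "__main__":
    print(f"Mean time to resolve: {mean_time_to_resolve(incidents):.2f} h")
    print(f"Mean time between failures: {mean_time_between_failures(incidents):.2f} h")
```

Under these definitions, a service with a higher mean time between failures fails less often, and a service with a higher mean time to resolve takes longer to recover, which is the trade-off observation (1) describes.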

Code & Models

Repositories

atlarge-research/llm-service-analysis (official)

Taxonomy

Topics: AI and HR Technologies