Beyond Microservices: Testing Web-Scale RCA Methods on GPU-Driven LLM Workloads

Dominik Scheinert; Alexander Acker; Thorsten Wittkopp; Soeren Becker; Hamza Yous; Karnakar Reddy; Ibrahim Farhat; Hakim Hacid; Odej Kao

arXiv:2603.02057·cs.DC·March 3, 2026

Open Access

TL;DR

This paper evaluates existing root cause analysis methods on GPU-driven large language model workloads, revealing their limitations and proposing tailored analysis techniques for improved failure diagnosis in complex LLM deployments.

Contribution

The study systematically assesses 24 RCA methods on LLM workloads, highlighting the need for specialized techniques and providing guidelines for better observability and fault localization.

Findings

01 Multi-source RCA approaches are most accurate.

02 Metric-based methods vary with fault type.

03 Trace-based methods largely fail on LLM workloads.

Abstract

Large language model (LLM) services have become an integral part of search, assistance, and decision-making applications. However, unlike traditional web or microservices, the hardware and software stack enabling LLM inference deployment is of higher complexity and far less field-tested, making it more susceptible to failures that are difficult to resolve. Keeping outage costs and quality-of-service degradations in check depends on shortening mean time to repair, which in practice is gated by how quickly the fault is identified, located, and diagnosed. Automated root cause analysis (RCA) accelerates failure localization by identifying the system component that failed and tracing how the failure propagated. Numerous RCA methods have been developed for traditional services, using request path tracing, resource metrics, and log data analysis. Yet, existing RCA methods have not been designed…

Taxonomy

Topics: Software System Performance and Reliability · Software Testing and Debugging Techniques · Software Engineering Research