Beyond Microservices: Testing Web-Scale RCA Methods on GPU-Driven LLM Workloads
Dominik Scheinert, Alexander Acker, Thorsten Wittkopp, Soeren Becker, Hamza Yous, Karnakar Reddy, Ibrahim Farhat, Hakim Hacid, Odej Kao

TL;DR
This paper evaluates existing root cause analysis methods on GPU-driven large language model workloads, revealing their limitations and proposing tailored analysis techniques for improved failure diagnosis in complex LLM deployments.
Contribution
The study systematically assesses 24 RCA methods on LLM workloads, highlighting the need for specialized techniques and providing guidelines for better observability and fault localization.
Findings
Multi-source RCA approaches are most accurate.
Metric-based methods vary with fault type.
Trace-based methods largely fail on LLM workloads.
Abstract
Large language model (LLM) services have become an integral part of search, assistance, and decision-making applications. However, unlike traditional web or microservices, the hardware and software stack enabling LLM inference deployment is of higher complexity and far less field-tested, making it more susceptible to failures that are difficult to resolve. Keeping outage costs and quality of service degradations in check depends on shortening mean time to repair, which in practice is gated by how quickly the fault is identified, located, and diagnosed. Automated root cause analysis (RCA) accelerates failure localization by identifying the system component that failed and tracing how the failure propagated. Numerous RCA methods have been developed for traditional services, using request path tracing, resource metric and log data analysis. Yet, existing RCA methods have not been designed…
Taxonomy
Topics: Software System Performance and Reliability · Software Testing and Debugging Techniques · Software Engineering Research
