Run-time Failure Detection via Non-intrusive Event Analysis in a Large-Scale Cloud Computing Platform
Domenico Cotroneo, Luigi De Simone, Pietro Liguori, Roberto Natella

TL;DR
This paper presents a non-intrusive run-time failure detection method for large-scale cloud systems that uses event analysis to identify silent failures more effectively than existing logging approaches.
Contribution
It introduces a non-intrusive event tracing approach that builds lightweight monitoring rules from fault-free executions, without manual changes to the system's internals, improving failure detection in cloud platforms.
Findings
Detects failures with an F1 score of 0.85, outperforming existing mechanisms.
Reduces failure detection time to approximately 114 seconds.
Achieves higher accuracy (0.77) than traditional logging (0.50).
Abstract
Cloud computing systems fail in complex and unforeseen ways due to unexpected combinations of events and interactions among hardware and software components. These failures are especially problematic when they are silent, i.e., not accompanied by any explicit failure notification, hindering the timely detection and recovery. In this work, we propose an approach to run-time failure detection tailored for monitoring multi-tenant and concurrent cloud computing systems. The approach uses a non-intrusive form of event tracing, without manual changes to the system's internals to propagate session identifiers (IDs), and builds a set of lightweight monitoring rules from fault-free executions. We evaluated the effectiveness of the approach in detecting failures in the context of the OpenStack cloud computing platform, a complex and "off-the-shelf" distributed system, by executing a campaign of…
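The idea of building monitoring rules from fault-free executions can be illustrated with a minimal sketch. This is not the authors' implementation: the event names, the rule form ("every occurrence of event A is later followed by event B"), and the functions `followed_by`, `learn_rules`, and `violations` are all assumptions made for illustration. It also assumes each trace already belongs to a single request, so it sidesteps the multi-tenant, concurrent setting that the paper targets.

```python
def followed_by(trace):
    """Map each event a to the set of events that appear after EVERY occurrence of a."""
    result = {}
    for i, a in enumerate(trace):
        later = set(trace[i + 1:]) - {a}
        result[a] = later if a not in result else result[a] & later
    return result


def learn_rules(fault_free_traces):
    """Keep only the (a, b) pairs that hold in every fault-free trace containing a."""
    rules = {}
    for trace in fault_free_traces:
        for a, followers in followed_by(trace).items():
            rules[a] = followers if a not in rules else rules[a] & followers
    return {(a, b) for a, followers in rules.items() for b in followers}


def violations(trace, rules):
    """Return the learned rules that the observed trace breaks, in sorted order."""
    observed = followed_by(trace)
    return sorted((a, b) for a, b in rules if a in observed and b not in observed[a])


# Hypothetical event names; real traces would come from the tracing layer.
fault_free = [
    ["api.create_server", "scheduler.select_host", "compute.spawn",
     "api.server_active"],
    ["api.create_server", "scheduler.select_host", "compute.spawn",
     "network.allocate", "api.server_active"],
]
rules = learn_rules(fault_free)

# A silent failure: the request stalls and no error event is ever emitted.
faulty = ["api.create_server", "scheduler.select_host", "compute.spawn"]
for a, b in violations(faulty, rules):
    print(f"rule violated: {a} should be followed by {b}")
```

In this toy setting the stalled request is flagged because the expected completion event never arrives, even though nothing was logged as an error. A real monitor would also need a timeout to decide when an expected event counts as missing, which this sketch leaves out.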
