A Statistical Approach to Performance Monitoring in Soft Real-Time Distributed Systems
Danny Bickson, Gidon Gershinsky, Ezra N. Hoch, Konstantin Shagin

TL;DR
This paper introduces a novel statistical and machine learning-based framework for real-time performance monitoring and root-cause analysis in soft real-time distributed systems, improving latency management and system reliability.
Contribution
It presents a general, scalable approach using recent distributed algorithms for online performance monitoring and root-cause detection, applicable beyond soft real-time systems.
Findings
Successfully implemented in IBM's TransFab messaging fabric
Effectively identifies causes of latency violations
Resolves resource allocation issues in real time
Abstract
Soft real-time applications require timely delivery of messages conforming to soft real-time constraints. Satisfying such requirements is a complex task, both because of the volatile nature of distributed environments and because of numerous domain-specific factors that affect message latency. Prompt detection of the root cause of excessive message delay allows a distributed system to react accordingly. This may significantly improve compliance with the required timeliness constraints. In this work, we present a novel approach for distributed performance monitoring of soft real-time distributed systems. We propose to employ recent distributed algorithms from the statistical signal processing and learning domains, and to utilize them in a different context of online performance monitoring and root-cause analysis, for pinpointing the reasons for violation of performance requirements.…
Taxonomy
Topics: Real-Time Systems Scheduling · Distributed systems and fault tolerance · Distributed and Parallel Computing Systems
