Availability Analysis of Redundant and Replicated Cloud Services with Bayesian Networks
Otto Bibartiu (1), Frank Dürr (1), Kurt Rothermel (1), Beate Ottenwälder (2), Andreas Grau (2) ((1) University of Stuttgart, (2) Robert Bosch GmbH)

TL;DR
This paper presents a Bayesian network-based approach to assess the availability of large-scale redundant and replicated cloud services, considering infrastructure and communication failures simultaneously, with an automatic modeling formalism.
Contribution
It introduces a high-level formalism for automatic Bayesian network modeling of complex cloud service availability, addressing cascading and common-cause failures.
Findings
Feasibility demonstrated through performance evaluations
Applicable to large-scale, redundant, and replicated cloud services
Can be extended to local and geo-distributed systems
Abstract
Due to the growing complexity of modern data centers, failures are no longer uncommon. Therefore, fault tolerance mechanisms play a vital role in fulfilling availability requirements. Multiple availability models have been proposed to assess compute systems, among which Bayesian network models have gained popularity in industry and research due to their powerful modeling formalism. In particular, this work focuses on assessing the availability of redundant and replicated cloud computing services with Bayesian networks. So far, research on availability has focused on modeling either infrastructure or communication failures in Bayesian networks, but has not considered both simultaneously. This work addresses practical modeling challenges of assessing the availability of large-scale redundant and replicated services with Bayesian networks, including cascading and common-cause…
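To make the idea of a common-cause failure concrete, here is a minimal sketch that is not taken from the paper. It assumes a hypothetical service with two replicas, each on its own host behind its own network link, where both hosts depend on one shared power supply. All probabilities, names, and the network structure are illustrative assumptions. The sketch computes availability by brute-force enumeration over a tiny Bayesian network and compares it with a naive estimate that treats the two replicas as independent.

```python
from itertools import product

# Hypothetical probabilities, chosen for illustration only (not from the paper).
P_POWER_UP = 0.99  # shared power supply: the common cause
P_HOST_UP = 0.95   # each host is up, given that power is up
P_LINK_UP = 0.98   # each host's network link is up (communication failure)


def prob(p_up, is_up):
    """Probability of a binary node being in state `is_up`."""
    return p_up if is_up else 1.0 - p_up


def service_availability():
    """Exact availability by enumerating all states of the small network.

    Nodes: power -> host1, power -> host2 (a host cannot be up without power),
    plus two independent link nodes. The service is available if at least one
    replica has both its host and its link up.
    """
    total = 0.0
    for power, h1, h2, l1, l2 in product([True, False], repeat=5):
        host_up = P_HOST_UP if power else 0.0  # conditional table of a host
        p = (prob(P_POWER_UP, power)
             * prob(host_up, h1) * prob(host_up, h2)
             * prob(P_LINK_UP, l1) * prob(P_LINK_UP, l2))
        if (h1 and l1) or (h2 and l2):
            total += p
    return total


def naive_availability():
    """Estimate that wrongly treats the two replicas as fully independent."""
    replica_up = P_POWER_UP * P_HOST_UP * P_LINK_UP
    return 1.0 - (1.0 - replica_up) ** 2


if __name__ == "__main__":
    print("with common cause:", service_availability())
    print("naive independent:", naive_availability())
```

Under these assumed numbers the naive estimate comes out higher than the exact one, because a power outage takes down both replicas at once. Enumeration grows exponentially with the number of nodes, so it only serves as an illustration at this toy scale.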
Taxonomy
Topics: Software System Performance and Reliability · Software Reliability and Analysis Research · Cloud Computing and Resource Management
