PCRAFT: Capacity Planning for Dependable Stateless Services
Rasha Faqeh, André Martin, Valerio Schiavoni, Pramod Bhatotia, Pascal Felber, Christof Fetzer

TL;DR
PCRAFT is a system that combines empirical measurements and probabilistic models to optimize capacity planning for dependable stateless services, balancing availability, performance, and cost across different deployment schemes.
Contribution
It introduces a hybrid approach for capacity planning that integrates empirical data with probabilistic modeling to minimize resource use while satisfying availability constraints.
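To make the idea concrete, here is a minimal sketch of how a measured per-node throughput and a probabilistic availability model could be combined into a node count. It is not PCRAFT's actual model: it assumes nodes fail independently with a single up-probability, and the function names (`availability`, `min_nodes`) and all parameters are hypothetical, chosen only for illustration.

```python
from math import ceil, comb


def availability(n: int, k: int, p_up: float) -> float:
    """Probability that at least k of n nodes are up.

    Assumption: nodes fail independently, each up with probability p_up
    (a plain binomial model, not the model used in the paper).
    """
    return sum(
        comb(n, i) * p_up**i * (1 - p_up) ** (n - i)
        for i in range(k, n + 1)
    )


def min_nodes(
    peak_load: float,
    per_node_throughput: float,
    p_up: float,
    target: float,
    max_nodes: int = 1000,
) -> int:
    """Smallest n such that enough nodes to carry peak_load are up
    with probability >= target.

    per_node_throughput stands in for an empirical measurement;
    p_up stands in for a value estimated from fault injection.
    """
    # Nodes needed for performance alone (the empirical part).
    k = ceil(peak_load / per_node_throughput)
    # Add replicas until the availability target is met (the probabilistic part).
    for n in range(k, max_nodes + 1):
        if availability(n, k, p_up) >= target:
            return n
    raise ValueError("target not reachable within max_nodes")


# 1000 req/s peak, 300 req/s per node, each node up 99% of the time,
# target availability 99.9%.
print(min_nodes(1000, 300, 0.99, 0.999))  # → 5
```

In this example four nodes are enough for the load, but all four are up only about 96% of the time, so one spare node is needed to reach the 99.9% target.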
Findings
Cloud deployments need fewer nodes than on-premises deployments.
In on-premises deployments, passive failover requires fewer nodes than active route anywhere.
Additional integrity mechanisms improve quality but increase resource requirements.
Abstract
Fault-tolerance techniques depend on replication to enhance availability, albeit at the cost of increased infrastructure costs. This results in a fundamental trade-off: Fault-tolerant services must satisfy given availability and performance constraints while minimising the number of replicated resources. These constraints pose capacity planning challenges for the service operators to minimise replication costs without negatively impacting availability. To this end, we present PCRAFT, a system to enable capacity planning of dependable services. PCRAFT's capacity planning is based on a hybrid approach that combines empirical performance measurements with probabilistic modelling of availability based on fault injection. In particular, we integrate traditional service-level availability mechanisms (active route anywhere and passive failover) and deployment schemes (cloud and on-premises)…
Taxonomy
Topics: Distributed systems and fault tolerance · Cloud Computing and Resource Management · Software System Performance and Reliability
