Workload Buoyancy: Keeping Apps Afloat by Identifying Shared Resource Bottlenecks
Oliver Larsson (1), Thijs Metsch (2), Cristian Klein (1), Erik Elmroth (1) ((1) Umeå University, (2) Unaffiliated)

TL;DR
This paper introduces buoyancy, a new abstraction that combines application and system metrics to better identify shared resource bottlenecks in multi-tenant, heterogeneous computing environments, improving workload performance analysis.
Contribution
The paper presents buoyancy, a novel holistic metric that captures resource contention and performance headroom, enabling more effective workload orchestration in complex multi-tenant systems.
Findings
Buoyancy improves bottleneck detection accuracy by 19.3% over traditional heuristics.
It provides a holistic view of resource contention and performance headroom.
Buoyancy can replace conventional metrics to enhance system observability.
Abstract
Modern multi-tenant, hardware-heterogeneous computing environments pose significant challenges for effective workload orchestration. Simple heuristics for assessing workload performance, such as CPU utilization or application-level metrics, are often insufficient to capture the complex performance dynamics arising from resource contention and noisy-neighbor effects. In such environments, performance bottlenecks may emerge in any shared system resource, leading to unexpected and difficult-to-diagnose degradation. This paper introduces buoyancy, a novel abstraction for characterizing workload performance in multi-tenant systems. Unlike traditional approaches, buoyancy integrates application-level metrics with system-level insights of shared resource contention to provide a holistic view of performance dynamics. By explicitly capturing bottlenecks and headroom across multiple resources,…
Taxonomy
Topics: Software System Performance and Reliability · Cloud Computing and Resource Management · Distributed and Parallel Computing Systems
