eBPF-Based Instrumentation for Generalisable Diagnosis of Performance Degradation
Diogo Landau, Jorge Barbosa, Nishant Saurabh

TL;DR
This paper introduces a set of 16 eBPF-based system metrics for fine-grained, application-agnostic diagnosis of performance degradation in data-intensive applications, enabling detailed root cause analysis.
Contribution
The paper presents a novel, comprehensive set of eBPF metrics spanning multiple kernel subsystems for diagnosing QoS degradation in Online Data Intensive applications.
Findings
The metrics effectively identify the causes of performance degradation.
The metrics reveal constraints in an application's internal architecture.
The approach works across various workload and contention scenarios.
Abstract
Online Data Intensive applications (e.g., message brokers, ML inference and databases) are core components of the modern internet, providing critical functionalities to connecting services. The load variability and interference they experience are generally the main causes of Quality of Service (QoS) degradation, harming dependent applications and resulting in an impaired end-user experience. Uncovering the cause of QoS degradation requires detailed instrumentation of an application's activity. Existing generalisable approaches utilise readily available system metrics that encode interference in kernel metrics, but unfortunately, these approaches lack the required detail to pinpoint granular causes of performance degradation (e.g., lock, disk and CPU contention). In contrast, this paper explores the use of fine-grained system-level metrics to facilitate an application-agnostic diagnosis…
Taxonomy
Topics: Software System Performance and Reliability · Cloud Computing and Resource Management · Distributed systems and fault tolerance
