An Online Probabilistic Distributed Tracing System
M. Toslali, S. Qasim, S. Parthasarathy, F. A. Oliveira, H. Huang, G. Stringhini, Z. Liu, A. K. Coskun

TL;DR
This paper introduces Astraea, an online probabilistic distributed tracing system that uses Bayesian learning and multi-armed bandits to reduce tracing overhead while maintaining diagnostic accuracy in cloud environments.
Contribution
Astraea is the first system to adaptively steer distributed tracing instrumentation using probabilistic models, significantly lowering overhead without sacrificing diagnostic utility.
Findings
Localizes performance variations using only 10-28% of available instrumentation
Decreases storage and compute costs substantially
Maintains high accuracy in performance diagnosis
Abstract
Distributed tracing has become a fundamental tool for diagnosing performance issues in the cloud by recording causally ordered, end-to-end workflows of request executions. However, tracing in production workloads can introduce significant overheads due to the extensive instrumentation needed for identifying performance variations. This paper addresses the trade-off between the cost of tracing and the utility of the "spans" within that trace through Astraea, an online probabilistic distributed tracing system. Astraea is based on our technique that combines online Bayesian learning and multi-armed bandit frameworks. This formulation enables Astraea to effectively steer tracing towards the useful instrumentation needed for accurate performance diagnosis. Astraea localizes performance variations using only 10-28% of available instrumentation, markedly reducing tracing overhead, storage,…
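To make the combination of online Bayesian learning and multi-armed bandits concrete, here is a minimal sketch of Thompson sampling with Beta-Bernoulli posteriors over instrumentation points. It is a generic textbook bandit, not Astraea's actual algorithm. The span names, the binary "useful" reward, the `budget` parameter, and the `simulate` helper are all hypothetical choices made for illustration.

```python
import random


class BetaBernoulliSampler:
    """Thompson sampling over hypothetical instrumentation points (spans)."""

    def __init__(self, span_names, seed=None):
        self.rng = random.Random(seed)
        # Beta(1, 1) prior per span: [alpha, beta], where alpha counts
        # "useful" observations and beta counts "not useful" ones.
        self.posteriors = {name: [1.0, 1.0] for name in span_names}

    def choose(self, budget):
        """Sample each posterior once and enable the `budget` highest draws."""
        draws = {
            name: self.rng.betavariate(a, b)
            for name, (a, b) in self.posteriors.items()
        }
        return sorted(draws, key=draws.get, reverse=True)[:budget]

    def update(self, name, useful):
        """Bayesian update: increment alpha on success, beta on failure."""
        if useful:
            self.posteriors[name][0] += 1.0
        else:
            self.posteriors[name][1] += 1.0

    def mean(self, name):
        """Posterior mean estimate of how often this span is useful."""
        a, b = self.posteriors[name]
        return a / (a + b)


def simulate(true_rates, rounds, budget, seed=0):
    """Run the sampler against assumed (made-up) usefulness rates.

    Returns how many times each span was enabled.
    """
    sampler = BetaBernoulliSampler(true_rates, seed=seed)
    env = random.Random(seed + 1)  # separate RNG for the simulated workload
    counts = {name: 0 for name in true_rates}
    for _ in range(rounds):
        for name in sampler.choose(budget):
            useful = env.random() < true_rates[name]
            sampler.update(name, useful)
            counts[name] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical spans and rates; only one span is usually informative.
    rates = {"db_query": 0.9, "cache_lookup": 0.1, "auth_check": 0.1}
    print(simulate(rates, rounds=2000, budget=1))
```

In this sketch the sampler spends most of its budget on the span that most often explains a performance variation, while the randomness of the posterior draws keeps it occasionally re-checking the others. That is the general intuition behind steering tracing toward useful instrumentation.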
Taxonomy
Topics: Privacy-Preserving Technologies in Data
