TraceMesh: Scalable and Streaming Sampling for Distributed Traces
Zhuangbin Chen, Zhihan Jiang, Yuxin Su, Michael R. Lyu, Zibin Zheng

TL;DR
TraceMesh introduces a scalable, streaming trace sampling method that uses Locality-Sensitive Hashing (LSH) and dynamic clustering to improve efficiency and accuracy in distributed tracing systems, especially for high-dimensional, evolving trace data.
Contribution
The paper presents TraceMesh, a sampling approach that handles high-dimensional, dynamic trace data using LSH and evolving clustering, and reports that it outperforms existing sampling methods.
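To make the combination of LSH and evolving clustering concrete, the following is a minimal sketch and not the authors' implementation. It assumes a random-hyperplane (SimHash-style) projection, a Hamming-distance threshold for cluster assignment, and a keep probability of 1/cluster size; the class names, the `radius` parameter, and the feature encoding (`"svc:a"` and so on) are all hypothetical choices made for illustration.

```python
import random


def hamming(a, b):
    """Number of positions at which two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


class SimHashProjector:
    """Random-hyperplane LSH over sparse feature dicts (illustrative assumption).

    Each feature name gets a row of +/-1 weights derived from a seeded RNG the
    first time it is seen, so previously unseen features need no fixed vocabulary.
    """

    def __init__(self, num_bits=16, seed=0):
        self.num_bits = num_bits
        self.seed = seed
        self._weights = {}

    def _row(self, feature):
        if feature not in self._weights:
            rng = random.Random(f"{self.seed}:{feature}")
            self._weights[feature] = [
                rng.choice((-1.0, 1.0)) for _ in range(self.num_bits)
            ]
        return self._weights[feature]

    def signature(self, features):
        """Project a {feature: value} dict to a fixed-length bit tuple."""
        sums = [0.0] * self.num_bits
        for name, value in features.items():
            for i, w in enumerate(self._row(name)):
                sums[i] += w * value
        return tuple(1 if s >= 0 else 0 for s in sums)


class StreamingSampler:
    """Threshold-based streaming clustering with size-biased sampling.

    A trace joins the nearest cluster within `radius` (Hamming distance) or
    starts a new one. The first trace of a cluster is always kept; later
    traces are kept with probability 1 / cluster size (hypothetical policy).
    """

    def __init__(self, projector, radius=3, seed=0):
        self.projector = projector
        self.radius = radius
        self.clusters = []  # each entry: [representative signature, count]
        self._rng = random.Random(seed)

    def observe(self, features):
        """Return True if the trace should be sampled (kept)."""
        sig = self.projector.signature(features)
        best = None
        for cluster in self.clusters:
            d = hamming(sig, cluster[0])
            if d <= self.radius and (best is None or d < best[0]):
                best = (d, cluster)
        if best is None:
            self.clusters.append([sig, 1])
            return True
        cluster = best[1]
        cluster[1] += 1
        return self._rng.random() < 1.0 / cluster[1]
```

In this sketch, calling `observe` on each incoming trace's feature dict yields a keep/drop decision in one pass, and traces that fall into large, already well-represented clusters are kept less often than traces that open a new cluster.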
Findings
Outperforms state-of-the-art sampling methods in accuracy
Achieves higher efficiency in trace sampling
Effectively handles unseen trace features
Abstract
Distributed tracing serves as a fundamental element in the monitoring of cloud-based and datacenter systems. It provides visibility into the full lifecycle of a request or operation across multiple services, which is essential for understanding system dependencies and performance bottlenecks. To mitigate computational and storage overheads, most tracing frameworks adopt a uniform sampling strategy, which inevitably captures overlapping and redundant information. More advanced methods employ learning-based approaches to bias the sampling toward more informative traces. However, existing methods fall short of considering the high-dimensional and dynamic nature of trace data, which is essential for the production deployment of trace sampling. To address these practical challenges, in this paper we present TraceMesh, a scalable and streaming sampler for distributed traces. TraceMesh employs…
Taxonomy
Topics: Data Stream Mining Techniques · Scientific Computing and Data Management
