Long-term Monitoring of Kernel and Hardware Events to Understand Latency Variance
Fang Zhou, Yuyang Huang, Miao Yu, Sixiang Ma, Tongping Liu, Yang Wang

TL;DR
This paper introduces VarMRI, a long-term monitoring tool that analyzes kernel and hardware events to understand latency variance in applications, revealing diverse causes and potential optimizations over extensive study periods.
Contribution
We developed VarMRI, a selective, efficient long-term monitoring system for kernel and hardware events, enabling detailed analysis of latency variance causes across diverse applications.
Findings
Latency variance is caused by multiple hardware and kernel events.
Simple tuning can reduce tail latencies by up to 31%.
Long-term monitoring reveals significant variability in event impacts.
Abstract
This paper presents our experience in understanding latency variance caused by kernel and hardware events, which are often invisible at the application level. For this purpose, we have built VarMRI, a tool chain to monitor and analyze those events over the long term. To mitigate the "big data" problem caused by long-term monitoring, VarMRI selectively records a subset of events following two principles: it records only events that affect the requests recorded by the application; and it records coarse-grained information first, recording additional information only when necessary. Furthermore, VarMRI introduces an analysis method that is efficient on large amounts of data, robust across different data sets and against missing data, and informative to the user. VarMRI has helped us to carry out a 3,000-hour study of six applications and benchmarks on CloudLab. It reveals a wide variety of…
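The two recording principles in the abstract can be illustrated with a minimal sketch. This is not VarMRI's implementation: the `Event`, `Request`, and `Record` types, the `record` function, the event kind names, and the `detail_threshold_us` cutoff are all hypothetical, chosen only to show the idea of keeping events that overlap a recorded request and storing per-event detail only when the coarse summary suggests it is needed.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    kind: str       # hypothetical event kind, e.g. "irq" or "sched"
    start_us: int
    end_us: int


@dataclass(frozen=True)
class Request:
    req_id: int
    start_us: int
    end_us: int


@dataclass
class Record:
    # Coarse-grained summary: overlapped time per event kind.
    coarse_us: dict = field(default_factory=lambda: defaultdict(int))
    # Fine-grained detail, kept only when the coarse total is large.
    detail: list = field(default_factory=list)


def overlap_us(ev, req):
    """Microseconds during which the event and the request overlap."""
    return max(0, min(ev.end_us, req.end_us) - max(ev.start_us, req.start_us))


def record(requests, events, detail_threshold_us=100):
    """Build one Record per request (threshold is an assumed cutoff)."""
    records = {}
    for req in requests:
        rec = Record()
        hits = []
        for ev in events:
            o = overlap_us(ev, req)
            if o > 0:
                # Principle 1: keep only events that affect a recorded request.
                rec.coarse_us[ev.kind] += o
                hits.append(ev)
        # Principle 2: start coarse, add detail only when it seems necessary.
        if sum(rec.coarse_us.values()) >= detail_threshold_us:
            rec.detail = hits
        records[req.req_id] = rec
    return records
```

In this sketch, an event that falls entirely outside every request interval is never stored, and a request that was barely disturbed keeps only its per-kind totals. A real system would make these decisions online and per CPU rather than over in-memory lists.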
Taxonomy
Topics: Software System Performance and Reliability · Cloud Computing and Resource Management · Parallel Computing and Optimization Techniques
