Griffon: Reasoning about Job Anomalies with Unlabeled Data in Cloud-based Platforms
Liqun Shao, Yiwen Zhu, Abhiram Eswaran, Kristin Lieber, Janhavi Mahajan, Minsoo Thigpen, Sudhir Darbha, Siqi Liu, Subru Krishnan, Soundar Srinivasan, Carlo Curino, Konstantinos Karanasos

TL;DR
Griffon is a system that automatically identifies causes of job slowdowns in cloud platforms by predicting runtimes and analyzing feature contributions, reducing manual effort and improving accuracy.
Contribution
This work introduces Griffon, a novel approach that uses regression and feature importance to diagnose job slowdowns without requiring labeled data.
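The idea of pairing a runtime regressor with per-feature contributions can be illustrated with a small sketch. This is not Griffon's implementation: it substitutes an ordinary least-squares model for whatever regressor the system uses, runs on synthetic data, and the feature names (input_gb, queue_wait_s, failed_tasks) and helper functions are hypothetical. For a linear model, the predicted runtime difference between a slow run and a baseline run splits exactly into one term per feature, and the largest term points at the likely cause.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit of runtime ~ X @ weights + intercept."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1], coef[-1]

def rank_contributions(weights, baseline, slow, names):
    """Split the predicted slowdown into per-feature terms, largest first."""
    deltas = weights * (slow - baseline)
    order = np.argsort(-np.abs(deltas))
    return [(names[i], float(deltas[i])) for i in order]

# Hypothetical job features; the history below is synthetic, not real telemetry.
names = ["input_gb", "queue_wait_s", "failed_tasks"]
rng = np.random.default_rng(0)
X = rng.uniform([10, 0, 0], [100, 60, 5], size=(200, 3))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + 30.0 * X[:, 2] + 50.0  # runtime in seconds

weights, intercept = fit_linear(X, y)

# Baseline: a typical historical run. Slow run: same job with 4 extra failed tasks.
baseline = np.median(X, axis=0)
slow = baseline.copy()
slow[2] += 4

ranking = rank_contributions(weights, baseline, slow, names)
for name, seconds in ranking:
    print(f"{name}: {seconds:+.1f} s")
```

No labeled incidents are needed here: the model is trained only on past runs and their runtimes, and the ranking comes from comparing two runs of the same job. A non-linear regressor would need a different attribution method, which this sketch does not cover.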
Findings
Accurately identifies root causes consistent with expert validation.
Reduces investigation time compared to manual analysis.
Successfully deployed in a large-scale production environment.
Abstract
Microsoft's internal big data analytics platform is comprised of hundreds of thousands of machines, serving over half a million jobs daily, from thousands of users. The majority of these jobs are recurring and are crucial for the company's operation. Although administrators spend significant effort tuning system performance, some jobs inevitably experience slowdowns, i.e., their execution time degrades over previous runs. Currently, the investigation of such slowdowns is a labor-intensive and error-prone process, which costs Microsoft significant human and machine resources, and negatively impacts several lines of businesses. In this work, we present Griffon, a system we built and have deployed in production last year to automatically discover the root cause of job slowdowns. Existing solutions either rely on labeled data (i.e., resolved incidents with labeled reasons for job…
