YTrace: End-to-end Performance Diagnosis in Large Cloud and Content Providers
Partha Kanuparthy, Yuchen Dai, Sudhir Pathak, Sambit Samal, Theophilus Benson, Mojgan Ghasemi, P. P. S. Narayan

TL;DR
YTrace is a system developed at Yahoo for real-time, end-to-end diagnosis of performance issues across cloud and content delivery layers, improving user experience by identifying root causes efficiently.
Contribution
The paper introduces YTrace, a comprehensive system for diagnosing performance problems in large-scale cloud and content provider environments, combining causality discovery and near real-time analysis.
Findings
Successfully diagnosed performance issues in production environments
Enabled end-to-end causality analysis across multiple layers
Improved detection and localization of root causes in large-scale systems
Abstract
Content providers build serving stacks to deliver content to users. An important goal of a content provider is to ensure good user experience, since user experience has an impact on revenue. In this paper, we describe a system at Yahoo called YTrace that diagnoses bad user experience in near real time. We present the different components of YTrace for end-to-end multi-layer diagnosis (instrumentation, methods and backend system), and the system architecture for delivering diagnosis in near real time across all user sessions at Yahoo. YTrace diagnoses problems across service and network layers in the end-to-end path spanning user host, Internet, CDN and the datacenters, and has three diagnosis goals: detection, localization and root cause analysis (including cascading problems) of performance problems in user sessions with the cloud. The key component of the methods in YTrace is…
Taxonomy
Topics: Software System Performance and Reliability · Cloud Computing and Resource Management · Software-Defined Networks and 5G
