A Deep Dive into the Google Cluster Workload Traces: Analyzing the Application Failure Characteristics and User Behaviors
Faisal Haque Bappy, Tariqul Islam, Tarannum Shaila Zaman, Raiful Hasan, Carlos Caicedo

TL;DR
This paper analyzes Google's large-scale cloud workload traces to understand failure patterns and user behaviors, aiming to improve failure prediction and resource utilization in data centers.
Contribution
It provides a comprehensive analysis of failure characteristics and user heterogeneity in the Google Cluster traces and offers insights for building early failure prediction systems.
Findings
Failed jobs have identifiable resource and scheduling patterns.
Certain users dominate job submission and failure events.
Insights can inform dynamic rescheduling to reduce failures.
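The per-user tally behind a finding like "certain users dominate job submission and failure events" can be sketched as below. This is a minimal illustration on made-up records, not the authors' code: the field names (`user`, `final_state`), the state labels, and the helper `per_user_failure_share` are all assumptions and do not reflect the actual trace schema.

```python
from collections import Counter

# Hypothetical job records; field names and values are illustrative only
# and are NOT taken from the real Google Cluster Trace schema.
JOBS = [
    {"user": "u1", "final_state": "FAIL"},
    {"user": "u1", "final_state": "FINISH"},
    {"user": "u1", "final_state": "FAIL"},
    {"user": "u2", "final_state": "KILL"},
    {"user": "u2", "final_state": "FINISH"},
    {"user": "u3", "final_state": "FINISH"},
]


def per_user_failure_share(jobs, bad_states=("FAIL", "KILL")):
    """Return {user: (submitted, unsuccessful, share_of_all_unsuccessful)}."""
    # Count every submitted job per user.
    submitted = Counter(job["user"] for job in jobs)
    # Count only jobs that ended in one of the unsuccessful states.
    unsuccessful = Counter(
        job["user"] for job in jobs if job["final_state"] in bad_states
    )
    total_bad = sum(unsuccessful.values())
    return {
        user: (
            count,
            unsuccessful[user],
            unsuccessful[user] / total_bad if total_bad else 0.0,
        )
        for user, count in submitted.items()
    }


if __name__ == "__main__":
    for user, (n, bad, share) in sorted(per_user_failure_share(JOBS).items()):
        print(f"{user}: submitted={n} unsuccessful={bad} share={share:.2f}")
```

Sorting the result by the share column would surface the users who account for most unsuccessful jobs. Treating failed and killed jobs together is a simplification made here for brevity.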
Abstract
Large-scale cloud data centers have gained popularity due to their high availability, rapid elasticity, scalability, and low cost. However, current data centers continue to have high failure rates due to the lack of proper resource utilization and early failure detection. To maximize resource efficiency and reduce failure rates in large-scale cloud data centers, it is crucial to understand the workload and failure characteristics. In this paper, we perform a deep analysis of the 2019 Google Cluster Trace Dataset, which contains 2.4TiB of workload traces from eight different clusters around the world. We explore the characteristics of failed and killed jobs in Google's production cloud and attempt to correlate them with key attributes such as resource usage, job priority, scheduling class, job duration, and the number of task resubmissions. Our analysis reveals several important…
Taxonomy
Topics: Cloud Computing and Resource Management · IoT and Edge/Fog Computing
