On the Diagnosis of Flaky Job Failures: Understanding and Prioritizing Failure Categories
Henri Aïdasso, Francis Bordeleau, Ali Tizghadam

TL;DR
This paper analyzes thousands of flaky job failures in a telecommunications setting, categorizing them and prioritizing the categories with RFM (Recency, Frequency, Monetary) analysis to improve automated diagnosis and repair strategies.
Contribution
It introduces a novel RFM-based approach to categorize and prioritize flaky failure types, focusing on their evolution and impact for better diagnosis.
Findings
Identified 46 flaky failure categories in TELUS jobs.
Prioritized 14 categories for future automated diagnosis.
Provided insights into failure category evolution and impact.
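To make the RFM-based prioritization above concrete, here is a minimal sketch in Python. It is not the authors' implementation: the category names, the use of wasted minutes as the "monetary" cost proxy, the rank-based scoring, and the equal weighting of the three dimensions are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Failure:
    category: str        # hypothetical failure category label
    day: date            # date the flaky failure occurred
    cost_minutes: float  # assumed cost proxy (e.g. wasted pipeline time)


def rfm_table(failures, today):
    """Return {category: (recency_days, frequency, monetary)}."""
    table = {}
    for f in failures:
        rec, freq, mon = table.get(f.category, (None, 0, 0.0))
        days = (today - f.day).days
        rec = days if rec is None else min(rec, days)  # most recent occurrence
        table[f.category] = (rec, freq + 1, mon + f.cost_minutes)
    return table


def scores(values, higher_is_worse=True):
    """Map each distinct value to a rank score; a higher score means higher priority."""
    ordered = sorted(set(values), reverse=not higher_is_worse)
    return {v: i + 1 for i, v in enumerate(ordered)}


def prioritize(failures, today):
    """Rank categories by the sum of their R, F and M scores (equal weights assumed)."""
    table = rfm_table(failures, today)
    r = scores([v[0] for v in table.values()], higher_is_worse=False)  # fewer days = more recent
    f = scores([v[1] for v in table.values()])
    m = scores([v[2] for v in table.values()])
    total = {c: r[v[0]] + f[v[1]] + m[v[2]] for c, v in table.items()}
    return sorted(total, key=total.get, reverse=True)


# Illustrative data only; not taken from the study.
SAMPLE = [
    Failure("network_timeout", date(2024, 3, 10), 30),
    Failure("network_timeout", date(2024, 3, 14), 20),
    Failure("runner_oom", date(2024, 2, 1), 90),
    Failure("registry_unavailable", date(2024, 3, 1), 10),
]

ranking = prioritize(SAMPLE, today=date(2024, 3, 15))
print(ranking)
```

In this toy data the recent, repeated category ranks first even though a single older failure cost more, which is the kind of trade-off an RFM score is meant to surface.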
Abstract
The continuous delivery of modern software requires the execution of many automated pipeline jobs. These jobs ensure the frequent release of new software versions while detecting code problems at an early stage. For TELUS, our industrial partner in the telecommunications field, reliable job execution is crucial to minimize wasted time and streamline Continuous Deployment (CD). In this context, flaky job failures are one of the main issues hindering CD. Prior studies proposed techniques based on machine learning to automate the detection of flaky jobs. While valuable, these solutions are insufficient to address the waste associated with the diagnosis of flaky failures, which remain largely unexplored due to the wide range of underlying causes. This study examines 4,511 flaky job failures at TELUS to identify the different categories of flaky failures that we prioritize based on Recency,…
Taxonomy
Topics: Reliability and Maintenance Optimization
