On the Diagnosis of Flaky Job Failures: Understanding and Prioritizing Failure Categories

Henri Aïdasso; Francis Bordeleau; Ali Tizghadam

arXiv:2501.04976·cs.SE·August 26, 2025

TL;DR

This paper analyzes 4,511 flaky job failures at TELUS, a telecommunications company, groups them into categories, and prioritizes those categories using RFM (Recency, Frequency, Monetary) analysis, with the aim of improving automated diagnosis and repair.

Contribution

The paper introduces a novel RFM-based approach to categorizing and prioritizing flaky job failures, taking into account how each category evolves over time and what impact it has, to guide future diagnosis work.
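To make the idea of RFM-based prioritization concrete, here is a minimal sketch in Python. It is not the authors' implementation: the `Failure` record, the `cost` field (a stand-in for the Monetary dimension, for example wasted machine minutes), the rank-based scoring, and the unweighted sum are all assumptions chosen for illustration. The summary above does not say how the paper defines or combines the three measures.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Failure:
    category: str  # hypothetical failure category label
    day: date      # date on which the flaky failure occurred
    cost: float    # hypothetical cost proxy (assumption, not from the paper)


def rfm_table(failures, today):
    """Return {category: (recency_days, frequency, monetary)}."""
    table = {}
    for f in failures:
        age = (today - f.day).days
        rec, freq, mon = table.get(f.category, (age, 0, 0.0))
        # Recency is the age of the most recent failure in the category.
        table[f.category] = (min(rec, age), freq + 1, mon + f.cost)
    return table


def rank_score(values, higher_is_better=True):
    """Map each key to a rank score in 1..n, where n is the top priority."""
    ordered = sorted(values, key=values.get, reverse=not higher_is_better)
    return {key: i + 1 for i, key in enumerate(ordered)}


def prioritize(failures, today):
    """Order categories from highest to lowest combined RFM rank score."""
    table = rfm_table(failures, today)
    # A smaller recency (a more recent failure) should score higher.
    r = rank_score({c: v[0] for c, v in table.items()}, higher_is_better=False)
    f = rank_score({c: v[1] for c, v in table.items()})
    m = rank_score({c: v[2] for c, v in table.items()})
    total = {c: r[c] + f[c] + m[c] for c in table}
    return sorted(total, key=total.get, reverse=True)
```

Under these assumptions, a category that failed recently, fails often, and wastes the most resources rises to the top of the list. Ties are broken by insertion order, and a real analysis would likely bin the scores (for example into quantiles) rather than use raw ranks.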

Findings

1. Identified 46 flaky failure categories in TELUS jobs.
2. Prioritized 14 categories for future automated diagnosis.
3. Provided insights into failure category evolution and impact.

Abstract

The continuous delivery of modern software requires the execution of many automated pipeline jobs. These jobs ensure the frequent release of new software versions while detecting code problems at an early stage. For TELUS, our industrial partner in the telecommunications field, reliable job execution is crucial to minimize wasted time and streamline Continuous Deployment (CD). In this context, flaky job failures are one of the main issues hindering CD. Prior studies proposed techniques based on machine learning to automate the detection of flaky jobs. While valuable, these solutions are insufficient to address the waste associated with the diagnosis of flaky failures, which remain largely unexplored due to the wide range of underlying causes. This study examines 4,511 flaky job failures at TELUS to identify the different categories of flaky failures that we prioritize based on Recency,…

Taxonomy

Topics: Reliability and Maintenance Optimization