Understanding and Detecting Flaky Builds in GitHub Actions
Wenhao Ge, Chen Zhang

TL;DR
This paper conducts a large-scale empirical study of flaky builds in GitHub Actions, identifies common failure causes, and proposes a machine learning method that improves the F1-score of flaky build detection by up to 20.3% over the baseline.
Contribution
It provides the first extensive analysis of flaky builds in GitHub Actions and introduces a novel ML-based detection approach with enhanced performance.
Findings
3.2% of builds are rerun, and 67.73% of those rerun builds are flaky
Identified 15 categories of flaky failures, with tests, network, and dependencies most common
ML approach improves F1-score by up to 20.3% over baseline
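The first finding combines two ratios with different denominators: rerun builds over all builds, and flaky builds over rerun builds. The sketch below shows how such ratios could be computed from rerun records. It is not the authors' pipeline. The `Attempt` record, its field names, and the rule that a rerun build is flaky when its attempts disagree on the outcome are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    """One execution attempt of a build (hypothetical record shape)."""
    run_id: int       # identifies the build; reruns share the same id
    attempt: int      # 1 for the original run, 2 or more for reruns
    conclusion: str   # e.g. "success" or "failure"


def rerun_and_flaky_rates(attempts):
    """Return (share of builds that were rerun, share of rerun builds that are flaky)."""
    by_run = defaultdict(list)
    for a in attempts:
        by_run[a.run_id].append(a)

    # A build counts as rerun when it has more than one attempt.
    rerun = [group for group in by_run.values() if len(group) > 1]
    # Assumption: a rerun build is flaky if its attempts disagree on the outcome.
    flaky = [group for group in rerun if len({a.conclusion for a in group}) > 1]

    rerun_rate = len(rerun) / len(by_run) if by_run else 0.0
    flaky_rate = len(flaky) / len(rerun) if rerun else 0.0
    return rerun_rate, flaky_rate


# Made-up sample data, not taken from the study.
sample = [
    Attempt(1, 1, "success"),
    Attempt(2, 1, "failure"), Attempt(2, 2, "success"),  # outcome changed on rerun
    Attempt(3, 1, "failure"), Attempt(3, 2, "failure"),  # rerun, same outcome
    Attempt(4, 1, "success"),
]
rerun_rate, flaky_rate = rerun_and_flaky_rates(sample)
print(f"rerun: {rerun_rate:.2%}, flaky among rerun: {flaky_rate:.2%}")
# prints "rerun: 50.00%, flaky among rerun: 50.00%"
```

Keeping the two denominators separate matters when reading the figures: 67.73% refers only to the 3.2% of builds that were rerun, not to all builds.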
Abstract
Continuous Integration (CI) is widely used to provide rapid feedback on code changes; however, CI build outcomes are not always reliable. Builds may fail intermittently due to non-deterministic factors, leading to flaky builds that undermine developers' trust in CI, waste computational resources, and threaten the validity of CI-related empirical studies. In this paper, we present a large-scale empirical study of flaky builds in GitHub Actions based on rerun data from 1,960 open-source Java projects. Our results show that 3.2% of builds are rerun, and 67.73% of these rerun builds exhibit flaky behavior, affecting 1,055 (51.28%) of the projects. Through an in-depth failure analysis, we identify 15 distinct categories of flaky failures, among which flaky tests, network issues, and dependency resolution issues are the most prevalent. Building on these findings, we propose a machine…
Taxonomy
Topics: Software Engineering Research · Software Testing and Debugging Techniques · Software Engineering Techniques and Practices
