TL;DR
This study empirically investigates flaky tests in Python, reporting their prevalence, their causes, and the number of reruns needed to detect flakiness reliably, thus extending understanding beyond Java-based research.
Contribution
It provides the first large-scale empirical analysis of flaky tests in Python, identifying key causes like order dependency and infrastructure issues, and quantifying reruns needed for detection.
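To make the notion of order dependency concrete, the following is a minimal sketch of two checks that share module-level state, so that one of them passes or fails depending on which runs first. All names here (`_cache`, `check_writes_cache`, `check_expects_empty_cache`, `run_in_order`) are hypothetical and chosen for illustration; this is not the study's tooling or dataset.

```python
# Hypothetical shared state that one check pollutes and the other relies on.
_cache = {}


def check_writes_cache():
    # The "polluter": leaves an entry behind in the shared cache.
    _cache["user"] = "alice"
    assert _cache["user"] == "alice"


def check_expects_empty_cache():
    # The "victim": passes only if it runs before the polluter.
    assert "user" not in _cache


def run_in_order(checks):
    """Run the checks in the given order from a clean state; return name -> outcome."""
    _cache.clear()  # start each ordering from the same initial state
    results = {}
    for check in checks:
        try:
            check()
            results[check.__name__] = "pass"
        except AssertionError:
            results[check.__name__] = "fail"
    return results


forward = run_in_order([check_expects_empty_cache, check_writes_cache])
reverse = run_in_order([check_writes_cache, check_expects_empty_cache])
print(forward)
print(reverse)
```

The code is identical in both runs; only the execution order differs, and only the victim's outcome changes. Rerunning a suite in a fixed order would likely never expose this kind of flakiness, which is why order-dependent tests are usually probed by shuffling or reversing the order.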
Findings
Flakiness appears to be as prevalent in Python as in Java.
Order dependency causes 59% of flaky tests in Python.
Detecting flaky tests with confidence often requires around 170 reruns.
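To see why rerun counts can grow so large, here is a rough back-of-the-envelope sketch. It assumes a simplified model that is not taken from the paper: each run of a flaky test fails independently with a fixed probability, so the chance of seeing at least one failure in n runs is 1 - (1 - p)^n. The function names and the default confidence level of 0.95 are assumptions made for illustration.

```python
import math


def detection_probability(failure_rate: float, reruns: int) -> float:
    """Chance of observing at least one failure in `reruns` independent runs."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if reruns < 0:
        raise ValueError("reruns must be non-negative")
    return 1.0 - (1.0 - failure_rate) ** reruns


def reruns_needed(failure_rate: float, confidence: float = 0.95) -> int:
    """Smallest number of runs so that a failure shows up with the given confidence.

    Assumes independent runs with a constant per-run failure probability.
    """
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # Solve 1 - (1 - p)^n >= c for n.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


for rate in (0.5, 0.1, 0.01):
    print(rate, reruns_needed(rate))
```

Under this model, a test that fails in half of its runs is caught after a handful of reruns, while one that fails in 1% of runs needs a few hundred. The model does not cover order-dependent tests, whose outcome depends on the execution order rather than on chance alone.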
Abstract
Tests that cause spurious failures without any code changes, i.e., flaky tests, hamper regression testing, increase maintenance costs, may shadow real bugs, and decrease trust in tests. While the prevalence and importance of flakiness are well established, prior research focused on Java projects, thus raising the question of how the findings generalize. In order to provide a better understanding of the role of flakiness in software development beyond Java, we empirically study the prevalence, causes, and degree of flakiness within software written in Python, one of the currently most popular programming languages. For this, we sampled 22352 open source projects from the popular PyPI package index, and analyzed their 876186 test cases for flakiness. Our investigation suggests that flakiness is as prevalent in Python as it is in Java. The reasons, however, are different: Order…
