BugSwarm: Mining and Continuously Growing a Dataset of Reproducible   Failures and Fixes

David A. Tomassi; Naji Dmeiri; Yichen Wang; Antara Bhowmick; Yen-Chuan; Liu; Premkumar Devanbu; Bogdan Vasilescu; Cindy Rubio-Gonz\'alez

arXiv:1903.06725·cs.SE·July 24, 2019

BugSwarm: Mining and Continuously Growing a Dataset of Reproducible Failures and Fixes

David A. Tomassi, Naji Dmeiri, Yichen Wang, Antara Bhowmick, Yen-Chuan, Liu, Premkumar Devanbu, Bogdan Vasilescu, Cindy Rubio-Gonz\'alez

PDF

TL;DR

BugSwarm is a toolset that automatically collects and maintains a large, diverse dataset of real-world software failures and fixes, enabling better evaluation of software quality methods.

Contribution

It introduces a scalable approach to continuously gather and archive reproducible fail-pass pairs from CI environments, creating a valuable dataset for research.

Findings

01

Collected 3,091 fail-pass pairs in Java and Python

02

Successfully automated the detection and archiving of fail-pass activities

03

Dataset is continuously growing and fully reproducible

Abstract

Fault-detection, localization, and repair methods are vital to software quality; but it is difficult to evaluate their generality, applicability, and current effectiveness. Large, diverse, realistic datasets of durably-reproducible faults and fixes are vital to good experimental evaluation of approaches to software quality, but they are difficult and expensive to assemble and keep current. Modern continuous-integration (CI) approaches, like Travis-CI, which are widely used, fully configurable, and executed within custom-built containers, promise a path toward much larger defect datasets. If we can identify and archive failing and subsequent passing runs, the containers will provide a substantial assurance of durable future reproducibility of build and test. Several obstacles, however, must be overcome to make this a practical reality. We describe BugSwarm, a toolset that navigates these…

Figures7

Click any figure to enlarge with its caption.

Tables3

Table 1. TABLE I: BugSwarm ’s main metadata attributes

Attribute Type	Description
Project	GitHub slug, primary language, build system, and test framework
Reproducibility	Total number of attempts, and number of successful attempts to reproduce pair
Pull Request	Pull request #, merge timestamp, and branch
Travis-CI Job	Travis-CI build ID, Travis-CI job ID, number of executed and failed tests, names of the failed tests, trigger commit, and branch name
Image Tag	Unique image tag (simultaneously serves as a reference to a particular Docker image)

Table 2. TABLE II: Mined Fail-Pass Pairs

	Push Events					Pull Request Events
Language	Failed Jobs	All Pairs	Available	Docker	w/Image	Failed Jobs	All Pairs	Available	Docker	w/Image
Java	320,918	80,804	71,036	50,885	29,817	250,349	63,167	24,877	20,407	9,509
Python	778,738	115,084	103,175	65,924	37,199	1,190,186	188,735	62,545	45,878	24,740
Grand Total	1,099,656	195,888	174,211	116,809	67,016	1,440,535	251,902	87,422	66,285	34,249

Table 3. TABLE V: Sources of Unreproducibility

Reason	# Pairs
Failed to install dependency	59
URL no longer valid or network issue	57
Travis-CI command issue	38
Project-specific issue	22
Reproducer did not finish	14
Permission issue	4
Total	194

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

BugSwarm: Mining and Continuously Growing a Dataset of

Reproducible Failures and Fixes

David A. Tomassi2, Naji Dmeiri2, Yichen Wang2, Antara Bhowmick2

Yen-Chuan Liu2, Premkumar T. Devanbu2, Bogdan Vasilescu3, Cindy Rubio-González2

2University of California, Davis {datomassi, nddmeiri, eycwang, abhowmick, yclliu, ptdevanbu, crubio}@ucdavis.edu

3Carnegie Mellon University [email protected]

Abstract

Fault-detection, localization, and repair methods are vital to software quality; but it is difficult to evaluate their generality, applicability, and current effectiveness. Large, diverse, realistic datasets of durably-reproducible faults and fixes are vital to good experimental evaluation of approaches to software quality, but they are difficult and expensive to assemble and keep current. Modern continuous-integration (CI) approaches, like Travis-CI, which are widely used, fully configurable, and executed within custom-built containers, promise a path toward much larger defect datasets. If we can identify and archive failing and subsequent passing runs, the containers will provide a substantial assurance of durable future reproducibility of build and test. Several obstacles, however, must be overcome to make this a practical reality. We describe BugSwarm, a toolset that navigates these obstacles to enable the creation of a scalable, diverse, realistic, continuously growing set of durably reproducible failing and passing versions of real-world, open-source systems. The BugSwarm toolkit has already gathered 3,091 fail-pass pairs, in Java and Python, all packaged within fully reproducible containers. Furthermore, the toolkit can be run periodically to detect fail-pass activities, thus growing the dataset continually.

Index Terms:

Bug Database, Reproducibility, Software Testing, Program Analysis, Experiment Infrastructure

I Introduction

Software defects have major impacts on the economy, on safety, and on the quality of life. Diagnosis and repair of software defects consumes a great deal of time and money. Defects can be treated more effectively, or avoided, by studying past defects and their repairs. Several software engineering sub-fields, e.g., program analysis, testing, and automatic program repair, are dedicated to developing tools, models, and methods for finding and repairing defects. These approaches, ideally, should be evaluated on realistic, up-to-date datasets of defects so that potential users have an idea of how well they work. Such datasets should contain fail-pass pairs, consisting of a failing version, which may include a test set that exposes the failure, and a passing version including changes that repair it. Given this, researchers can evaluate the effectiveness of tools that perform fault detection, localization (static or dynamic), or fault repair. Thus, research progress is intimately dependent on high-quality datasets of fail-pass pairs.

There are several desirable properties of these datasets of fail-pass pairs. First, scale: enough data to attain statistical significance on tool evaluations. Second, diversity: enough variability in the data to control for factors such as project scale, maturity, domain, language, defect severity, age, etc., while still retaining enough sample size for sufficient experimental power. Third, realism: defects reflecting actual fixes made by real-world programmers to repair real mistakes. Fourth, currency: a continuously updated defect dataset, keeping up with changes in languages, platforms, libraries, software function, etc., so that tools can be evaluated on bugs of current interest and relevance. Finally, and most crucially, defect data should be durably reproducible: defect data preserved in a way that supports durable build and behavior reproduction, robust to inevitable changes to libraries, languages, compilers, related dependencies, and even the operating system.111While it is impossible to guarantee this in perpetuity, we would like to have some designed-in resistance to change.

Some hand-curated datasets (e.g., Siemens test suite [23], the SIR repository [21], Defects4J [24]) provide artifact collections to support controlled experimentation with program analysis and testing techniques. However, these collections are curated by hand, and are necessarily quite limited in scale and diversity; others incorporate small-sized student homeworks [25], which may not reflect development by professionals. Some of these repositories often rely on seeded faults; natural faults, from real programmers, would provide more realism. At time of creation, these are (or rather were) current. However, unless augmented through continuous and expensive manual labor, currency will erode. Finally, to the extent that they have dependencies on particular versions of libraries and operating systems, their future reproducibility is uncertain.

The datasets cited above have incubated an impressive array of innovations and are well-recognized for their contribution to research progress. However, we believe that datasets of greater scale, diversity, realism, currency, and durability will lead to even greater progress. The ability to control for covariates, without sacrificing experimental power, will help tool-builders and empirical researchers obtain results with greater discernment, external validity, and temporal stability. However, how can we build larger defect datasets without heavy manual labor? Finding specific defect occurrences, and creating recompilable and runnable versions of failing and passing software is difficult for all but the most trivial systems: besides the source code, one may also need to gather specific versions of libraries, dependencies, operating systems, compilers, and other tools. This process requires a great deal of human effort. Unless this human effort can somehow be automated away, we cannot build large-scale, diverse, realistic datasets of reproducible defects that continually maintain currency. But how can we automate this effort?

We believe that the DevOps- and OSS-led innovations in cloud-based continuous integration (CI) hold the key. CI services, like Travis-CI [13], allow open-source projects to outsource integration testing. OSS projects, for various reasons, have need for continuous, automated integration testing. In addition, modern practices such as test-driven development have led to much greater abundance of automated tests. Every change to a project can be intensively and automatically tested off-site, on a cloud-based service; this can be done continually, across languages, dependencies, and runtime platforms. For example, typical GitHub projects require that each pull request (PR) be integration tested, and failures fixed, before being vetted or merged by integrators [22, 32]. In active projects, the resulting back-and-forth between PR contributors and project maintainers naturally creates many fail-pass pair records in the pull request history and overall project history.

Two key technologies underlie this capability: efficient, customizable, container-based virtualization simplifies handling of complex dependencies, and scripted CI servers allows custom automation of build and test procedures. Project maintainers create scripts that define the test environment (platforms, dependencies, etc.) for their projects; using these scripts, the cloud-based CI services construct virtualized runtimes (typically Docker containers) to build and run the tests. The CI results are archived in ways amenable to mining and analysis. We exploit precisely these CI archives, and the CI technology, to create an automated, continuously growing, large-scale, diverse dataset of realistic and durably reproducible defects.

In this paper, we present BugSwarm, a CI harvesting toolkit, together with a large, growing dataset of durably reproducible defects. The toolkit enables maintaining currency and augmenting diversity. BugSwarm exploits archived CI log records to create detailed artifacts, comprising buggy code versions, failing regression tests, and bug fixes. When a successive pair of commits, the first, whose CI log indicates a failed run, and the second, an immediately subsequent passing run, is found, BugSwarm uses the project’s CI customization scripts to create an artifact: a fully containerized virtual environment, comprising both versions and scripts to gather all requisite tools, dependencies, platforms, OS, etc. BugSwarm artifacts allow full build and test of pairs of failing/passing runs. Containerization allows these artifacts to be durably reproducible. The large scale and diversity of the projects using CI services allows BugSwarm to also capture a large, growing, diverse, and current collection of artifacts.

Specifically, we make the following contributions:

•

We present an approach that leverages CI to mine fail-pass pairs in open source projects and automatically attempts to reproduce these pairs in Docker containers (Section III).

•

We show that fail-pass pairs are frequently found in open source projects and discuss the challenges in reproducing such pairs (Section IV).

•

We provide the BugSwarm dataset of 3,091 artifacts, for Java and Python, to our knowledge the largest, continuously expanding, durably reproducible dataset of fail-pass pairs, and describe the general characteristics of the BugSwarm artifacts (Section IV).222The BugSwarm dataset is available at http://www.bugswarm.org.

We provide background and further motivation for BugSwarm in Section II. We describe limitations and future work in Section V. Finally, we discuss related work in Section VI and conclude in Section VII.

II Background and Motivation

Modern OSS development, with CI services, provides an enabling ecosystem of tools and data that support the creation of BugSwarm. Here we describe the relevant components of this ecosystem and present a motivating example.

II-A The Open-Source CI Ecosystem

Git and GitHub. Git is central to modern software development. Each project has a repository. Changes are added via a commit, which has a unique identifier, derived with a SHA-1 hash. The project history is a sequence of commits. Git supports branching. The main development line is usually maintained in a branch called master. GitHub is a web-based service hosting Git repositories. GitHub offers forking capabilities, i.e., cloning a repository but maintaining the copy online. GitHub supports the pull request (PR) development model: project maintainers decide on a case-by-case basis whether to accept a change. Specifically, a potential contributor forks the original project, makes changes, and then opens a pull request. The maintainers review the PR (and may ask for additional changes) before the request is merged or rejected.

Travis-CI Continuous Integration. Travis-CI is the most popular cloud-hosted CI service that integrates with GitHub; it can automatically build and test commits or PRs. Travis-CI is configured via settings in a .travis.yml file in the project repository, specifying all the environments in which the project should be tested. A Travis-CI build can be initiated by a push event or a pull request event. A push event occurs when changes are pushed to a project’s remote repository on a branch monitored by Travis-CI. A pull request event occurs when a PR is opened and when additional changes are committed to the PR. Travis-CI builds run a separate job for each configuration specified in the .travis.yml file. The build is marked as “passed” when all its jobs pass.

Docker. Docker is a lightweight virtual machine service that provides application isolation, immutability, and customization. An application can be packaged together with code, runtime, system tools, libraries, and OS into an immutable, stand-alone, custom-built, persistable Docker image (container), which can be run anytime, anywhere, on any platform that supports Docker. In late 2014, Travis-CI began running builds and tests inside Docker containers, each customized for a specific run, as specified in the Travis-CI .travis.yml files. Travis-CI maintains some of its base images containing a minimal build environment. BugSwarm harvests these containers to create the dataset.

II-B Leveraging Travis-CI to Mine and Reproduce Bugs

We exploit Travis-CI to create BugSwarm. Figure 1 depicts the lifecycle of a Travis-CI-built and tested PR. A contributor forks the repository and adds three commits, up to prV1; she then opens a PR, asking that her changes be merged into the original repository. The creation of the PR triggers Travis-CI, which checks whether there are merge conflicts between the PR branch and master when the PR was opened (prV1 and baseV1). If not, Travis-CI creates a temporary branch from the base branch, into which the PR branch is merged to yield temp1. This merge is also referred to as a “phantom” merge because it disappears from the Git history after some time.333“Phantom” merges present special challenges, which are discussed later. Travis-CI then generates build scripts from the .travis.yml file and initiates a build, i.e., runs the scripts to compile, build, and test the project.

In our example, test failures cause the first build to fail; Travis-CI notifies the contributor and project maintainers, as represented by the dashed arrows in Figure 1. The contributor does her fix and updates the PR with a new commit, which triggers a new build. Again, Travis-CI creates the merge between the PR branch (now at prV2) and the base branch (still at baseV1) to yield temp2. The build fails again; apparently the fix was no good. Consequently, the contributor updates the PR by adding a new commit, prV3. A Travis-CI build is triggered in which the merge (temp3) between the PR branch (at prV3) and the base branch (now at baseV2) is tested.444Travis-CI creates each phantom merge on a separate temporary branch, but Figure 1 shows the phantom merges on a single branch for simplicity. This time, the build passes, and the PR is accepted and merged into the base branch.

Each commit is recorded in version control, archiving source code at build-time plus the full build configuration (.travis.yml file). Travis-CI records how each build fared (pass or fail) and archives a build log containing output of the build and test process, including the names of any failing tests.

Our core idea is that Travis-CI-built and tested pull requests (and regular commits) from GitHub, available in large volumes for a variety of languages and platforms, can be used to construct fail-pass pairs. In our example, the version of the code represented by the merge temp2 is “defective,” as documented by test failures in the corresponding Travis-CI build log. The subsequently “fixed” version (no test failures in the build log) is represented by temp3. Therefore, we can extract (1) a failing program version; (2) a subsequent, fixed program version; (3) the fix, i.e., the difference between the two versions; (4) the names of failing tests from the failed build log; (5) a full description of the build configuration.

Since each Travis-CI job occurs within a Docker container, we can re-capture that specific container image, thus rendering the event durably reproducible. Furthermore, if one could build an automated harvesting system that could continually mine Travis-CI builds and create Docker images that could persist these failures and fixes, this promises a way to create a dataset to provide all of our desired data: GitHub-level scale; GitHub-level diversity; realism of popular OSS projects; currency via the ability to automatically and periodically augment our dataset with recent events, and finally durable reproducibility via Docker images.

III BugSwarm Infrastructure

III-A Some Terminology

A project’s build history refers to all Travis-CI builds previously triggered. A build may include many jobs; for example, a build for a Python project might include separate jobs to test with Python versions 2.6, 2.7, 3.0, etc.

A commit pair is a 2-tuple of Git commit SHAs that each triggered a Travis-CI build in the same build history. The canonical commit pair consists of a commit whose build fails the tests followed by a fix commit whose build passes the tests. The terms build pair and job pair refer to a 2-tuple of Travis-CI builds or jobs, respectively, from a project’s build history. For a given build, the trigger commit is the commit that, when pushed to the remote repository, caused Travis-CI to start a build.

BugSwarm has four components: PairMiner, PairFilter, Reproducer, and Analyzer. These components form the pipeline that curates BugSwarm artifacts and are designed to be relatively independent and general. This section describes the responsibilities and implementation of each component, and a set of supporting tools that facilitate usage of the dataset.

III-B Design Challenges

The tooling infrastructure is designed to handle certain specific challenges, listed below, that arise when one seeks to continuously and automatically mine Travis-CI. In each case, we list the tools that actually address the challenges.

Pair coherence. Consecutive commits in a Git history may not correspond to consecutive Travis-CI builds. A build history, which Travis-CI retains as a linear series of builds, must be traversed and transformed into a directed graph so that pairs of consecutive builds map to pairs of consecutive commits. Git’s non-linear nature makes this non-trivial. (PairMiner)

Commit recovery. To reproduce a build, one needs to find the trigger commit. There are several (sub-)challenges here. First, temporary merge commits like temp1,2,3 in Figure 1 are the ones we need to extract, but these are not retained by Travis-CI. Second, Git’s powerful history-rewriting capabilities allow commits to be erased from history; developers can and do collapse commits like prV1,2,3 into a single commit, thus frustrating the ability to recover the consequent phantom merge commits. (PairMiner, PairFilter)

Image recovery. In principle, Travis-CI creates and retains Docker images that allow re-creation of build and test events. In practice, these images are not always archived as expected and so must be reconstructed. (PairFilter, Reproducer)

Runtime recovery. Building a specific project version often requires satisfying a large number of software dependencies on tools, libraries, and frameworks; all or some of these may have to be “time-traveled” to an earlier version. (Reproducer)

Test flakiness. Even though Travis-CI test behavior is theoretically recoverable via Docker images, tests may behave non-deterministically because of concurrency or environmental (e.g., external web service) changes. Such flaky tests lead to flaky builds, which both must be identified for appropriate use in experiments. (Reproducer)

Log analysis. Once a pair is recoverable, BugSwarm tries to determine the exact nature of the failure from the logs, which are not well structured and have different formats for each language, build system, and test toolset combination. Thus the logs must be carefully analyzed to recover the nature of the failure and related metadata (e.g., raised exceptions, failed test names, etc.), so that the pair can be documented. (Analyzer)

III-C Mining Fail-Pass Pairs

PairMiner extracts from a project’s Git and build histories a set of fail-pass job pairs (Algorithm 1). PairMiner takes as input a GitHub slug and produces a set of fail-pass job pairs annotated with trigger commit information for each job’s parent build. The PairMiner algorithm involves (1) delinearizing the project’s build history, (2) extracting fail-pass build pairs, (3) assigning commits to each pair, and (4) extracting fail-pass job pairs from each fail-pass build pair.

Analyzing build history. PairMiner first downloads the project’s entire build history with the Travis-CI API. For each build therein, PairMiner notes the branch and (if applicable) the pull request containing the trigger commit, PairMiner first resolves the build history into lists of builds that were triggered by commits on the same branch or pull request. PairMiner recovers the results of the build and its jobs (passed, failed, errored, or canceled), the .travis.yml configuration of each job, and the unique identifiers of the build and its jobs using the Travis-CI API.

Identifying fail-pass build pairs. Using the build and job identifiers, PairMiner finds consecutive pairs where the first build failed and the second passed. Builds are considered from all branches, including the main line and any perennials, and both merged and unmerged pull requests. Next, the triggering commits are found, and recovered from Git history.

Finding trigger commits. If the trigger commit was a push event, then PairMiner can find its SHA via the Travis-CI API. For pull request triggers, we need to get the pull request and base branch head SHAs, and re-create the phantom merge. Unfortunately, neither the trigger commit nor the base commit are stored by Travis-CI; recreating them is quite a challenge. Fortunately, the commit message of the phantom commit, which is stored by Travis-CI, contains this information; we follow Beller et al. [17] to extract this information. This approach is incomplete but is the best available.

Travis-CI creates temporary merges for pull-request builds. While temporary merges may no longer be directly accessible, the information for such builds (the head SHAs and base SHAs of the merges) are accessible through the GitHub API. We resort to GitHub archives to retrieve the code for the commits that are no longer in Git history.

Even if the trigger commit is recovered from the phantom merge commit, one problem remains: developers might squash together all commits in a pull request, thus erasing the constituent commits of the phantom merge right out of the Git history. In addition, trigger commits for push event builds can sometimes also be removed from the Git history by the project personnel. As a result, recreating this merge is not always possible; we later show the proportion for which we were able to reset the repository to the commits in the fail-pass pairs. The two steps of phantom recovery—first finding the trigger commits and then ensuring that the versions are available in GitHub —are described in Algorithm 2.

Extracting fail-pass job pairs. PairMiner now has a list of fail-pass build pairs for the project. As described in Section II, each build can have many jobs, one for each supported environment. A build fails if any one of its jobs fails and passes if all of its jobs pass. Given a failing build, PairMiner finds pairs of jobs, executed in the same environment, where the first failed and the second passed. Such a pair only occurs when a defective version was fixed via source code patches and not by changes in the execution environment (see Algorithm 1).

III-D Finding Essential Information for Reproduction

Pairs identified by PairMiner must be assembled into reproducible containers. To stand a chance of reproducing a job, one must have access to, at a minimum, these essentials: (1) the state of the project at the time the job was executed and (2) the environment in which the job was executed. For each job in the pipeline, PairFilter checks that these essentials can be obtained. If the project state was deemed recoverable by PairMiner, PairFilter retrieves the original Travis-CI log of the job and extracts information about the execution environment. Using timestamps and instance names in the log, PairFilter determines if the job executed in a Docker container and, if so, whether the corresponding image is still accessible. If the log is unavailable, the job was run before Travis-CI started using Docker, or the particular image is no longer publicly accessible, then the job is removed from the pipeline.

III-E Reproducing Fail-Pass Pairs

Reproducer checks if each job is durably reproducible. This takes several steps, described below.

Generating the job script. travis-build,555https://github.com/travis-ci/travis-build a component of Travis-CI, produces a shell script from a .travis.yml file for running a Travis-CI job. Reproducer then alters the script to reference a specific past version of the project, rather than the latest.

Matching the environment.

To match the original job’s runtime environment, Reproducer chooses from the set of Travis-CI’s publicly available Docker images, from Quay and DockerHub, based on (1) the language of the project, as indicated by its .travis.yml configuration, and (2) a timestamp and instance name in the original job log that indicate when that image was built with Docker’s tools.

Reverting the project. For project history reversion, Reproducer clones the project and resets its state using the trigger commit mined by PairMiner. If the trigger was on a pull request, Reproducer re-creates the phantom merge commit using the trigger and base commits mined by PairMiner. If any necessary commits were not found during the mining process, Reproducer downloads the desired state of the project directly from a zip archive maintained by GitHub.666 GitHub allows one to download a zip archive of the entire project’s file structure at a specific commit. Since this approach produces a stand-alone checkout of a project’s history (without any of the Git data stores), Reproducer uses this archive only if a proper clone and reset is not possible. Finally, Reproducer plants the state of the project inside the execution environment to reproduce the job.

Reproducing the job. Reproducer creates a new Docker image, as described in Section III-E, runs the generated job script, and saves the resulting output stream in a log file. Reproducer can run multiple jobs in parallel. Reproducer collects the output logs from all the jobs it attempts to reproduce and sends them to Analyzer for parsing.

Bibliography32

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1ant [Accessed 2019] Apache Ant. http://ant.apache.org , Accessed 2019.
2azu [Accessed 2019] Microsoft Azure. http://azure.microsoft.com , Accessed 2019.
3cob [Accessed 2019] Cobertura. https://github.com/cobertura/cobertura/wiki , Accessed 2019.
4def [Accessed 2019] Defects 4J. https://github.com/rjust/defects 4j , Accessed 2019.
5err [Accessed 2019] Error Prone. https://github.com/google/error-prone , Accessed 2019.
6gra [Accessed 2019] Gradle Build Tool. https://gradle.org , Accessed 2019.
7jun [Accessed 2019] J Unit Test Framework. https://junit.org/junit 5 , Accessed 2019.
8mav [Accessed 2019] Apache Maven Project. https://maven.apache.org , Accessed 2019.