SZZ Unleashed: An Open Implementation of the SZZ Algorithm -- Featuring   Example Usage in a Study of Just-in-Time Bug Prediction for the Jenkins   Project

Markus Borg; Oscar Svensson; Kristian Berg; Daniel Hansson

arXiv:1903.01742·cs.SE·August 20, 2019


TL;DR

This paper introduces SZZ Unleashed, an open-source implementation of the SZZ algorithm for identifying bug-introducing changes in git repositories, demonstrated through a case study on the Jenkins project and its application in bug prediction.

Contribution

It provides the first publicly available, tested implementation of the SZZ algorithm, facilitating reproducible research and community collaboration in bug analysis.

Findings

01 Open SZZ implementation available on GitHub
02 Applied to Jenkins project for bug prediction study
03 Encourages community contributions and further development


Tables

Table 1. Descriptive statistics of the extracted Jenkins dataset and five analogous datasets from previous work (Kamei et al., 2013).

Dataset	#Bugs	#Fixes	#(Fixes $\cap$ Bugs)	#Commits
Jenkins	954 (3.6%)	2,979 (11.3%)	808 (3.1%)	26,378
Bugzilla	1,696 (36.1%)	3,973 (86.0%)	1,586 (34.3%)	4,620
Columba	1,361 (30.5%)	1,463 (32.8%)	439 (9.6%)	4,455
JDT	5,089 (14.4%)	10,799 (30.5%)	2,218 (6.3%)	35,386
Mozilla	5,149 (5.2%)	62,888 (64.0%)	3,943 (4.0%)	98,275
Postgres	5,119 (25.1%)	8,933 (43.7%)	2,043 (10.0%)	20,431

Table 2. Features used to represent commits.

ID	Feature	Relative significance
Ft1	Lines of code added / Total lines of code	0.17
Ft2	Lines of code deleted / Total lines of code	0.04
Ft3	Files churned / Number of files	0.08
Ft4	Lines of code in previous version	0.07
Ft5	Number of modified subsystems	0.11
Ft6	Number of modified sub-directories	0.09
Ft7	Entropy (spreading of changes)	0.16
Ft8	Purpose of a change (e.g., bug fix)	0.03
Ft9	Number of previous committers	0.08
Ft10	Time between committer’s contributions	0.04
Ft11	Number of unique changes	0.04
Ft12	Overall experience of committer	0.04
Ft13	Recent experience of committer	0.03
Ft14	Number of highly coupled files	0.00
Ft15	Number of coupled files for all degrees	0.01
Ft16	Number of non-modified coupled files	0.01

Table 3. Classification accuracy for JIT bug prediction.

Stratified 10-fold Cross-Validation
Sampling technique	Precision	Recall	F1 score
Baseline	0.156 $\pm$ 0.246	0.026 $\pm$ 0.042	0.029 $\pm$ 0.034
SMOTE	0.123 $\pm$ 0.076	0.212 $\pm$ 0.136	0.154 $\pm$ 0.096
SMOTE+Tomek	0.117 $\pm$ 0.071	0.206 $\pm$ 0.130	0.148 $\pm$ 0.091
Cluster Centroids	0.037 $\pm$ 0.001	0.945 $\pm$ 0.037	0.072 $\pm$ 0.002
Online Change Classification
Sampling technique	Precision	Recall	F1 score
Baseline	0.210 $\pm$ 0.177	0.017 $\pm$ 0.014	0.031 $\pm$ 0.026
SMOTE	0.147 $\pm$ 0.041	0.104 $\pm$ 0.034	0.116 $\pm$ 0.031
SMOTE+Tomek	0.163 $\pm$ 0.018	0.126 $\pm$ 0.043	0.137 $\pm$ 0.030
Cluster Centroids	0.028 $\pm$ 0.004	0.917 $\pm$ 0.037	0.054 $\pm$ 0.008


Full text

SZZ Unleashed: An Open Implementation of the SZZ Algorithm

– Featuring Example Usage in a Study of Just-in-Time Bug Prediction for the Jenkins Project

Markus Borg
ICT SICS, RISE Research Institutes of Sweden AB, Lund, Sweden

Oscar Svensson
Dept. of Computer Science, Lund University, Lund, Sweden

Kristian Berg
Dept. of Computer Science, Lund University, Lund, Sweden

Daniel Hansson
Verifyter AB, Lund, Sweden

Abstract.

Numerous empirical software engineering studies rely on detailed information about bugs. While issue trackers often contain information about when bugs were fixed, details about when they were introduced to the system are often absent. As a remedy, researchers often rely on the SZZ algorithm as a heuristic approach to identify bug-introducing software changes. Unfortunately, as reported in a recent systematic literature review, few researchers have made their SZZ implementations publicly available. Consequently, there is a risk that research effort is wasted as new projects based on SZZ output need to initially reimplement the approach. Furthermore, there is a risk that newly developed (closed source) SZZ implementations have not been properly tested, thus conducting research based on their output might introduce threats to validity. We present SZZ Unleashed, an open implementation of the SZZ algorithm for git repositories. This paper describes our implementation along with a usage example for the Jenkins project, and concludes with an illustrative study on just-in-time bug prediction. We hope to continue evolving SZZ Unleashed on GitHub, and warmly invite the community to contribute.

SZZ, defect prediction, mining software repositories, issue tracking


1. Introduction

Empirical software engineering research often relies on detailed bug information. Bug information is often maintained in issue trackers such as Jira or Bugzilla, which has enabled numerous publications related to mining software repositories (Cavalcanti et al., 2014; de Freitas Farias et al., 2016). However, while issue trackers often contain details about both bugs (e.g., version information, references to failed test case executions) and the subsequent bug fixes (e.g., who developed the fix and a reference to a specific commit with the resolution), information about the root cause of a bug and when it was introduced is often missing.

In many software engineering research studies, knowing which individual commit introduced a bug is essential – examples include work on fault prediction (Hall et al., 2012), test case selection (Engström et al., 2010), and static code analysis (Rahman et al., 2014). One approach to address missing bug information is to heuristically deduce it. Successful approaches to extend the information stored in issue trackers can be of great value to empirical software engineering. However, for such an approach to be useful, it has to deliver reliable output that both industry and academia trust (Fenton and Neil, 1999; Czerwonka et al., 2011).

A popular approach to extend bug information is to propose “bug-introducing changes” for the existing Bug Reports (BR). The dominant algorithm to do this is called SZZ, after the three authors of the seminal paper (Sliwerski et al., 2005): Śliwerski, Zimmermann, and Zeller. In a recent study on reproducibility and credibility of software engineering research, Rodriguez-Perez et al. presented a systematic literature review on research that used SZZ (Rodriguez-Perez et al., 2018). They identified 187 studies, and found that researchers typically implement their own versions of SZZ rather than building on what others have previously done. Rodriguez-Perez et al. suggest that one reason is that researchers rarely make their SZZ implementations publicly available, thus any researcher relying on SZZ must first implement it from scratch. While there are some partial SZZ implementations available (Rosen et al., 2015; Correia, 2017), Rodriguez-Perez et al. call for researchers to publish source code to allow others to fork the project.

In this paper, we respond to Rodriguez-Perez et al.’s call for improved reproducibility through an open source implementation of SZZ. We introduce SZZ Unleashed – available on GitHub under an MIT license since June 2018 (Svensson and Berg, 2018). The source code was developed as part of an MSc thesis project at Axis Communications AB in Lund, Sweden. We have tested SZZ Unleashed on the repository of the Jenkins automation server (https://github.com/jenkinsci/jenkins) and used the results to train a random forest classifier for Just-In-Time (JIT) bug prediction (Kamei et al., 2013), i.e., to identify high-risk changes at commit-time. At the time of this writing, SZZ Unleashed has been forked four times – at least twice by senior researchers from academia – and we have approved the first external pull request. Since members of the research community have already found and forked SZZ Unleashed, we conclude that there is a demand for our implementation.

The rest of the paper is structured as follows. In Section 2 we introduce the SZZ algorithm and some later improvements. Section 3 presents the implementation of SZZ Unleashed along with an example for the Jenkins project. In Section 4, we illustrate how the SZZ Unleashed output for the Jenkins project can be used, by training a random forest classifier for JIT bug prediction. Finally, Section 5 concludes the paper and presents how we would like SZZ Unleashed to evolve.

2. Background – The SZZ algorithm

The SZZ algorithm was developed as an approach to identify bug-introducing commits in a software repository. It was introduced by Śliwerski et al. (Sliwerski et al., 2005), and was later given its name after the initials of the three authors. While the SZZ algorithm was developed for the CVS version control system and its corresponding commit practices, SZZ has since been adapted to software repositories that use git. The SZZ algorithm is organized in two consecutive phases.

In the first phase, BRs in the issue tracker are linked to bug-fixing commits. This is done by using regular expressions to find explicit references to BRs in commit messages. If the content of the issue tracker is less structured, then commit messages that contain the word “fix” – or whatever convention is used in the project under study – are assumed to be bug fixes. For each of the bug-fixing commits that were identified, the modified lines in the source code are extracted.

Figure 1 shows the steps in the second phase. For each bug-fixing commit from the first phase (A), SZZ uses the git blame command (B) to identify all commits that previously made changes to the same lines of code. Git blame shows what revision and author last modified each line of a file, i.e., executing git blame on a bug-fixing commit results in a set of commits that might have introduced the bug. We refer to these as bug-introducing commit candidates (C).

For each candidate, SZZ determines whether it can be ruled out as bug-introducing or not (D). First, the commit time of a candidate is compared to the time when the corresponding BR was submitted. If the commit time is later than the report submission time, the candidate can be bug-introducing only if it is 1) a partial fix, i.e., a fix that did not completely resolve the bug it intended to resolve – as made evident by a later bug-fixing commit for the same issue, or if it is 2) responsible for another bug, i.e., the candidate is responsible for a bug different from the one resolved by the bug-fixing commit that blamed the candidate. This means that another bug-fixing commit can have its bug origin in this commit because they have both made changes to the same file. We present more details in Section 3, where we describe our implementation of SZZ.
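The filtering step (D) can be sketched as a small predicate. This is a minimal illustration under assumptions, not the SZZ Unleashed code: the `Candidate` record, its two boolean flags, and the function name `keep_candidate` are hypothetical, and the flags are assumed to have been computed elsewhere from the set of bug-fixing commits.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Candidate:
    """A bug-introducing commit candidate found by git blame (hypothetical record)."""
    sha: str
    commit_time: datetime
    is_partial_fix: bool = False       # a later fix exists for the same issue
    blamed_by_other_fix: bool = False  # a fix for a different BR also blames it


def keep_candidate(candidate: Candidate, report_time: datetime) -> bool:
    """Return False when the candidate can be ruled out as bug-introducing."""
    if candidate.commit_time <= report_time:
        # Committed before the BR was submitted: cannot be ruled out by time.
        return True
    # Committed after the report: only kept if it is a partial fix or the
    # origin of another bug.
    return candidate.is_partial_fix or candidate.blamed_by_other_fix
```

For a BR submitted on 2018-01-10, a candidate committed on 2018-01-01 is kept, whereas a candidate committed on 2018-02-01 is kept only if one of the two flags is set.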

Kim et al. presented improvements to the SZZ algorithm (Kim et al., 2006), including annotation graphs created by origin analysis (Godfrey and Zou, 2005) and an approach to filter out cosmetic changes to source code. Figure 2 presents an annotation graph for a source code file, created by mapping different revisions of the source code by using the git annotate command (also available in CVS and SVN, now replaced by git blame). Each node shows a version of a single line of code and edges illustrate relations between revisions. The first four lines (A) are unmodified between the three revisions. Nodes 4–6 in revision 2 did not exist in revision 1, i.e., these lines of code were inserted. Consequently, the fifth line in revision 1 (node 4) is instead mapped to node 7 in revision 2. Nodes 16 and 17 in revision 2 are not mapped to any lines in revision 3, i.e., they were deleted.

If multiple adjacent lines are changed, then the annotation graph technique will fail to map those lines between revisions. Adjacent lines of code that are modified will simply be mapped to the same set of lines in the next revision. This can be seen for nodes 7–14 in revision 1, which are mapped to nodes 10–17 in revision 2. Consequently, SZZ relying on annotation graphs is not very precise in tracing changes across revisions.

Williams and Spacco addressed the lack of SZZ precision by replacing the annotation graph with a distance-based approach they call line number mapping (Williams and Spacco, 2008). The core concept, based on work by Canfora et al. (Canfora et al., 2007), is that individual lines of code are mapped between revisions by calculating normalized Levenshtein edit distances between all candidate mappings, and considering the pair with the lowest distance to be a valid mapping. Although the mapping is not always correct, the authors claim that the added precision is useful for SZZ. The current implementation of SZZ Unleashed is based on Jaccard distances, but could easily be extended to other measures.
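To make the distance-based idea concrete, the sketch below maps each line of an old revision to its closest line in a new revision. It is a simplified illustration, not the procedure of Williams and Spacco or of SZZ Unleashed: the greedy per-line choice, the `threshold` of 0.4, and the whitespace tokenization used for the Jaccard distance are all assumptions made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Edit distance divided by the length of the longer string (0.0 to 1.0)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def jaccard_distance(a: str, b: str) -> float:
    """1 - |A n B| / |A u B| over whitespace-separated tokens (assumed tokenization)."""
    ta, tb = set(a.split()), set(b.split())
    union = ta | tb
    return 1.0 - len(ta & tb) / len(union) if union else 0.0


def map_lines(old, new, distance=normalized_levenshtein, threshold=0.4):
    """Map each old line index to the index of the closest new line, if close enough."""
    mapping = {}
    for i, old_line in enumerate(old):
        if not new:
            break
        j = min(range(len(new)), key=lambda k: distance(old_line, new[k]))
        if distance(old_line, new[j]) <= threshold:
            mapping[i] = j
    return mapping
```

Passing `distance=jaccard_distance` swaps the measure without touching the mapping logic. Note that this greedy version does not enforce a one-to-one mapping; two old lines may map to the same new line.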

3. SZZ Unleashed – Implementation

SZZ Unleashed is a Java implementation, with some supporting Python scripts, of the SZZ algorithm for git repositories. The implementation is based on the seminal paper by Śliwerski et al. (Sliwerski et al., 2005) and later enhancements by Williams and Spacco (Williams and Spacco, 2008). To facilitate interaction with git repositories, SZZ Unleashed uses the JGit library (Sohn et al., 2017) maintained by the Eclipse Foundation. Using this library, we reduced the use of text parsing and could work directly on the git revision structure. Working with SZZ Unleashed means following the general SZZ workflow:

(1) Extract closed BRs from an issue tracker (prerequisite step)
(2) Link individual BRs to bug-fixing commits (SZZ Phase 1)
(3) Identify bug-introducing commits for the bug-fixing commits (SZZ Phase 2)

We explain the implementation of SZZ Unleashed through a running example on the core repository of the Jenkins project. Jenkins constitutes an appropriate example with numerous contributors, including developers in proprietary organizations, consisting of roughly 1 MLoC (predominantly Java) with a well-managed Jira server for issue tracking. The Jenkins project uses a convention to explicitly state unique identifiers (ID) of BRs in bug-fixing commit messages, i.e., SZZ Phase 1 is straightforward. Furthermore, the Jira REST API and associated Jira Query Language (JQL) greatly simplify extraction of BRs.

Detailed instructions to get started with SZZ Unleashed are available in the README on GitHub (Svensson and Berg, 2018). Prerequisites to build SZZ Unleashed include Gradle and Java 8. Furthermore, to replicate the running example, Python is needed to run the scripts for extracting defect reports from Jenkins’ issue tracker and to process them into the input format used by SZZ Unleashed. Alternatively, users can download and run the Docker image available on GitHub.

3.1. Phase 1 – Bug-fixing commits

As shown in Figure 1, Phase 1 results in a set of bug-fixing commits. First, we need to extract BRs from an issue tracker. For the Jira server used by the Jenkins community, we execute the following JQL query:

project = JENKINS AND issuetype = Bug AND status in (Resolved, Closed) AND resolution = Fixed AND component = core AND created <= "2018-02-20 10:34" ORDER BY created DESC

where $issuetype$ eliminates other types of issues such as feature requests, $status$ eliminates issues that are still open, $resolution$ eliminates duplicated BRs, $component$ excludes issues concerning other repositories, $created$ is used to set a time interval, and ORDER BY sorts the BRs in reverse chronological order.
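As an illustration of how such a query could be submitted, the sketch below builds a request URL for the Jira REST API search resource (`/rest/api/2/search` with the `jql`, `startAt`, and `maxResults` parameters). The base URL passed in the usage example and the page size of 1,000 are assumptions; the scripts shipped with SZZ Unleashed may construct the request differently.

```python
from urllib.parse import urlencode

# The JQL query from the running example.
JQL = ('project = JENKINS AND issuetype = Bug AND status in (Resolved, Closed) '
       'AND resolution = Fixed AND component = core '
       'AND created <= "2018-02-20 10:34" ORDER BY created DESC')


def search_url(base_url: str, jql: str, start_at: int = 0, max_results: int = 1000) -> str:
    """Build a paginated search URL; base_url is whatever Jira server is studied."""
    params = {"jql": jql, "startAt": start_at, "maxResults": max_results}
    return f"{base_url.rstrip('/')}/rest/api/2/search?{urlencode(params)}"


# Hypothetical usage: request the second page of results.
url = search_url("https://issues.jenkins.io", JQL, start_at=1000)
```

Since Jira caps the number of issues returned per request, a client would typically loop over `start_at` until fewer than `max_results` issues come back.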

The unique IDs of the BRs are used to find bug-fixing commits by executing regular expression (regex) patterns on the git log of the software repository under study. For the Jenkins repository, three different formats for referencing BRs exist, namely JENKINS-XXX, HUDSON-XXX and #XXX, where XXX is the BR ID. The Python script below specifies the regex pattern that could be used, where $key$ is the ID of a BR formatted as JENKINS-XXX and $nbr$ is the associated number XXX. If there is a match for the #XXX pattern, we perform an extra regex search to verify that the commit message also contains the word ‘fix’, otherwise the corresponding commit is not considered as bug-fixing.

pattern = (key + r'\D|' + '#' + nbr +
           r'\D|HUDSON-' + nbr + r'\D')

The above regex pattern can match multiple commit messages for each BR. SZZ Unleashed uses another regex to identify the true bug-fixing commit among these matches, i.e., we exclude ‘merge’, ‘cherry pick’, and ‘noting’ commits. Among the remaining commits, SZZ Unleashed considers the most recent commit as bug-fixing. The following Python code shows the implementation:

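import re

def commit_selector_heuristic(commits):
    # Return the first commit that is not a merge, cherry-pick, or 'noting' commit.
    for commit in commits:
        if re.search('[Mm]erge|[Cc]herry|[Nn]oting', commit):
            continue
        return commit
    # Fall back to the first match if every commit was filtered out.
    return commits[0]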
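The two Phase 1 steps can be combined into one small, self-contained sketch. Several details are assumptions made for illustration: commits are given as `(sha, message)` pairs ordered most recent first, a newline is appended to each message so that the trailing `\D` can also match an ID at the very end of a message, the ‘fix’ check is case-insensitive, and the helper names `matches_report` and `select_fix` are hypothetical.

```python
import re

SKIP = re.compile(r"[Mm]erge|[Cc]herry|[Nn]oting")


def matches_report(key: str, message: str) -> bool:
    """Check whether a commit message references the BR with the given key."""
    nbr = key.split("-")[1]
    text = message + "\n"  # lets \D match when the ID ends the message
    if re.search(re.escape(key) + r"\D|HUDSON-" + nbr + r"\D", text):
        return True
    if re.search("#" + nbr + r"\D", text):
        # The bare #XXX form only counts if the message also mentions a fix.
        return re.search("fix", text, re.IGNORECASE) is not None
    return False


def select_fix(key: str, commits):
    """Return the sha of the commit considered bug-fixing for the BR, or None."""
    matches = [(sha, msg) for sha, msg in commits if matches_report(key, msg)]
    for sha, msg in matches:
        if not SKIP.search(msg):
            return sha
    return matches[0][0] if matches else None


# Hypothetical commit log, most recent first.
log = [
    ("c3", "Merge pull request #4321 from fork/JENKINS-4321"),
    ("c2", "[JENKINS-4321] Avoid NPE when queue is empty"),
    ("c1", "Refactoring related to #4321"),
    ("c0", "Unrelated change"),
]
fix = select_fix("JENKINS-4321", log)
```

Here the merge commit is skipped, the bare `#4321` reference without the word ‘fix’ is ignored, and `c2` is selected. The trailing `\D` prevents `JENKINS-432` from matching inside `JENKINS-4321`.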

3.2. Phase 2 – Bug-Introducing Commits

SZZ Unleashed largely implements the approach of Williams and Spacco (Williams and Spacco, 2008), i.e., line number mappings are used to backtrack through the change history. However, our implementation is not language-specific, thus we do not filter out cosmetic changes. Figure 3 shows an example of SZZ Unleashed Phase 2. Note that the graph does not show all lines of code, but rather the lines that were altered by the commits. Each changed line of source code is tracked to its creation or a more recent version. How deeply SZZ Unleashed should traverse the graph is configurable, but the default value is 3. Running git blame on Commit 6 in Figure 3 with $depth=1$ would find all but one commit (Commit 2). Using a higher depth setting, however, we can trace to the original commit of any line of code. Thus, with $depth \geq 3$, node 0 in Commit 6 would be traced to node 0 in Commit 2.

Suppose that Phase 1 of SZZ Unleashed identified Commit 3 as a bug-fixing commit for BR A. In Phase 2, git blame is used to identify Commit 2 and Commit 1 – these would be bug-introducing commit candidates. Next, suppose that Commit 2 was made after BR A was submitted, i.e., it is ‘Newer’ as shown in (D) in Figure 1, and its commit message does not identify it as a partial bug-fix, i.e., there is no explicit reference to a BR. However, Commit 2 can still be bug-introducing for another BR if any of Commit 4, Commit 5 or Commit 6 are bug-fixing commits. Suppose that Commit 6 is a bug-fixing commit for BR B, then Commit 2 will be categorized as a bug-introducing commit – since it made changes to the same lines of code.
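The effect of the depth setting can be illustrated with a depth-limited traversal. The sketch assumes the blame relation is already available as a dictionary from a commit to the commits that last touched the lines it changed; the dictionary, the commit names, and the function name are hypothetical and do not reproduce Figure 3 or the JGit-based implementation.

```python
from collections import deque


def trace_candidates(fix: str, blame: dict, depth: int = 3) -> set:
    """Collect commits reachable from a bug-fixing commit within `depth` blame steps."""
    seen = set()
    queue = deque([(fix, 0)])
    while queue:
        commit, level = queue.popleft()
        if level == depth:
            continue  # do not follow blame edges beyond the configured depth
        for earlier in blame.get(commit, ()):
            if earlier not in seen:
                seen.add(earlier)
                queue.append((earlier, level + 1))
    return seen


# Hypothetical blame relation: c6 changed lines last touched by c5 and c4, etc.
blame = {"c6": {"c5", "c4"}, "c5": {"c3"}, "c3": {"c2"}}
shallow = trace_candidates("c6", blame, depth=1)
deep = trace_candidates("c6", blame, depth=3)
```

With `depth=1` only the directly blamed commits are returned, while the default depth of 3 also reaches the commit three blame steps back in the history.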

4. Illustrative Study – JIT Bug Prediction

This section presents an example of how the output from SZZ Unleashed can be used. We use the output to train a classifier to identify bug-introducing commits, i.e., JIT bug prediction (referred to as JIT quality assurance by Kamei et al. (Kamei et al., 2013)). The overall idea is to indicate commits that might require particularly careful code reviews, i.e., providing risk profiles for individual commits.

4.1. Research goal and method

How to sample training and test data when evaluating classifiers intended for deployment in issue trackers is critical. First, previous work on supervised learning for issue trackers revealed that disregarding the time dimension, as is done in traditional cross-validation, might lead to overly positive results (Tan et al., 2015; Jonsson et al., 2016) – training a classifier on data “from the future” is apparently questionable. Second, class imbalance problems might require techniques for oversampling and undersampling, e.g., only 3.6% of the commits are bug-introducing in our Jenkins dataset (cf. Table 1).

We train random forest classifiers to predict bug-introducing commits, i.e., JIT bug prediction. We choose random forest for two reasons: 1) the trained models are reasonably interpretable and 2) Yang et al. previously obtained good results in a similar context (Yang et al., 2017). Based on initial trial runs, we set the number of trees to 200. Based on the Jenkins dataset created using SZZ Unleashed, we investigate two research questions:

RQ1: How do oversampling and undersampling affect the JIT bug prediction for highly imbalanced classes?

RQ2: Does cross-validation generate better results than a time-sensitive evaluation setup?

We investigate RQ1 by comparing a baseline without particular sampling techniques to three approaches that result in an equal proportion of positive and negative training examples – all available in imbalanced-learn, a library compatible with scikit-learn (Lemaitre et al., 2017). SMOTE oversamples, Cluster Centroids undersamples, and SMOTE+Tomek combines oversampling and undersampling. As shown in Table 1, our Jenkins dataset contains only 3.6% positive examples, i.e., the class imbalance problem is more evident than in previous work.
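The core idea behind SMOTE is to create synthetic minority examples by interpolating between a minority example and one of its nearest minority neighbours. The sketch below shows only that idea, using the standard library; it is not the imbalanced-learn implementation used in the study, and the function name, the default `k=5`, and the fixed seed are assumptions.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority examples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the chosen example (Euclidean).
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Hypothetical two-feature minority class.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new_points = smote(minority, n_new=5, k=2)
```

To balance a training set, `n_new` would be chosen as the difference between the number of majority and minority examples. Every synthetic point lies on a line segment between two real minority examples, which is why oversampling adds no information from outside the observed minority region.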

We study RQ2 by comparing stratified 10-fold cross-validation to “Online Change Classification” as described by Tan et al. (Tan et al., 2015), i.e., an approach to respect the time dimension when defining training and test data. Using the terminology introduced by the authors, we used the following configuration of time ‘gaps’: SGAP=331, GAP=73, EGAP=781, Update=200, Training duration=1,700, and Test duration=400 (all units in days).
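A time-ordered split with gaps can be sketched as a sliding window over days since the first commit. This is one plausible, simplified reading of the configuration above and not a reproduction of the procedure by Tan et al.: in particular, whether the training window slides or grows at each update, and how the gaps are applied at the boundaries, are assumptions of this example, as are the function and parameter names.

```python
def online_windows(last_day, start_gap, gap, end_gap, update, train_days, test_days):
    """Return ((train_start, train_end), (test_start, test_end)) day ranges."""
    windows = []
    start = start_gap  # skip the first days of the project history
    while True:
        train_end = start + train_days
        test_start = train_end + gap  # leave a gap between training and test
        test_end = test_start + test_days
        if test_end > last_day - end_gap:
            break  # the last days are excluded from evaluation
        windows.append(((start, train_end), (test_start, test_end)))
        start += update  # retrain after every update period
    return windows


def split(commits, window):
    """Split (day, features, label) tuples into training and test sets."""
    (tr0, tr1), (te0, te1) = window
    train = [c for c in commits if tr0 <= c[0] < tr1]
    test = [c for c in commits if te0 <= c[0] < te1]
    return train, test


# Small hypothetical configuration (all values in days).
windows = online_windows(last_day=100, start_gap=10, gap=5, end_gap=10,
                         update=20, train_days=30, test_days=10)
```

The property that matters for RQ2 is that every test window starts strictly after its training window ends, so the classifier never sees commits "from the future".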

4.2. Data collection and feature selection

In line with the description in Section 3, we use SZZ Unleashed to extract bug-introducing commits from 12 years of development history in the Jenkins core repository (from Nov 5, 2006 until Feb 20, 2018). Table 1 shows descriptive statistics of the resulting dataset and analogous statistics from five datasets collected by Kamei et al. (Kamei et al., 2013). Bug-introducing commits and bug-fixing commits are listed as ‘Bugs’ and ‘Fixes’, respectively – along with commits that are categorized as both. Percentages show the fraction of ‘Bugs’ and ‘Fixes’ among the total number of commits.

Based on previous work on bug prediction, we represent commits by 16 features as presented in Table 2. Ft1–Ft3 are related to code churn as defined by Nagappan and Ball (Nagappan and Ball, 2005), Ft4–Ft13 were all used by Kamei et al. (Kamei et al., 2013), and Ft14–Ft16 consider coupling as proposed by D’Ambros et al. (D’Ambros et al., 2009). We used code-maat version 1.1 to calculate the values for the coupling features (Tornhill, 2017). Table 2 also shows the relative significance of the 16 features in the random forest classifiers (described next). We observe that the ranking of features is similar to findings from previous work (Moser et al., 2008; Hall et al., 2012).
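Two of the features in Table 2 are simple enough to sketch. The example assumes that Ft7 is the Shannon entropy of the distribution of modified lines across the files touched by a commit, which is the usual definition in the JIT literature, and that Ft1 is a plain ratio; the function names and the handling of empty inputs are assumptions, and the feature extraction in the study may differ in detail.

```python
import math


def change_entropy(lines_changed_per_file):
    """Shannon entropy (base 2) of how a commit's changes spread over files."""
    total = sum(lines_changed_per_file)
    if total == 0:
        return 0.0
    probs = [n / total for n in lines_changed_per_file if n > 0]
    return -sum(p * math.log2(p) for p in probs)


def relative_churn(lines_added, total_lines):
    """Ft1-style ratio: lines of code added divided by total lines of code."""
    return lines_added / total_lines if total_lines else 0.0


# A commit touching a single file is maximally focused; one spreading its
# changes evenly over four files is maximally scattered for that file count.
focused = change_entropy([40])
scattered = change_entropy([10, 10, 10, 10])
```

A commit that changes one file has entropy 0, while evenly spreading changes over four files gives an entropy of 2 bits.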

4.3. Results and discussion

Table 3 shows the classification accuracy of the random forest classifiers for eight different experimental runs. The table shows precision, recall, and F1 score for two evaluation setups: 1) stratified 10-fold cross-validation and 2) online change classification. For both setups, we report results from applying four different sampling techniques. All values are reported with standard deviations.
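For reference, the three reported measures follow from the counts of true positives, false positives, and false negatives for the bug-introducing class. The sketch below uses the standard definitions; the function name, the convention of returning 0.0 for undefined ratios, and the counts in the usage example are assumptions for illustration and are not taken from the study.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall, and F1 for the positive (bug-introducing) class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical counts: 20 correctly flagged commits, 80 false alarms, 80 misses.
p, r, f1 = precision_recall_f1(tp=20, fp=80, fn=80)
```

These measures ignore true negatives, which matters for a dataset with 3.6% positive examples: a classifier that never flags a commit would be correct for 96.4% of the commits, yet its recall and F1 score for the bug-introducing class would be zero.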

We find that oversampling is essential for JIT bug prediction subject to the class imbalance problem (RQ1). Both for cross-validation and online change classification, using SMOTE or SMOTE+Tomek obtains considerably higher F1 scores compared to the baseline. Oversampling results in decreased precision, but also substantial improvements in recall. It is clear that using the baseline sampling leads to an overly conservative classifier (recall $<3\%$) for the highly imbalanced Jenkins dataset. Note that undersampling using Cluster Centroids also improves recall and F1 score, but the resulting precision ($<4\%$) would never be useful in practice – probably the resulting training set contains too few examples for the classifier to learn from.

Our investigation of time-sensitivity largely confirms findings from previous work, i.e., disregarding the time dimension in issue trackers might lead to overly positive classifier evaluations (RQ2). Table 3 shows that, for all three resampling techniques, F1 scores for cross-validation are higher than for online change classification. Jonsson et al. reported that “cross-validation consistently yielded higher prediction accuracy than conducting more realistic evaluations on bug reports sorted by the submission date” (Jonsson et al., 2016) in a study on multi-class classification for bug assignment. Analogous to our work, Tan et al. performed binary classification for JIT bug prediction. They concluded that “cross-validation presents a false impression of higher precisions” (Tan et al., 2015).

On the other hand, our findings partly contrast with previous conclusions. Tan et al. specifically point out that cross-validation results in falsely high precision. In our study, however, we instead observe this phenomenon for recall. For sampling using SMOTE and SMOTE+Tomek, cross-validation obtains roughly twice as high recall as the time-sensitive setup. Thus, we conclude that using cross-validation can lead to overly positive results both for precision and recall – both should be carefully investigated in empirical studies.

On a final note, we do not think JIT bug prediction corresponding to an F1 score of 0.10–0.15 is sufficiently accurate to be of practical value for developers. The false positives would be too many for developers to trust the predictions, and at the same time the classifier would miss too many truly bug-introducing commits. Moreover, the SZZ algorithm has limitations (Rodriguez-Perez et al., 2018) – possibly the classification accuracy could reach the utility break-point if instead a manually annotated training set of commits was used. Nonetheless, such investigations, and a deeper analysis of threats to validity of our small empirical inquiry, are beyond the scope of the illustrative example presented in this section.

5. Conclusion

Numerous software engineering studies rely on the SZZ algorithm. Unfortunately, due to the lack of publicly available tool solutions, most researchers must implement their own versions. While the learning process for the individual researcher might be valuable, we argue that the lack of a public SZZ tool might lead to 1) the community reinventing the wheel, 2) hampered reproducibility, and 3) research results based on non-disclosed SZZ implementations that might contain bugs.

We respond to the call by Rodriguez-Perez et al. (Rodriguez-Perez et al., 2018) and present SZZ Unleashed, an implementation of the SZZ algorithm publicly available on GitHub under an MIT license. SZZ Unleashed is implemented in Java, with some supporting Python scripts, and includes line number mappings – an improvement proposed by Williams and Spacco (Williams and Spacco, 2008). We have already approved the first external pull request and we warmly welcome further contributions from the community.

To illustrate how SZZ Unleashed can be used, both this paper and the GitHub repository are accompanied by an example study of JIT bug prediction using a random forest classifier for the Jenkins project. We report modest classification accuracy (F1 score of roughly 15%), but corroborate two findings from previous work. First, oversampling is essential in JIT bug prediction for highly imbalanced classes. Second, solely presenting results from cross-validation is not appropriate when evaluating classifiers for software engineering data with timestamps – there is a high risk of obtaining an excessively positive classification accuracy.

Acknowledgements.

Our thanks go to Sven Selberg and Axis Communications AB for hosting the MSc thesis project resulting in this paper. This work has been financially supported by the ITEA3 initiative TESTOMAT Project through Vinnova – Sweden’s innovation agency.

Bibliography

Canfora et al. (2007) G. Canfora, L. Cerulo, and M. Di Penta. 2007. Identifying Changed Source Code Lines from Version Repositories. In Proc. of the 4th International Workshop on Mining Software Repositories. https://doi.org/10.1109/MSR.2007.14
Cavalcanti et al. (2014) Y. Cavalcanti, P. Silveira Neto, I. Machado, T. Vale, E. Almeida, and S. Meira. 2014. Challenges and Opportunities for Software Change Request Repositories: A Systematic Mapping Study. Journal of Software: Evolution and Process 26, 7 (2014), 620–653. https://doi.org/10.1002/smr.1639
Correia (2017) J. Correia. 2017. old-szz. https://github.com/intelligentagents/old-szz
Czerwonka et al. (2011) J. Czerwonka, R. Das, N. Nagappan, A. Tarvo, and A. Teterev. 2011. CRANE: Failure Prediction, Change Analysis and Test Prioritization in Practice – Experiences from Windows. In Proc. of the 4th Conference on Software Testing, Verification and Validation. 357–366. https://doi.org/10.1109/ICST.2011.24
D’Ambros et al. (2009) M. D’Ambros, M. Lanza, and R. Robbes. 2009. On the Relationship Between Change Coupling and Software Defects. In 2009 16th Working Conference on Reverse Engineering. 135–144. https://doi.org/10.1109/WCRE.2009.19
de Freitas Farias et al. (2016) M. de Freitas Farias, R. Novais, M. Junior, L. da Silva Carvalho, M. Mendonca, and R. Spinola. 2016. A Systematic Mapping Study on Mining Software Repositories. In Proc. of the 31st Annual ACM Symposium on Applied Computing. 1472–1479. https://doi.org/10.1145/2851613.2851786
Engström et al. (2010) E. Engström, P. Runeson, and M. Skoglund. 2010. A Systematic Review on Regression Test Selection Techniques. Information and Software Technology 52, 1 (2010), 14–30. https://doi.org/10.1016/j.infsof.2009.07.001