# Revisiting Unsupervised Learning for Defect Prediction

**Authors:** Wei Fu, Tim Menzies

arXiv: 1703.00132 · 2017-06-27

## TL;DR

This study reevaluates unsupervised versus supervised methods for effort-aware just-in-time defect prediction. When results are analyzed project by project, supervised models perform better, though not outstandingly so, which leaves room for combinations of unsupervised learners to reach comparable performance.

## Contribution

The paper repeats the analysis behind the FSE'16 claim by Yang et al. that unsupervised defect predictors outperform supervised ones. It shows that the claim does not hold once the variability among unsupervised predictors is considered and results are evaluated per project rather than grouped across projects.

## Key findings

- Supervised predictors generally outperform unsupervised ones on a project-by-project basis.
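- The efficacy of the unsupervised predictors varies widely, so some supervised (labelled) data is still required to prune the weaker predictors.
- Supervised predictors were not outstandingly better, so some combination of unsupervised learners may achieve performance comparable to supervised ones.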
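
The pruning idea in the second finding can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes each unsupervised predictor ranks code changes by a single metric in ascending order, and that predictors are scored by the fraction of defective changes found within 20% of the total inspection effort. The metric names (`la`, `nf`), the `effort` and `defective` fields, and all function names are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical record for one code change: metric values, inspection
# effort, and a defect label (1 = defective, 0 = not defective).
Change = Dict[str, float]


def rank_by_metric(changes: List[Change], metric: str) -> List[int]:
    """Unsupervised predictor (assumed form): inspect changes in
    ascending order of a single metric. Needs no labels."""
    return sorted(range(len(changes)), key=lambda i: changes[i][metric])


def recall_at_effort(changes: List[Change], order: List[int],
                     budget: float = 0.2) -> float:
    """Fraction of defective changes found when inspecting changes in
    `order` until `budget` of the total effort would be exceeded."""
    total_effort = sum(c["effort"] for c in changes)
    total_defects = sum(c["defective"] for c in changes)
    if total_effort == 0 or total_defects == 0:
        return 0.0
    spent, found = 0.0, 0.0
    for i in order:
        if spent + changes[i]["effort"] > budget * total_effort:
            break
        spent += changes[i]["effort"]
        found += changes[i]["defective"]
    return found / total_defects


def select_metric(labelled: List[Change], metrics: List[str],
                  budget: float = 0.2) -> Tuple[str, Dict[str, float]]:
    """Pruning step: use labelled data to score every single-metric
    predictor and keep the best one."""
    scores = {m: recall_at_effort(labelled, rank_by_metric(labelled, m), budget)
              for m in metrics}
    best = max(metrics, key=lambda m: scores[m])
    return best, scores


# Toy labelled data (invented for illustration only).
labelled = [
    {"la": 1, "nf": 5, "effort": 5, "defective": 1},
    {"la": 2, "nf": 4, "effort": 5, "defective": 1},
    {"la": 50, "nf": 1, "effort": 45, "defective": 0},
    {"la": 60, "nf": 2, "effort": 45, "defective": 0},
]
best, scores = select_metric(labelled, ["la", "nf"])
print(best, scores)
```

On the toy data the two predictors differ sharply, which is the point of the finding: without labels there is no way to tell in advance which single-metric ranking to trust.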

## Abstract

Collecting quality data from software projects can be time-consuming and expensive. Hence, some researchers explore "unsupervised" approaches to quality prediction that do not require labelled data. An alternate technique is to use "supervised" approaches that learn models from project data labelled with, say, "defective" or "not-defective". Most researchers use these supervised models since, it is argued, they can exploit more knowledge of the projects.

At FSE'16, Yang et al. reported startling results where unsupervised defect predictors outperformed supervised predictors for effort-aware just-in-time defect prediction. If confirmed, these results would lead to a dramatic simplification of a seemingly complex task (data mining) that is widely explored in the software engineering literature.

This paper repeats and refutes those results as follows. (1) There is much variability in the efficacy of the Yang et al. predictors, so even with their approach, some supervised data is required to prune weaker predictors away. (2) Their findings were grouped across $N$ projects. When we repeat their analysis on a project-by-project basis, supervised predictors are seen to work better.

Even though this paper rejects the specific conclusions of Yang et al., we still endorse their general goal. In our experiments, supervised predictors did not perform outstandingly better than unsupervised ones for effort-aware just-in-time defect prediction. Hence, there may indeed be some combination of unsupervised learners that achieves comparable performance to supervised ones. We therefore encourage others to work in this promising area.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1703.00132/full.md

## Figures

16 figures with captions in the complete paper: https://tomesphere.com/paper/1703.00132/full.md

## References

56 references — full list in the complete paper: https://tomesphere.com/paper/1703.00132/full.md

---
Source: https://tomesphere.com/paper/1703.00132