A Comparative Study of Various Distance Measures for Software fault prediction

Deepinder Kaur

arXiv:1411.7474·cs.SE·December 1, 2014

TL;DR

This paper compares Euclidean, Sorensen, and Canberra distance measures in K-means clustering for software fault prediction, demonstrating Sorensen's superior performance on NASA datasets through ROC analysis.

Contribution

It provides an empirical comparison of three distance measures in K-means clustering for software fault prediction, highlighting Sorensen's effectiveness.

Findings

01

Sorensen distance outperforms Euclidean and Canberra in fault prediction accuracy.

02

K-means with Sorensen distance yields better ROC curve results.

03

The study uses NASA MDP datasets for evaluation.

Abstract

Various distance measures have been used to predict software faults efficiently at early stages of software development. A common approach to software fault prediction, favored for its computational efficiency, is K-means clustering, which partitions a dataset into K clusters using a chosen distance measure. Distance measures quantify how similar data objects are, which helps in developing efficient algorithms for clustering and classification. In this paper, we study K-means clustering with three distance measures (Euclidean, Sorensen, and Canberra) on datasets collected from the NASA MDP (Metrics Data Program). Results are presented as ROC curves. The experimental results show that K-means clustering with Sorensen distance performs better than with Euclidean or Canberra distance.
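
To make the comparison concrete, the following is a minimal sketch of the three distance measures plugged into a basic K-means loop. It assumes the standard textbook definitions (Sorensen taken as the Bray-Curtis form, which expects non-negative features such as software metrics); the abstract does not state the exact formulas or initialization used in the paper, so the function names `euclidean`, `sorensen`, `canberra`, and `kmeans`, the random initialization, and the mean-based centroid update are all illustrative assumptions.

```python
import math
import random


def euclidean(x, y):
    # Square root of the summed squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def sorensen(x, y):
    # Sorensen (Bray-Curtis) distance: sum|a - b| / sum(a + b).
    # Assumes non-negative features; returns 0.0 when both vectors are all zeros.
    denom = sum(a + b for a, b in zip(x, y))
    return sum(abs(a - b) for a, b in zip(x, y)) / denom if denom else 0.0


def canberra(x, y):
    # Canberra distance: sum of |a - b| / (|a| + |b|) over coordinates.
    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom:  # terms of the form 0/0 are treated as 0 by convention
            total += abs(a - b) / denom
    return total


def kmeans(points, k, distance, iterations=100, seed=0):
    # Assumption: centroids start as k randomly sampled points and are
    # updated as coordinate-wise means. The mean minimizes squared Euclidean
    # distance only, so for other measures this update is a heuristic.
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: recompute each non-empty cluster's centroid.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


if __name__ == "__main__":
    # Hypothetical toy data standing in for module-level software metrics.
    data = [[1, 1], [1, 2], [2, 1], [10, 10], [10, 11], [11, 10]]
    for name, fn in [("euclidean", euclidean), ("sorensen", sorensen), ("canberra", canberra)]:
        labels, _ = kmeans(data, k=2, distance=fn)
        print(name, labels)
```

Passing the distance as a function keeps the clustering loop identical across the three runs, so any difference in the resulting clusters comes from the measure alone. With K = 2, one cluster can be treated as fault-prone and the other as not fault-prone, and the true and false positive rates against known fault labels give a point in ROC space.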
