A Comparison of Similarity Based Instance Selection Methods for Cross Project Defect Prediction

Seyedrebvar Hosseini; Burak Turhan

arXiv:2104.01024·cs.LG·April 5, 2021


TL;DR

This study compares locality sensitive hashing (LSH), NN-filter, and genetic instance selection (GIS) methods for cross-project defect prediction, finding LSH generally outperforms NN-filter in effectiveness and efficiency, especially in recall-focused scenarios.

Contribution

The paper provides a comprehensive comparison of LSH, NN-filter, and GIS for CPDP, demonstrating LSH's superior performance and lower computational cost across multiple datasets and learners.

Findings

01

LSH outperforms NN-filter in F-measure and recall.

02

LSH has lower computational overhead than exact neighborhood methods.

03

NN-filter performs better only when precision is prioritized.
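The findings above are stated in terms of precision, recall, and F-measure. As a reminder of how these relate, here is a minimal sketch that computes them from binary labels, assuming 1 marks a defective module and 0 a clean one. The function name `precision_recall_f1` is illustrative and does not come from the paper.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, F-measure) for binary labels (1 = defective)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # defects correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # clean modules flagged as defective
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # defects that were missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-measure is the harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

A selection method that keeps more training instances resembling the target project tends to flag more modules, which can raise recall at the cost of precision. This is one plausible reading of why the ranking of methods changes with the metric that is prioritized.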

Abstract

Context: Previous studies have shown that training data instance selection based on nearest neighborhood (NN) information can lead to better performance in cross project defect prediction (CPDP) by reducing heterogeneity in training datasets. However, neighborhood calculation is computationally expensive, and approximate methods such as Locality Sensitive Hashing (LSH) can be as effective as exact methods. Aim: We aim to compare instance selection methods for CPDP, namely LSH, NN-filter, and Genetic Instance Selection (GIS). Method: We conduct experiments with five base learners, optimizing their hyperparameters, on 13 datasets from the PROMISE repository in order to compare the performance of LSH with the benchmark instance selection methods NN-filter and GIS. Results: The statistical tests show six distinct groups for F-measure performance. The top two groups contain only LSH and GIS…
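To make the contrast between exact and approximate neighborhood selection concrete, the following is a rough sketch and not the authors' implementation. It assumes the common formulation of NN-filter (for each target instance, keep its k nearest training instances by Euclidean distance and take the union) and, for LSH, assumes random-hyperplane hashing, which is only one of several LSH families and may differ from the variant used in the paper. The names `nn_filter` and `lsh_filter`, and the parameters `k`, `n_planes`, and `seed`, are hypothetical.

```python
import math
import random


def nn_filter(train, target, k=10):
    """Exact selection: union of the k nearest training rows for each target row."""
    selected = set()
    for t in target:
        # Rank every training row by Euclidean distance to this target row.
        ranked = sorted(range(len(train)), key=lambda i: math.dist(train[i], t))
        selected.update(ranked[:k])
    return sorted(selected)


def lsh_filter(train, target, n_planes=4, seed=0):
    """Approximate selection: keep training rows that share a hash bucket with a target row.

    Assumption: random hyperplanes through the origin, so features should be
    centered (e.g. standardized) beforehand for the buckets to be meaningful.
    """
    rng = random.Random(seed)
    dim = len(train[0])
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]

    def signature(row):
        # One bit per hyperplane: which side of the plane the row falls on.
        return tuple(sum(w * x for w, x in zip(plane, row)) >= 0 for plane in planes)

    buckets = {}
    for i, row in enumerate(train):
        buckets.setdefault(signature(row), []).append(i)

    selected = set()
    for t in target:
        selected.update(buckets.get(signature(t), []))
    return sorted(selected)
```

The exact filter computes a distance from every target row to every training row, while the hashing variant computes one signature per row and a dictionary lookup per target row. That difference is the kind of saving the abstract refers to when it calls neighborhood calculation computationally expensive.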
