A Comparison of Similarity Based Instance Selection Methods for Cross Project Defect Prediction
Seyedrebvar Hosseini, Burak Turhan

TL;DR
This study compares locality sensitive hashing (LSH), NN-filter, and genetic instance selection (GIS) methods for cross-project defect prediction, finding LSH generally outperforms NN-filter in effectiveness and efficiency, especially in recall-focused scenarios.
Contribution
The paper compares LSH, NN-filter, and GIS for CPDP on 13 datasets from the PROMISE repository with five base learners, reporting that LSH performs better than NN-filter at a lower computational cost.
Findings
LSH outperforms NN-filter in F-measure and recall.
LSH has lower computational overhead than exact neighborhood methods.
NN-filter performs better only when precision is prioritized.
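The findings above turn on three related metrics, so a small sketch of how they are computed from confusion-matrix counts may help. The summary does not say which F-measure variant the paper uses; the balanced F1 (harmonic mean of precision and recall) is assumed here, and the function name and the example counts are purely illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts.

    tp: defective modules predicted defective
    fp: clean modules predicted defective
    fn: defective modules predicted clean
    Assumption: "F-measure" means the balanced F1 score.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: a predictor that flags many modules gains recall
# but loses precision to false alarms.
p, r, f = precision_recall_f1(tp=8, fp=8, fn=2)
print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

A method can therefore lead on recall and F-measure while trailing on precision, which is consistent with NN-filter being preferable only when precision is the priority.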
Abstract
Context: Previous studies have shown that training data instance selection based on nearest neighborhood (NN) information can lead to better performance in cross project defect prediction (CPDP) by reducing heterogeneity in training datasets. However, neighborhood calculation is computationally expensive, and approximate methods such as Locality Sensitive Hashing (LSH) can be as effective as exact methods. Aim: We aim to compare instance selection methods for CPDP, namely LSH, NN-filter, and Genetic Instance Selection (GIS). Method: We conduct experiments with five base learners, optimizing their hyperparameters, on 13 datasets from the PROMISE repository in order to compare the performance of LSH with the benchmark instance selection methods NN-filter and GIS. Results: The statistical tests show six distinct groups for F-measure performance. The top two groups contain only LSH and GIS…
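To make the contrast between exact and approximate neighborhood selection concrete, here is a minimal sketch of both ideas. It is not the authors' implementation: the function names, the neighborhood size `k=10`, Euclidean distance, and the random-hyperplane hashing scheme with `n_bits` bits are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import numpy as np


def nn_filter(train_X, test_X, k=10):
    """Exact selection: indices of training rows that are among the k
    nearest (Euclidean) neighbors of at least one test row."""
    # Pairwise squared distances, shape (n_test, n_train).
    d2 = ((test_X[:, None, :] - train_X[None, :, :]) ** 2).sum(axis=2)
    k = min(k, train_X.shape[0])
    nearest = np.argsort(d2, axis=1)[:, :k]
    return np.unique(nearest)


def lsh_filter(train_X, test_X, n_bits=8, seed=0):
    """Approximate selection: indices of training rows whose hash bucket
    is shared with at least one test row (random-hyperplane signatures)."""
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(train_X.shape[1], n_bits))
    weights = 1 << np.arange(n_bits, dtype=np.int64)
    # One bit per hyperplane (which side the row falls on), packed into an int.
    train_keys = (train_X @ planes > 0).astype(np.int64) @ weights
    test_keys = (test_X @ planes > 0).astype(np.int64) @ weights
    return np.flatnonzero(np.isin(train_keys, test_keys))


# Hypothetical data standing in for cross-project (train) and target (test) metrics.
rng = np.random.default_rng(42)
train_X = rng.normal(size=(200, 5))
test_X = rng.normal(size=(20, 5))
print(len(nn_filter(train_X, test_X, k=10)), "rows kept by the exact filter")
print(len(lsh_filter(train_X, test_X)), "rows kept by the hashing filter")
```

The exact filter computes every test-to-train distance, whereas the hashing filter only computes one signature per row and compares bucket keys, which is the kind of saving the abstract refers to when it calls neighborhood calculation computationally expensive.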
