How the Misuse of a Dataset Harmed Semantic Clone Detection
Jens Krinke, Chaiyong Ragkhitwetsagul

TL;DR
This paper reveals that BigCloneBench, a widely used dataset for semantic clone detection evaluation, contains significant mislabeling issues that compromise the validity of many research results based on it.
Contribution
It provides a detailed analysis of BigCloneBench's flaws for semantic clone detection and highlights the impact on previous research and evaluation practices.
Findings
93% of a random sample of 406 Weak Type-3/Type-4 clone pairs are mislabelled: the paired code fragments do not have similar functionality
139 out of 179 papers using BigCloneBench for semantic evaluation are affected by mislabelling
Misuse of BigCloneBench leads to overestimated F1 scores and misleading conclusions
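To give a sense of how precise the 93% figure is, the sketch below estimates a margin of error for a proportion of 0.93 observed in 406 pairs. It is an illustration, not the authors' procedure. It assumes a 95% confidence level and the simple normal-approximation (Wald) formula, neither of which is stated in this summary, and the function name `margin_of_error` is hypothetical.

```python
import math


def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation (Wald) interval for a proportion.

    p: observed proportion, n: sample size,
    z: critical value (1.96 corresponds to 95% confidence; an assumption here).
    """
    return z * math.sqrt(p * (1.0 - p) / n)


# Figures reported in the summary: 93% of 406 sampled pairs are mislabelled.
p_mislabelled = 0.93
n_pairs = 406

moe = margin_of_error(p_mislabelled, n_pairs)
low, high = p_mislabelled - moe, p_mislabelled + moe
print(f"estimated mislabelling rate: {p_mislabelled:.2f} "
      f"(approx. interval {low:.3f} to {high:.3f})")
```

Under these assumptions the margin comes out at roughly 2.5 percentage points, so the finding that a large majority of the sampled weak clone pairs are mislabelled does not hinge on sampling noise.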
Abstract
BigCloneBench is a well-known and widely used large-scale dataset for the evaluation of recall of clone detection tools. It has been beneficial for research on clone detection and has become a standard in evaluating the performance of clone detection tools. More recently, it has also been widely used as a dataset to evaluate machine learning approaches to semantic clone detection or code similarity detection for functional or semantic similarity. This paper demonstrates that BigCloneBench is problematic to use as ground truth for learning or evaluating semantic code similarity, and highlights the aspects of BigCloneBench that affect the ground truth quality. A manual investigation of a statistically significant random sample of 406 Weak Type-3/Type-4 clone pairs revealed that 93% of them do not have a similar functionality and are therefore mislabelled. In a literature review of 179…
Taxonomy
Topics: Software Engineering Research · Authorship Attribution and Profiling · Academic integrity and plagiarism
