On the Use of Deep Learning Models for Semantic Clone Detection
Subroto Nag Pinku, Debajyoti Mondal, Chanchal K. Roy

TL;DR
This paper evaluates deep learning models for semantic clone detection across multiple datasets to assess their robustness and generalizability, and finds that the cross-language model C4 is the strongest and most consistent performer on SemanticCloneBench.
Contribution
It introduces a multi-step evaluation framework for clone detection models across diverse datasets, including the GPT-assisted dataset GPTCloneBench, and compares the models' robustness using mutation operators.
Findings
Single-language models perform well on BigCloneBench but vary on SemanticCloneBench.
Cross-language model C4 outperforms others on SemanticCloneBench and shows consistent robustness.
Mutation testing reveals C4's superior stability across datasets.
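The idea behind the mutation-based robustness comparison can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: this page does not list the paper's actual mutation operators or its model interfaces, so `rename_identifier` stands in for one plausible semantics-preserving mutation, `token_jaccard_detector` is a toy placeholder for a trained model, and `flip_rate` is a hypothetical stability measure (the fraction of pairs whose prediction changes after mutation).

```python
import re
from typing import Callable, List, Tuple

Pair = Tuple[str, str]


def rename_identifier(code: str, old: str, new: str) -> str:
    """Hypothetical semantics-preserving mutation: rename one identifier.

    Uses whole-word matching so that renaming `x` leaves `xx` untouched.
    """
    return re.sub(rf"\b{re.escape(old)}\b", lambda _m: new, code)


def token_jaccard_detector(a: str, b: str, threshold: float = 0.5) -> bool:
    """Toy stand-in for a clone detector (NOT one of the evaluated models).

    Predicts "clone" when the Jaccard similarity of the token sets
    reaches the threshold.
    """
    ta, tb = set(re.findall(r"\w+", a)), set(re.findall(r"\w+", b))
    if not ta and not tb:
        return True
    return len(ta & tb) / len(ta | tb) >= threshold


def flip_rate(
    detector: Callable[[str, str], bool],
    pairs: List[Pair],
    mutate: Callable[[str], str],
) -> float:
    """Fraction of pairs whose prediction changes when the second
    fragment is mutated. Lower means more stable under the mutation."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    flips = sum(detector(a, b) != detector(a, mutate(b)) for a, b in pairs)
    return flips / len(pairs)


pairs = [
    ("int sum(int x, int y) { return x + y; }",
     "int add(int x, int y) { return x + y; }"),
    ("print(total)", "print(total)"),
]
rate = flip_rate(
    token_jaccard_detector,
    pairs,
    lambda code: rename_identifier(code, "x", "first"),
)
print(rate)
```

A token-overlap detector like this toy one is sensitive to renaming, which is the kind of instability a mutation-based comparison is meant to expose; a model that captures semantics would be expected to keep its prediction when the mutation preserves behavior.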
Abstract
Detecting and tracking code clones can ease various software development and maintenance tasks when changes in a code fragment should be propagated over all its copies. Several deep learning-based clone detection models have appeared in the literature for detecting syntactic and semantic clones, widely evaluated with the BigCloneBench dataset. However, class imbalance and the small number of semantic clones make BigCloneBench less ideal for interpreting model performance. Researchers also use other datasets such as GoogleCodeJam, OJClone, and SemanticCloneBench to understand model generalizability. To overcome the limitations of existing datasets, the GPT-assisted semantic and cross-language clone dataset GPTCloneBench has been released. However, how these models compare across datasets remains unclear. In this paper, we propose a multi-step evaluation approach for five state-of-the-art…
Taxonomy
Topics: Network Security and Intrusion Detection · Spam and Phishing Detection
