GPTCloneBench: A comprehensive benchmark of semantic clones and cross-language clones using GPT-3 model and SemanticCloneBench
Ajmain Inqiad Alam, Palash Ranjan Roy, Farouq Al-omari, Chanchal Kumar Roy, Banani Roy, Kevin Schneider

TL;DR
This paper introduces GPTCloneBench, a large-scale benchmark for semantic and cross-language code clones generated using GPT-3 and existing datasets, addressing limitations of prior benchmarks and supporting multiple programming languages.
Contribution
The work presents GPTCloneBench, a benchmark of semantic and cross-language code clones created through GPT-3 prompt engineering and extensive validation. It is larger and more diverse than previous datasets.
Findings
GPTCloneBench contains 37,149 true semantic clone pairs.
The benchmark includes 20,770 cross-language clone pairs.
It is 15 times larger than SemanticCloneBench.
Abstract
With the emergence of Machine Learning, there has been a surge in leveraging its capabilities for problem-solving across various domains. In the code clone realm, the identification of type-4 or semantic clones has emerged as a crucial yet challenging task. Researchers aim to utilize Machine Learning to tackle this challenge, often relying on the BigCloneBench dataset. However, it's worth noting that BigCloneBench, originally not designed for semantic clone detection, presents several limitations that hinder its suitability as a comprehensive training dataset for this specific purpose. Furthermore, CLCDSA dataset suffers from a lack of reusable examples aligning with real-world software systems, rendering it inadequate for cross-language clone detection approaches. In this work, we present a comprehensive semantic clone and cross-language clone benchmark, GPTCloneBench by exploiting…
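For readers unfamiliar with the term, a type-4 (semantic) clone is a pair of code fragments that compute the same result with different syntax or structure. The sketch below is a hand-written illustration only; it is not drawn from GPTCloneBench, and the function names are hypothetical.

```python
# Illustrative type-4 (semantic) clone pair: same behavior, different structure.
# These functions are made up for explanation and do not come from the benchmark.

def sum_of_squares_loop(values):
    """Accumulate the sum of squares with an explicit loop."""
    total = 0
    for v in values:
        total += v * v
    return total


def sum_of_squares_functional(values):
    """Compute the same sum using map and the built-in sum."""
    return sum(map(lambda v: v ** 2, values))


if __name__ == "__main__":
    sample = [1, 2, 3]
    # Both fragments agree on the same input despite sharing little syntax.
    print(sum_of_squares_loop(sample) == sum_of_squares_functional(sample))
```

A token-based or text-based detector would likely see little overlap between the two functions, which is part of why semantic clones are considered hard to detect.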
Taxonomy
Topics: Software Engineering Research · Advanced Malware Detection Techniques
