Using a Nearest-Neighbour, BERT-Based Approach for Scalable Clone Detection
Muslim Chochlov (1), Gul Aftab Ahmed (2), James Vincent Patten (1), Guoxian Lu (3), Wei Hou (4), David Gregg (2), Jim Buckley (1) ((1) Department of Computer Science and Information Systems, University of Limerick, Ireland, (2) Department of Computer Science, Trinity College Dublin

TL;DR
This paper introduces SSCD, a scalable BERT-based clone detection method that identifies inexact (Type 3 and Type 4) code clones in large codebases by computing an embedding per code fragment and running a nearest-neighbour search, outperforming tools such as SAGA and SourcererCC.
Contribution
The paper presents SSCD, a BERT-based clone detection approach that targets high recall of Type 3 and Type 4 clones and scales to industrial-size codebases.
Findings
SSCD outperforms state-of-the-art clone detection tools like SAGA and SourcererCC.
SSCD can process 320 million lines of code in under three hours.
Shorter input lengths and text-only models enhance efficiency with minimal effectiveness loss.
Abstract
Code clones can detrimentally impact software maintenance and manually detecting them in very large codebases is impractical. Additionally, automated approaches find detection of Type 3 and Type 4 (inexact) clones very challenging. While the most recent artificial deep neural networks (for example BERT-based artificial neural networks) seem to be highly effective in detecting such clones, their pairwise comparison of every code pair in the target system(s) is inefficient and scales poorly on large codebases. We therefore introduce SSCD, a BERT-based clone detection approach that targets high recall of Type 3 and Type 4 clones at scale (in line with our industrial partner's requirements). It does so by computing a representative embedding for each code fragment and finding similar fragments using a nearest neighbour search. SSCD thus avoids the pairwise-comparison bottleneck of other…
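The embed-then-search idea described in the abstract can be illustrated with a minimal sketch. This is not SSCD's implementation: `embed` is a hypothetical token-hashing stand-in for a BERT encoder, the brute-force cosine scan stands in for a proper nearest-neighbour index, and the names `find_clone_candidates`, `DIM`, `k`, and `threshold` are assumptions made for illustration. The point it shows is structural: each fragment is encoded once, and candidate clones are then found by comparing cheap vectors rather than running a model on every pair of fragments.

```python
import math
import re
import zlib
from typing import Dict, List, Tuple

DIM = 64  # assumed toy embedding size; real BERT encoders emit much larger vectors


def embed(fragment: str, dim: int = DIM) -> List[float]:
    """Hypothetical stand-in for a BERT encoder: hash tokens into a unit vector."""
    vec = [0.0] * dim
    # Split into identifiers, numbers, and single punctuation characters.
    for token in re.findall(r"[A-Za-z_]\w*|\d+|\S", fragment):
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # avoid dividing by zero
    return [x / norm for x in vec]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


def find_clone_candidates(
    fragments: Dict[str, str], k: int = 1, threshold: float = 0.9
) -> Dict[str, List[Tuple[str, float]]]:
    """For each fragment, return up to k nearest neighbours above threshold."""
    # Step 1: encode every fragment exactly once.
    embeddings = {name: embed(src) for name, src in fragments.items()}
    candidates = {}
    # Step 2: nearest-neighbour search (brute force here; an index in practice).
    for name, query in embeddings.items():
        scored = sorted(
            (
                (cosine(query, other), other_name)
                for other_name, other in embeddings.items()
                if other_name != name
            ),
            reverse=True,
        )
        candidates[name] = [(n, s) for s, n in scored[:k] if s >= threshold]
    return candidates


fragments = {
    "a": "int total = 0; for (int i = 0; i < n; i++) { total += xs[i]; }",
    "b": "int total=0;\nfor (int i=0; i<n; i++) {\n  total += xs[i];\n}",
    "c": "logger.info(message); return null;",
}
for name, matches in find_clone_candidates(fragments).items():
    print(name, matches)
```

In this toy setup, fragments `a` and `b` differ only in whitespace, so they tokenize identically and land on the same vector. Catching the Type 3 and Type 4 clones the paper targets would depend on the learned encoder, which this stand-in does not attempt to reproduce.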
