Clone Detection on Large Scala Codebases
Wahidur Rahman, Yisen Xu, Fan Pu, Jifeng Xuan, Xiangyang Jia, Michail Basios, Leslie Kanthan, Lingbo Li, Fan Wu, Baowen Xu

TL;DR
This study evaluates the performance of two advanced code clone detection techniques on large Scala codebases, revealing significant differences in effectiveness between open source and industrial projects.
Contribution
It provides the first large-scale industrial evaluation of SourcererCC and AutoenCODE on Scala projects, highlighting their varying performance in real-world settings.
Findings
Both algorithms perform differently on the industrial project than on the open source projects.
The largest drop in precision on the industrial project was 30.7%.
The largest increase in recall on the industrial project was 32.4%.
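For readers less familiar with the two metrics behind these findings, the following is a minimal sketch of how precision and recall can be computed for a clone detector from a manually labelled sample. It is not the authors' evaluation code: the function name `precision_recall`, the fragment labels, and all data are hypothetical, and the sketch reports changes in percentage points, which is an assumption, since the summary does not say whether the reported 30.7% and 32.4% are absolute or relative changes.

```python
def precision_recall(reported, true_clones):
    """Return (precision, recall) for a collection of reported clone pairs.

    Each pair is a 2-tuple of code fragment identifiers. Pairs are treated
    as unordered, so (a, b) and (b, a) count as the same clone pair.
    """
    reported = {frozenset(pair) for pair in reported}
    true_clones = {frozenset(pair) for pair in true_clones}
    hits = reported & true_clones
    precision = len(hits) / len(reported) if reported else 0.0
    recall = len(hits) / len(true_clones) if true_clones else 0.0
    return precision, recall


# Hypothetical labelled samples (illustrative only, not data from the paper).
open_truth = [("A", "B"), ("C", "D"), ("E", "F"), ("G", "H")]
open_reported = [("A", "B"), ("C", "D"), ("E", "F"), ("X", "Y")]

industrial_truth = [("A", "B"), ("C", "D"), ("E", "F"), ("G", "H")]
industrial_reported = [
    ("B", "A"), ("C", "D"), ("E", "F"), ("G", "H"),
    ("X", "Y"), ("P", "Q"), ("R", "S"), ("T", "U"),
]

open_p, open_r = precision_recall(open_reported, open_truth)
ind_p, ind_r = precision_recall(industrial_reported, industrial_truth)

# Change from open source to industrial, in percentage points (assumed unit).
precision_change = (ind_p - open_p) * 100
recall_change = (ind_r - open_r) * 100

print(f"open source: precision={open_p:.2f}, recall={open_r:.2f}")
print(f"industrial:  precision={ind_p:.2f}, recall={ind_r:.2f}")
print(f"change: precision {precision_change:+.1f} pp, recall {recall_change:+.1f} pp")
```

In this toy data the detector reports more pairs on the industrial sample, so it finds every true clone (higher recall) but also more false positives (lower precision), the same direction of change the findings describe.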
Abstract
Code clones are identical or similar code segments. The wide existence of code clones can increase the cost of maintenance and jeopardise the quality of software. The research community has developed many techniques to detect code clones; however, there is little evidence of how these techniques perform in industrial use cases. In this paper, we aim to uncover the differences that arise when such techniques are applied in industrial use cases. We conducted large-scale experimental research on the performance of two state-of-the-art code clone detection techniques, SourcererCC and AutoenCODE, on both open source projects and an industrial project written in the Scala language. Our results reveal that both algorithms perform differently on the industrial project, with the largest drop in precision being 30.7% and the largest increase in recall being 32.4%. By manually labelling samples of the…
Taxonomy
Topics: Software Engineering Research · Advanced Malware Detection Techniques · Web Data Mining and Analysis
