SourcererCC and SourcererCC-I: Tools to Detect Clones in Batch mode and During Software Development
Vaibhav Saini, Hitesh Sajnani, Jaewoo Kim, Cristina Lopes

TL;DR
This paper introduces SourcererCC and SourcererCC-I, scalable clone detection tools for large code repositories and real-time development, achieving high accuracy and efficiency in detecting various clone types.
Contribution
The paper presents a scalable, token-based clone detector and an Eclipse plugin that together enable efficient detection of near-miss clones in large-scale software repositories and during development.
Findings
SourcererCC scales to 250 million lines of code on standard hardware.
Achieves 86% precision and 86-100% recall in clone detection.
Outperforms existing tools in large-scale clone detection.
Abstract
Given the availability of large source-code repositories, there has been a large number of applications for large-scale clone detection. Unfortunately, despite a decade of active research, there is a marked lack in clone detectors that scale to big software systems or large repositories, specifically for detecting near-miss (Type 3) clones where significant editing activities may take place in the cloned code. This paper demonstrates: (i) SourcererCC, a token-based clone detector that targets the first three clone types, and exploits an index to achieve scalability to large inter-project repositories using a standard workstation. It uses an optimized inverted-index to quickly query the potential clones of a given code block. Filtering heuristics based on token ordering are used to significantly reduce the size of the index, the number of code-block comparisons needed to detect the…
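The index-and-filter idea in the abstract can be illustrated with a minimal sketch. This is not the tool's implementation. It assumes a multiset token-overlap similarity with a threshold `theta`, a global rare-first token ordering, and a prefix-style filter in which only the first few tokens of each ordered block are indexed and queried. The names `tokenize`, `overlap`, `detect_clones`, and `theta`, and the regex tokenizer, are all hypothetical choices for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(code):
    # Hypothetical tokenizer: identifiers, keywords, and integer literals only.
    return re.findall(r"[A-Za-z_]\w*|\d+", code)


def overlap(a, b):
    # Multiset overlap: each token counts min(occurrences in a, occurrences in b).
    return sum((Counter(a) & Counter(b)).values())


def detect_clones(blocks, theta=0.8):
    """Return (earlier_id, later_id) pairs whose token overlap is at least
    ceil(theta * max(len_a, len_b)). Assumed similarity, for illustration."""
    bags = {bid: tokenize(src) for bid, src in blocks.items()}

    # Global token frequency defines one ordering shared by all blocks:
    # rare tokens first, ties broken alphabetically.
    freq = Counter(t for bag in bags.values() for t in bag)
    ordered = {
        bid: sorted(bag, key=lambda t: (freq[t], t)) for bid, bag in bags.items()
    }

    index = defaultdict(set)  # token -> ids of blocks with that token in their prefix
    clones = []
    for bid, toks in ordered.items():
        if not toks:
            continue
        # Two blocks that meet the threshold must share a token within
        # their prefixes, so only the prefix is indexed and queried.
        prefix_len = len(toks) - math.ceil(theta * len(toks)) + 1
        prefix = toks[:prefix_len]

        candidates = set()
        for t in prefix:
            candidates |= index[t]

        # Verify each candidate with the full overlap computation.
        for cid in sorted(candidates):
            other = ordered[cid]
            needed = math.ceil(theta * max(len(toks), len(other)))
            if overlap(toks, other) >= needed:
                clones.append((cid, bid))

        for t in prefix:
            index[t].add(bid)
    return clones


if __name__ == "__main__":
    blocks = {
        "a": "int sum(int x, int y) { int r = x + y; return r; }",
        "b": "int sum(int x, int y) { int t = x + y; return t; }",
        "c": "void log(String msg) { System.out.println(msg); }",
    }
    print(detect_clones(blocks, theta=0.8))
```

In this sketch, blocks "a" and "b" differ only in one renamed variable and are reported as a pair at `theta=0.8`, while "c" never becomes a candidate because its prefix shares no token with the others. Ordering tokens rare-first keeps the prefixes discriminative, so the index stays small and few candidate pairs reach the full comparison.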
Taxonomy
Topics: Software Engineering Research · Advanced Malware Detection Techniques · Software Testing and Debugging Techniques
