CC2Vec: Combining Typed Tokens with Contrastive Learning for Effective Code Clone Detection
Shihan Dou, Yueming Wu, Haoxiang Jia, Yuhao Zhou, Yan Liu, Yang Liu

TL;DR
CC2Vec is a code encoding approach that combines typed tokens with contrastive learning. It detects simple code clones quickly and, after fine-tuning, performs comparably to semantic clone detectors while being significantly more efficient than existing methods.
Contribution
Introduces CC2Vec, a new code clone detection method that encodes typed tokens with self-attention and uses contrastive learning to improve semantic clone detection.
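To illustrate what a contrastive objective does in this setting, here is a minimal sketch of a generic InfoNCE-style loss over embedding vectors. It is not taken from the paper: the function names `cosine` and `info_nce`, the use of cosine similarity, and the `temperature` value are all assumptions chosen for illustration, and the paper's actual loss may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE-style loss (an assumption, not the paper's exact loss).

    The loss is small when `anchor` is closer to `positive` (a clone)
    than to every vector in `negatives` (non-clones).
    """
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# A clone pair that is already aligned gives a much smaller loss
# than a pair whose embeddings point in different directions.
aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

Minimizing a loss of this shape pulls the embeddings of clone pairs together and pushes non-clone pairs apart, which is the general idea behind using contrastive learning for semantic clones.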
Findings
Effectively detects simple code clones.
Achieves comparable performance to semantic clone detectors after fine-tuning.
Significantly surpasses existing methods in detection efficiency.
Abstract
With the development of the open source community, code is often copied, spread, and evolved across multiple software systems, which brings uncertainty and risk to those systems (e.g., bug propagation and copyright infringement). Therefore, it is important to conduct code clone detection to discover similar code pairs. Many approaches have been proposed to detect code clones; among them, token-based tools can scale to big code. However, because they discard program details, they cannot handle more complicated code clones, i.e., semantic code clones. In this paper, we introduce CC2Vec, a novel code encoding method designed to swiftly identify simple code clones while also enhancing the capability for semantic code clone detection. To retain the program details between tokens, CC2Vec divides them into different categories (i.e., typed tokens) according to the syntactic types and then applies…
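The idea of dividing tokens into categories by syntactic type can be sketched as follows. This is only an illustration under stated assumptions: the category names (`keyword`, `identifier`, `literal`, `operator`, `separator`), the small `KEYWORDS` set, and the regex tokenizer are hypothetical and are not the paper's actual type set or tooling.

```python
import re
from collections import defaultdict

# Hypothetical keyword list; a real tool would use the full language grammar.
KEYWORDS = {"int", "return", "if", "else", "for", "while",
            "public", "static", "void"}

# Assumed token categories, matched in order of alternation.
TOKEN_RE = re.compile(
    r"(?P<literal>\d+(?:\.\d+)?|\"[^\"]*\")"
    r"|(?P<word>[A-Za-z_]\w*)"
    r"|(?P<operator>[+\-*/%=<>!&|]+)"
    r"|(?P<separator>[()\[\]{};,.])"
)


def typed_tokens(source):
    """Group the tokens of `source` by an assumed syntactic category."""
    groups = defaultdict(list)
    for match in TOKEN_RE.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind == "word":
            # Split words into keywords and identifiers.
            kind = "keyword" if text in KEYWORDS else "identifier"
        groups[kind].append(text)
    return dict(groups)


groups = typed_tokens("int add(int a, int b) { return a + b; }")
# groups["keyword"] holds the keywords, groups["identifier"] the names, etc.
```

Keeping each category separate preserves some syntactic information that a flat bag of tokens would lose, which is the motivation the abstract gives for typed tokens.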
Taxonomy
Topics: Advanced Malware Detection Techniques · Web Application Security Vulnerabilities · Software Testing and Debugging Techniques
