Improving Source Code Similarity Detection Through GraphCodeBERT and Integration of Additional Features
Jorge Martinez-Gil

TL;DR
This paper improves source code similarity detection by fusing an execution-derived behavioral feature with GraphCodeBERT, with the largest gains on code pairs that are semantically equivalent but syntactically different.
Contribution
It introduces a novel method to incorporate execution-derived behavioral features into GraphCodeBERT for better semantic code similarity detection.
Findings
Consistent improvements in precision, recall, and F1 over the unmodified GraphCodeBERT baseline on established clone detection benchmarks.
Largest gains observed on semantically equivalent but syntactically dissimilar code pairs.
Demonstrated effectiveness of combining behavioral signals with transformer representations.
Abstract
This paper investigates source code similarity detection using a transformer model augmented with an execution-derived signal. We extend GraphCodeBERT with an explicit, low-dimensional behavioral feature that captures observable agreement between code fragments, and fuse this signal with the pooled transformer representation through a trainable output head. We compute behavioral agreement via output comparisons under a fixed test suite and use this observed output agreement as an operational approximation of semantic similarity between code pairs. The resulting feature acts as an explicit behavioral signature that complements token- and graph-based representations. Experiments on established clone detection benchmarks show consistent improvements in precision, recall, and F1 over the unmodified GraphCodeBERT baseline, with the largest gains on semantically equivalent but…
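To make the idea in the abstract concrete, the following is a minimal sketch of (a) an output-agreement feature computed over a fixed test suite and (b) a simple head that fuses it with a pooled embedding. It is an illustration under stated assumptions, not the paper's implementation: the function names (`behavioral_agreement`, `fused_score`), the choice of concatenation followed by a linear layer and sigmoid, the handling of exceptions as outputs, and the hand-set weights and placeholder embedding are all hypothetical. In a real setup the pooled vector would come from GraphCodeBERT and the head's parameters would be learned.

```python
import math
from typing import Any, Callable, Sequence, Tuple


def _run(fn: Callable[..., Any], args: Tuple[Any, ...]) -> Tuple[str, Any]:
    """Run fn on args and return an observable outcome.

    Assumption: a raised exception counts as an output, identified by its type name.
    """
    try:
        return ("ok", fn(*args))
    except Exception as exc:
        return ("error", type(exc).__name__)


def behavioral_agreement(
    f: Callable[..., Any],
    g: Callable[..., Any],
    test_inputs: Sequence[Tuple[Any, ...]],
) -> float:
    """Fraction of inputs in a fixed test suite on which f and g agree."""
    if not test_inputs:
        raise ValueError("test suite must not be empty")
    matches = sum(1 for args in test_inputs if _run(f, args) == _run(g, args))
    return matches / len(test_inputs)


def fused_score(
    pooled: Sequence[float],
    agreement: float,
    weights: Sequence[float],
    bias: float,
) -> float:
    """Hypothetical output head: concatenate [pooled; agreement], apply linear + sigmoid."""
    features = list(pooled) + [agreement]
    if len(weights) != len(features):
        raise ValueError("weights must have len(pooled) + 1 entries")
    logit = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Two syntactically different but behaviorally equivalent fragments, plus a buggy one.
def sum_loop(n: int) -> int:
    total = 0
    for i in range(n + 1):
        total += i
    return total


def sum_formula(n: int) -> int:
    return n * (n + 1) // 2


def sum_off_by_one(n: int) -> int:
    return sum(range(n))  # omits n itself


TEST_SUITE = [(0,), (1,), (5,), (10,)]

agree_equiv = behavioral_agreement(sum_loop, sum_formula, TEST_SUITE)     # 1.0
agree_buggy = behavioral_agreement(sum_loop, sum_off_by_one, TEST_SUITE)  # 0.25

# Placeholder values standing in for a pooled embedding and trained head parameters.
POOLED = [0.2, -0.1, 0.4]
WEIGHTS = [0.5, 0.5, 0.5, 2.0]  # last weight applies to the agreement feature
BIAS = -1.0

score_equiv = fused_score(POOLED, agree_equiv, WEIGHTS, BIAS)
score_buggy = fused_score(POOLED, agree_buggy, WEIGHTS, BIAS)
```

With the same placeholder embedding for both pairs, the only difference between `score_equiv` and `score_buggy` is the agreement feature, which shows how an explicit behavioral signal can separate pairs that a token- or graph-based representation might treat alike.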
Taxonomy
Topics: Software Engineering Research · Web Data Mining and Analysis · Advanced Malware Detection Techniques
