Revisiting Code Similarity Evaluation with Abstract Syntax Tree Edit Distance
Yewei Song, Cedric Lothritz, Daniel Tang, Tegawendé F. Bissyandé, Jacques Klein

TL;DR
This paper evaluates the effectiveness of Abstract Syntax Tree (AST) edit distance for code similarity measurement across multiple programming languages, comparing it with traditional metrics and proposing an improved, adaptable metric called TSED.
Contribution
It introduces an optimized, adaptable AST-based code similarity metric, TSED, and provides a comprehensive comparison with traditional and modern similarity measures.
Findings
AST edit distance correlates highly with established metrics
TSED outperforms traditional similarity metrics across tested languages
AST-based metrics capture complex code structures effectively
Abstract
This paper revisits recent code similarity evaluation metrics, particularly focusing on the application of Abstract Syntax Tree (AST) editing distance in diverse programming languages. In particular, we explore the usefulness of these metrics and compare them to traditional sequence similarity metrics. Our experiments showcase the effectiveness of AST editing distance in capturing intricate code structures, revealing a high correlation with established metrics. Furthermore, we explore the strengths and weaknesses of AST editing distance and prompt-based GPT similarity scores in comparison to BLEU score, execution match, and Jaccard Similarity. We propose, optimize, and publish an adaptable metric that demonstrates effectiveness across all tested languages, representing an enhanced version of Tree Similarity of Edit Distance (TSED).
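To make the idea of an AST-edit-distance similarity score concrete, here is a minimal sketch in Python. It rests on several assumptions that are not stated in the text above: the score is normalized as 1 - distance / max(node count) and clamped at 0; trees come from Python's built-in ast module rather than whatever parser the authors use; and the distance is a simplified top-down (Selkow-style) edit distance with unit costs, which may differ from the algorithm behind the published metric. The function names (to_tree, size, top_down_distance, tsed) are hypothetical and chosen only for illustration.

```python
import ast
from functools import lru_cache


def to_tree(node):
    """Convert an ast node into a hashable (label, children) tuple.

    Only node types are used as labels, so identifier names and literal
    values are ignored (an assumption made to keep the sketch short).
    """
    children = tuple(to_tree(child) for child in ast.iter_child_nodes(node))
    return (type(node).__name__, children)


def size(tree):
    """Number of nodes in a (label, children) tree."""
    return 1 + sum(size(child) for child in tree[1])


@lru_cache(maxsize=None)
def top_down_distance(a, b):
    """Selkow-style edit distance with unit costs.

    Roots are relabeled if they differ; child sequences are aligned with a
    sequence-edit dynamic program in which inserting or deleting a child
    costs the size of its whole subtree.
    """
    relabel = 0 if a[0] == b[0] else 1
    xs, ys = a[1], b[1]
    # dp[i][j]: cost of turning the first i children of a into the first j of b
    dp = [[0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i in range(1, len(xs) + 1):
        dp[i][0] = dp[i - 1][0] + size(xs[i - 1])
    for j in range(1, len(ys) + 1):
        dp[0][j] = dp[0][j - 1] + size(ys[j - 1])
    for i in range(1, len(xs) + 1):
        for j in range(1, len(ys) + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(xs[i - 1]),  # delete subtree
                dp[i][j - 1] + size(ys[j - 1]),  # insert subtree
                dp[i - 1][j - 1] + top_down_distance(xs[i - 1], ys[j - 1]),
            )
    return relabel + dp[len(xs)][len(ys)]


def tsed(code_a, code_b):
    """Similarity in [0, 1]: 1 - distance / max(node count), clamped at 0.

    The normalization is an assumed form, not taken from the text above.
    """
    tree_a = to_tree(ast.parse(code_a))
    tree_b = to_tree(ast.parse(code_b))
    delta = top_down_distance(tree_a, tree_b)
    return max(1.0 - delta / max(size(tree_a), size(tree_b)), 0.0)


if __name__ == "__main__":
    print(tsed("a = b + c", "a = b + c"))  # identical snippets give 1.0
    print(tsed("a = b + c", "a = b - c"))
    print(tsed("a = b + c", "for i in range(3):\n    print(i)"))
```

Because labels are node types only, two snippets that differ just in variable names or constants score 1.0 under this sketch. A metric intended for real evaluation would need to decide whether such differences should count.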
Taxonomy
Topics: Software Engineering Research · Advanced Malware Detection Techniques · Software System Performance and Reliability
Methods: Attention Is All You Need · Dropout · Adam · Linear Layer · Layer Normalization · Discriminative Fine-Tuning · Weight Decay · Byte Pair Encoding · Cosine Annealing
