Evaluating Small-Scale Code Models for Code Clone Detection

Jorge Martinez-Gil

arXiv:2506.10995·cs.SE·June 16, 2025

TL;DR

This paper systematically evaluates how well small code models classify code pairs as clones or non-clones across five datasets, highlighting their strengths and the remaining difficulty of pairs whose structural similarity does not reflect functional equivalence.

Contribution

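It benchmarks six small code models (CodeBERT, GraphCodeBERT, Salesforce T5, UniXCoder, PLBART, and Polycoder) on five clone-detection datasets (BigCloneBench, CodeJam, Karnalim, POJ104, and PoolC), offering insights into their effectiveness and limitations.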

Findings

01

Most models perform well on accuracy, precision, recall, and F1-score.

02

A marginal fraction of code pairs remains hard to classify, especially when the code looks similar but performs different operations.

03

Source code for the evaluation is publicly available.
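To make finding 02 concrete, here is a minimal sketch of a code pair that looks nearly identical but performs different operations. The two snippets, the `tokens`, `jaccard`, and `run` helpers, and the use of token-set Jaccard overlap as a stand-in for "structural similarity" are all illustrative assumptions; none of them comes from the paper or its datasets.

```python
import re

def tokens(code):
    """Return the set of identifier, number, and punctuation tokens in a snippet."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code))

def jaccard(a, b):
    """Token-set Jaccard overlap: a crude, hypothetical proxy for surface similarity."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

# Hypothetical pair: same shape, but one sums and the other multiplies.
SUM_SRC = """
def fold(xs):
    acc = 0
    for x in xs:
        acc = acc + x
    return acc
"""

PROD_SRC = """
def fold(xs):
    acc = 1
    for x in xs:
        acc = acc * x
    return acc
"""

def run(src, arg):
    """Execute a snippet in a fresh namespace and call its `fold` function."""
    namespace = {}
    exec(src, namespace)
    return namespace["fold"](arg)

similarity = jaccard(SUM_SRC, PROD_SRC)
same_behaviour = run(SUM_SRC, [2, 3, 4]) == run(PROD_SRC, [2, 3, 4])

print(f"token overlap: {similarity:.2f}")
print(f"same output on [2, 3, 4]: {same_behaviour}")
```

The pair shares most of its tokens yet returns different results on the same input, so a classifier that leans on surface form would be tempted to label it a clone. This is the kind of case the finding describes, shown here only as a toy instance.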

Abstract

Detecting code clones is relevant to software maintenance and code refactoring. This challenge still presents unresolved cases, mainly when structural similarity does not reflect functional equivalence, though recent code models show promise. Therefore, this research aims to systematically measure the performance of several newly introduced small code models in classifying code pairs as clones or non-clones. The evaluation is based on five datasets: BigCloneBench, CodeJam, Karnalim, POJ104, and PoolC, as well as six code models: CodeBERT, GraphCodeBERT, Salesforce T5, UniXCoder, PLBART, and Polycoder. Most models performed well across standard metrics, including accuracy, precision, recall, and F1-score. However, a marginal fraction of clones remains challenging to detect, especially when the code looks similar but performs different operations. The source code that illustrates our…
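The four metrics named in the abstract can be computed from binary clone/non-clone predictions as sketched below. This is a plain-Python illustration under stated assumptions: the `binary_metrics` function and the label lists are hypothetical, label 1 is assumed to mean "clone", and this is not the paper's evaluation code.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = clone)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # clones found
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-clones rejected
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed clones

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up labels for eight code pairs, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

In this toy run one clone is missed and one non-clone is flagged, so all four metrics come out at 0.75. Reporting precision and recall alongside accuracy matters for clone detection because the clone and non-clone classes are often imbalanced.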

Code & Models

Repositories

jorge-martinez-gil/small-code-models
pytorch · Official

Taxonomy

Topics: Web Application Security Vulnerabilities · Software Engineering Research · Advanced Malware Detection Techniques