Can Code Evaluation Metrics Detect Code Plagiarism?
Fahad Ebrahim, Mike Joy (The University of Warwick)

TL;DR
This study evaluates whether code evaluation metrics can effectively detect code plagiarism across various modification levels, comparing them with dedicated plagiarism detection tools.
Contribution
It provides an empirical comparison of code evaluation metrics and plagiarism detection tools, revealing their relative effectiveness at different modification levels.
Findings
Without preprocessing, Dolos performs best at the overall level.
CrystalBLEU, CodeBLEU, and RUBY outperform JPlag in ranking performance.
Performance declines at higher modification levels, but CrystalBLEU remains competitive.
Abstract
Source Code Plagiarism Detection (SCPD) plays an important role in maintaining fairness and academic integrity in software engineering education. Code Evaluation Metrics (CEMs) were developed for assessing code generation tasks. However, it remains unclear whether such metrics can reliably detect plagiarism across modification levels of increasing complexity (L1-L6). In this paper, we perform a comparative empirical study using two open-source labelled datasets, ConPlag (raw and template-free versions) and IRPlag. We evaluate five CEMs, namely CodeBLEU, CrystalBLEU, RUBY, Tree Structured Edit Distance (TSED), and CodeBERTScore. Performance is evaluated using threshold-free, ranking-based measures to assess overall, per-dataset, and per-level plagiarism detection performance. The results are compared against state-of-the-art (SOTA) Source Code Plagiarism Detection Tools (SCPDTs),…
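To make "threshold-free ranking-based measures" concrete, here is a minimal sketch of two common ones, ROC AUC and average precision, computed from similarity scores and plagiarism labels. The visible abstract does not name which measures the paper uses, so the choice of these two is an assumption, and the function names and example scores are hypothetical illustrations, not values from the study.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random plagiarised pair scores higher than a
    random non-plagiarised pair (ties count as half a win)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Mean of precision@k taken at each rank k where a plagiarised pair appears."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive pair")
    return total / hits


# Hypothetical similarity scores from some metric for five submission pairs;
# label 1 marks a pair annotated as plagiarised, 0 otherwise.
example_scores = [0.9, 0.8, 0.7, 0.4, 0.2]
example_labels = [1, 0, 1, 0, 0]

print(roc_auc(example_scores, example_labels))
print(average_precision(example_scores, example_labels))
```

Neither measure requires choosing a similarity cutoff, which is what allows tools and metrics with differently scaled scores to be compared on the same labelled pairs.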
