Detection of LLM-Paraphrased Code and Identification of the Responsible LLM Using Coding Style Features
Shinwoo Park, Hyundong Jin, Jeong-won Cha, Yo-Sub Han

TL;DR
This paper introduces a method to detect LLM-paraphrased code and identify the specific LLM used, leveraging coding style features, with significant improvements over baselines in accuracy and speed.
Contribution
The authors construct a dataset, LPcode, of human-written code paired with LLM-paraphrased code, and develop LPcodedec, a detection approach that outperforms existing methods at detecting paraphrased code and identifying the LLM that produced it.
Findings
Human-written and LLM-generated code differ significantly in coding style.
LPcodedec achieves higher F1 scores than the baselines.
LPcodedec is also much faster than the baselines, with speedups of over 200x.
Abstract
Recent progress in large language models (LLMs) for code generation has raised serious concerns about intellectual property protection. Malicious users can exploit LLMs to produce paraphrased versions of proprietary code that closely resemble the original. While the potential for LLM-assisted code paraphrasing continues to grow, research on detecting it remains limited, underscoring an urgent need for a detection system. We respond to this need by proposing two tasks. The first task is to detect whether code generated by an LLM is a paraphrased version of original human-written code. The second task is to identify which LLM was used to paraphrase the original code. For these tasks, we construct a dataset, LPcode, consisting of pairs of human-written code and LLM-paraphrased code using various LLMs. We statistically confirm significant differences in the coding styles of human-written and…
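To make the idea of "coding style features" concrete, here is a minimal sketch of how such features could be computed for a pair of Python snippets. The abstract as shown here does not list the features LPcodedec actually uses, so the five features below (comment ratio, blank-line ratio, average line length, snake_case and camelCase naming ratios) and the helper names `style_features` and `pair_features` are assumptions chosen for illustration, not the authors' method.

```python
import ast
import re

# Hypothetical naming-convention patterns (illustrative, not from the paper).
SNAKE = re.compile(r"^[a-z]+(_[a-z0-9]+)+$")
CAMEL = re.compile(r"^[a-z]+([A-Z][a-z0-9]*)+$")


def style_features(source: str) -> dict[str, float]:
    """Compute a few illustrative style features for Python source code."""
    lines = source.splitlines()
    n_lines = max(len(lines), 1)
    comment_lines = sum(1 for ln in lines if ln.strip().startswith("#"))
    blank_lines = sum(1 for ln in lines if not ln.strip())
    avg_len = sum(len(ln) for ln in lines) / n_lines

    # Collect identifiers the code defines: functions, arguments, assigned names.
    names = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            names.append(node.name)
        elif isinstance(node, ast.arg):
            names.append(node.arg)
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            names.append(node.id)
    n_names = max(len(names), 1)

    return {
        "comment_ratio": comment_lines / n_lines,
        "blank_ratio": blank_lines / n_lines,
        "avg_line_length": avg_len,
        "snake_case_ratio": sum(bool(SNAKE.match(n)) for n in names) / n_names,
        "camel_case_ratio": sum(bool(CAMEL.match(n)) for n in names) / n_names,
    }


def pair_features(original: str, candidate: str) -> list[float]:
    """Feature differences (candidate minus original), in sorted key order."""
    a, b = style_features(original), style_features(candidate)
    return [b[k] - a[k] for k in sorted(a)]


# Toy pair: a terse human-style snippet and a more verbose rewrite.
human = "def addNums(a,b):\n    return a+b\n"
rewrite = (
    "# Add two numbers.\n"
    "def add_numbers(first_value, second_value):\n"
    "    return first_value + second_value\n"
)
print(style_features(human))
print(style_features(rewrite))
print(pair_features(human, rewrite))
```

A vector like the one returned by `pair_features` could then be fed to any standard classifier, either a binary one for the detection task or a multi-class one for identifying the LLM. Which classifier and which features work best is an empirical question that the paper itself addresses.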
