One Size Does Not Fit All: Investigating Efficacy of Perplexity in Detecting LLM-Generated Code
Jinwei Xu, He Zhang, Yanjing Yang, Lanxin Yang, Zeru Cheng, Jun Lyu, Bohan Liu, Xin Zhou, Alberto Bacchelli, Yin Kia Chiam, Thiam Kian Chiew

TL;DR
This paper evaluates the effectiveness of the perplexity-based method for detecting large language model-generated code, revealing its strengths in generalization but limitations in accuracy and speed across various realistic scenarios.
Contribution
It provides the first large-scale analysis of perplexity-based detection, comparing it with other methods across multiple criteria and offering practical recommendations for improvement.
Findings
PERPLEXITY has the best generalization capability.
PERPLEXITY shows limited detection accuracy.
PERPLEXITY is unsuitable for high-level languages.
Abstract
Large language model-generated code (LLMgCode) has become increasingly common in software development. So far LLMgCode has more quality issues than human-authored code (HaCode). It is common for LLMgCode to mix with HaCode in a code change, while the change is signed by only human developers, without being carefully examined. Many automated methods have been proposed to detect LLMgCode from HaCode, in which the perplexity-based method (PERPLEXITY for short) is the state-of-the-art method. However, the efficacy evaluation of PERPLEXITY has focused on detection accuracy. Yet it is unclear whether PERPLEXITY is good enough in a wider range of realistic evaluation settings. To this end, we carry out a family of experiments to compare PERPLEXITY against feature- and pre-training-based methods from three perspectives: detection accuracy, detection speed, and generalization capability. The…
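As a rough illustration of what a perplexity-based detector computes, the sketch below derives perplexity from per-token log-probabilities and applies a cutoff. It is not the authors' implementation. The function names, the `threshold` value, and the rule that lower perplexity suggests LLMgCode are all assumptions made for this example. In practice the log-probabilities would come from a scoring language model.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-mean(log p)) over natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def looks_llm_generated(token_logprobs, threshold=2.0):
    """Hypothetical decision rule: flag code whose perplexity is below a cutoff.

    Both the direction (lower perplexity suggests LLMgCode) and the default
    threshold are assumptions for this sketch, not values from the paper.
    """
    return perplexity(token_logprobs) < threshold


if __name__ == "__main__":
    # Made-up log-probabilities standing in for a scoring model's output.
    predictable = [math.log(0.9)] * 3
    surprising = [math.log(0.1)] * 3
    print(looks_llm_generated(predictable), looks_llm_generated(surprising))
```

A single fixed threshold of this kind is only a stand-in; a real detector would have to calibrate it for each programming language and scoring model.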
