Uncovering LLM-Generated Code: A Zero-Shot Synthetic Code Detector via Code Rewriting

Tong Ye; Yangkai Du; Tengfei Ma; Lingfei Wu; Xuhong Zhang; Shouling Ji; Wenhai Wang

arXiv:2405.16133 · cs.SE · December 17, 2024

TL;DR

This paper introduces a zero-shot method for detecting LLM-generated code by analyzing the similarity between original and rewritten code, significantly outperforming existing detectors on benchmark datasets.

Contribution

It proposes a novel zero-shot synthetic code detection approach using code rewriting and contrastive learning, addressing the limitations of general text detectors for programming code.

Findings

01

Achieved a 20.5% AUROC improvement over existing detectors on the APPS benchmark.

02

Achieved a 29.1% AUROC improvement over existing detectors on the MBPP benchmark.

03

Outperforms existing state-of-the-art synthetic code detectors.
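The findings above are reported in AUROC, the probability that a detector scores a randomly chosen synthetic sample higher than a randomly chosen human-written one. As a minimal sketch of how that metric can be computed from detector scores, the function below uses the pairwise-comparison definition. The function name and the toy scores in the usage note are illustrative assumptions and are not taken from the paper.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Pairwise AUROC: label 1 = synthetic code, label 0 = human-written code.

    Illustrative helper, not the paper's evaluation code.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    # Count a win when the synthetic sample outscores the human one,
    # and half a win on ties.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

With made-up scores, auroc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]) gives 0.75, because three of the four synthetic/human pairs are ranked correctly. A value of 0.5 corresponds to chance and 1.0 to perfect separation.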

Abstract

Large Language Models (LLMs) have demonstrated remarkable proficiency in generating code. However, the misuse of LLM-generated (synthetic) code has raised concerns in both educational and industrial contexts, underscoring the urgent need for synthetic code detectors. Existing methods for detecting synthetic content are primarily designed for general text and struggle with code due to the unique grammatical structure of programming languages and the presence of numerous "low-entropy" tokens. Building on this, our work proposes a novel zero-shot synthetic code detector based on the similarity between the original code and its LLM-rewritten variants. Our method is based on the observation that differences between LLM-rewritten and original code tend to be smaller when the original code is synthetic. We utilize self-supervised contrastive learning to train a code similarity model and…
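The rewrite-and-compare idea in the abstract can be sketched roughly as follows. This is a hedged illustration, not the authors' implementation: the `rewrite` callable stands in for an LLM rewriting call, `n_rewrites` is an assumed parameter, and the token-level `difflib` ratio is a simple substitute for the contrastively trained code similarity model the paper describes.

```python
import difflib
import re
from typing import Callable, List


def tokenize(code: str) -> List[str]:
    # Crude lexer (assumption): identifiers, digit runs, or single symbols.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def similarity(a: str, b: str) -> float:
    # Stand-in similarity in [0, 1]. The paper trains a contrastive
    # code similarity model; difflib is used here only for illustration.
    return difflib.SequenceMatcher(None, tokenize(a), tokenize(b)).ratio()


def detection_score(
    code: str,
    rewrite: Callable[[str], str],  # hypothetical LLM rewriting function
    n_rewrites: int = 4,            # assumed number of rewrites to average
) -> float:
    """Average similarity between the code and its rewrites.

    Higher scores suggest the input is more likely to be synthetic,
    following the observation that rewrites of synthetic code differ less.
    """
    if n_rewrites < 1:
        raise ValueError("n_rewrites must be at least 1")
    rewrites = [rewrite(code) for _ in range(n_rewrites)]
    return sum(similarity(code, r) for r in rewrites) / n_rewrites
```

In use, `rewrite` would wrap a prompt asking an LLM to rewrite the snippet, and the resulting score would be compared against a threshold chosen on held-out data. A rewriter that returns its input unchanged yields a score of 1.0, which is the upper bound of this stand-in measure.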

Taxonomy

Topics: Digital and Cyber Forensics

Methods: Contrastive Learning