Is this Snippet Written by ChatGPT? An Empirical Study with a CodeBERT-Based Classifier
Phuong T. Nguyen, Juri Di Rocco, Claudio Di Sipio, Riccardo Rubei, Davide Di Ruscio, Massimiliano Di Penta

TL;DR
This paper introduces GPTSniffer, a CodeBERT-based classifier that detects AI-generated code snippets. It classifies snippets accurately and outperforms existing tools, and its performance depends on factors such as how similar the training data is to the snippets being classified.
Contribution
The paper presents GPTSniffer, a novel AI code detection method that improves accuracy over existing tools and analyzes factors affecting classification performance.
Findings
GPTSniffer outperforms GPTZero and OpenAI Text Classifier.
Training data similarity boosts detection accuracy.
A classification context with paired snippets improves results.
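Comparisons like the ones above rest on scoring each detector's predictions against known labels. This summary does not state which metrics the authors report, so the following is only a minimal sketch assuming the standard binary-classification metrics (accuracy, precision, recall, F1). The label encoding, the `Scores` container, and the `score_detector` function are hypothetical names introduced for illustration, and the sample labels are made up rather than taken from the paper.

```python
from dataclasses import dataclass

# Hypothetical label encoding (not from the paper).
HUMAN, CHATGPT = 0, 1


@dataclass
class Scores:
    accuracy: float
    precision: float
    recall: float
    f1: float


def _ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def score_detector(y_true, y_pred, positive=CHATGPT):
    """Score predictions against ground-truth labels for one detector."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    return Scores(
        accuracy=_ratio(tp + tn, len(pairs)),
        precision=precision,
        recall=recall,
        f1=_ratio(2 * precision * recall, precision + recall),
    )


# Illustrative labels only: 1 = ChatGPT-written, 0 = human-written.
truth = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 0, 1, 1, 0]
print(score_detector(truth, predicted))
# prints Scores(accuracy=0.75, precision=0.75, recall=0.75, f1=0.75)
```

Running the same function over the predictions of two detectors on one shared test set gives directly comparable numbers, which is the kind of comparison a claim such as "outperforms" implies.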
Abstract
Since its launch in November 2022, ChatGPT has gained popularity among users, especially programmers who use it as a tool to solve development problems. However, while offering a practical solution to programming problems, ChatGPT should be used mainly as a supporting tool (e.g., in software education) rather than as a replacement for the human being. Thus, automatically detecting source code generated by ChatGPT is necessary, and tools for identifying AI-generated content may need to be adapted to work effectively with source code. This paper presents an empirical study to investigate the feasibility of automated identification of AI-generated code snippets, and the factors that influence this ability. To this end, we propose a novel approach called GPTSniffer, which builds on top of CodeBERT to detect source code written by AI. The results show that GPTSniffer can accurately classify…
Taxonomy
Topics: Software Engineering Research · Artificial Intelligence in Healthcare and Education · Machine Learning and Data Classification
