Discriminating Human-authored from ChatGPT-Generated Code Via Discernable Feature Analysis
Li Ke, Hong Sheng, Fu Cai, Zhang Yunhe, Liu Ming

TL;DR
This paper develops a method to distinguish ChatGPT-generated code from human-written code using discriminative features, dataset cleansing, and data augmentation techniques, achieving high accuracy in authorship attribution.
Contribution
It introduces a novel discriminative feature set, dataset cleansing strategy, and data augmentation methods specifically for differentiating AI-generated code from human code.
Findings
High accuracy in binary classification of code authorship
Effective dataset cleansing technique for uncontaminated data
Successful generation of extensive ChatGPT code datasets
Abstract
The ubiquitous adoption of Large Language Generation Models (LLMs) in programming has underscored the importance of differentiating between human-written code and code generated by intelligent models. This paper specifically aims to distinguish code generated by ChatGPT from that authored by humans. Our investigation reveals disparities in programming style, technical level, and readability between these two sources. Consequently, we develop a discriminative feature set for differentiation and evaluate its efficacy through ablation experiments. Additionally, we devise a dataset cleansing technique, which employs temporal and spatial segmentation, to mitigate the dearth of datasets and to secure high-caliber, uncontaminated datasets. To further enrich data resources, we employ "code transformation," "feature transformation," and "feature customization" techniques, generating an extensive…
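As a rough illustration of what a style- and readability-based feature vector might look like, the sketch below computes three simple layout features from a source string. The feature names and the function `style_features` are hypothetical and chosen only for illustration; the summary above does not list the paper's actual feature set, and this sketch assumes `#`-style line comments.

```python
import statistics


def style_features(source: str) -> dict:
    """Compute a few illustrative layout features (hypothetical, not the paper's set)."""
    lines = source.splitlines()
    non_blank = [ln for ln in lines if ln.strip()]
    # Assumption: a comment line is any non-blank line starting with '#'.
    comment_lines = [ln for ln in non_blank if ln.lstrip().startswith("#")]
    total = len(lines) or 1  # avoid division by zero on empty input
    return {
        "blank_line_ratio": (len(lines) - len(non_blank)) / total,
        "comment_line_ratio": len(comment_lines) / total,
        "mean_line_length": (
            statistics.mean(len(ln) for ln in non_blank) if non_blank else 0.0
        ),
    }


if __name__ == "__main__":
    sample = "# add\n\nx = 1\ny = 2\n"
    print(style_features(sample))
```

Vectors like this could then be passed to any binary classifier. An ablation experiment in the sense of the abstract would drop one feature group at a time and compare the resulting accuracy.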
Taxonomy
Topics: Software Engineering Research · Topic Modeling · Text Readability and Simplification
