AIGCodeSet: A New Annotated Dataset for AI Generated Code Detection

Basak Demirok; Mucahid Kutlu

arXiv:2412.16594 · cs.SE · May 27, 2025

TL;DR

This paper introduces AIGCodeSet, a new annotated dataset of human and AI-generated Python code, and evaluates baseline detection methods, highlighting the Bayesian classifier's superior performance.

Contribution

The creation of AIGCodeSet, a large, annotated dataset for AI-generated code detection, and the experimental comparison of baseline detection methods.

Findings

01. A Bayesian classifier outperforms the other evaluated models in detection accuracy.

02. AIGCodeSet contains 2,828 AI-generated and 4,755 human-written Python code samples.

03. Experiments demonstrate effectiveness of baseline detection methods.
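
The reported counts imply an imbalanced dataset, which matters when reading any accuracy figure. The short sketch below only does the arithmetic on the two counts stated above; the variable names are illustrative and nothing here comes from the authors' code. It shows that a trivial detector that always answers "human-written" would already be right on roughly 63% of samples, so that is the floor a useful detector should clear.

```python
# Counts as reported for AIGCodeSet in the findings above.
ai_generated = 2828
human_written = 4755

total = ai_generated + human_written
ai_share = ai_generated / total

# Accuracy of a trivial detector that always predicts the larger class.
majority_baseline = max(ai_generated, human_written) / total

print(f"total samples: {total}")
print(f"AI-generated share: {ai_share:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```

This assumes evaluation on the full set with its natural class ratio; a different train/test split or a rebalanced subset would shift the baseline.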

Abstract

While large language models provide significant convenience for software development, they can lead to ethical issues in job interviews and student assignments. Therefore, determining whether a piece of code is written by a human or generated by an artificial intelligence (AI) model is a critical issue. In this study, we present AIGCodeSet, which consists of 2,828 AI-generated and 4,755 human-written Python code samples, created using CodeLlama 34B, Codestral 22B, and Gemini 1.5 Flash. In addition, we share the results of our experiments conducted with baseline detection methods. Our experiments show that a Bayesian classifier outperforms the other models.
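
To make the idea of a Bayesian detector concrete, here is a minimal sketch of one possible variant: a multinomial naive Bayes classifier over code tokens with add-one smoothing. This is an assumption for illustration only. The abstract does not say which features or which Bayesian formulation the authors used, and the class `NaiveBayesDetector`, the helper `tokenize`, and the four training snippets are all hypothetical and invented here, not drawn from AIGCodeSet.

```python
import math
import re
from collections import Counter


def tokenize(code):
    """Split source text into identifiers, integers, and single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


class NaiveBayesDetector:
    """Multinomial naive Bayes over code tokens (illustrative sketch only)."""

    def fit(self, samples, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.token_counts = {c: Counter() for c in self.classes}
        for code, label in zip(samples, labels):
            self.token_counts[label].update(tokenize(code))
        # Vocabulary is the union of tokens seen in any class.
        self.vocab = set().union(*self.token_counts.values())
        return self

    def log_scores(self, code):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for c in self.classes:
            class_total = sum(self.token_counts[c].values())
            # Start from the log prior of the class.
            score = math.log(self.doc_counts[c] / total_docs)
            for tok in tokenize(code):
                if tok in self.vocab:  # tokens never seen in training are skipped
                    # Add-one (Laplace) smoothed token likelihood.
                    count = self.token_counts[c][tok]
                    score += math.log((count + 1) / (class_total + vocab_size))
            scores[c] = score
        return scores

    def predict(self, code):
        scores = self.log_scores(code)
        return max(scores, key=scores.get)


# Hypothetical toy training data, invented for this sketch.
train = [
    ("def add(a, b):\n    # Return the sum of two numbers.\n    return a + b", "ai"),
    ("def is_even(n):\n    # Check whether n is even.\n    return n % 2 == 0", "ai"),
    ("n=int(input())\nprint(n*2)", "human"),
    ("a,b=map(int,input().split())\nprint(a+b)", "human"),
]
detector = NaiveBayesDetector().fit([s for s, _ in train], [y for _, y in train])

print(detector.predict("x=int(input())\nprint(x)"))
print(detector.predict("def square(v):\n    # Return the square of v.\n    return v * v"))
```

Four snippets are far too few to say anything about real detection quality; the point is only to show how class priors and smoothed token likelihoods combine into a decision. A realistic baseline would train on the actual dataset and would likely use richer features than raw tokens.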

Code & Models

Datasets

basakdemirok/AIGCodeSet
dataset · 160 dl

Taxonomy

Topics: Natural Language Processing Techniques · Biochemical and Structural Characterization · Text Readability and Simplification