MultiAIGCD: A Comprehensive Dataset for AI Generated Code Detection Covering Multiple Languages, Models, Prompts, and Scenarios

Basak Demirok; Mucahid Kutlu; Selin Mergen

arXiv:2507.21693 · cs.SE · July 30, 2025

TL;DR

This paper introduces MultiAIGCD, a large, diverse dataset for detecting AI-generated code across multiple languages, models, prompts, and scenarios, to aid research in maintaining academic and professional integrity.

Contribution

The paper presents MultiAIGCD, a comprehensive dataset of over 153,000 code snippets in three languages (Python, Java, and Go) covering multiple usage scenarios, and benchmarks current detection models on it.

Findings

01. Detection performance varies across languages and scenarios.

02. The dataset enables evaluation of cross-model and cross-language detection.

03. MultiAIGCD supports future research in AI-generated code detection.

Abstract

As large language models (LLMs) rapidly advance, their role in code generation has expanded significantly. While this offers streamlined development, it also creates concerns in areas like education and job interviews. Consequently, developing robust systems to detect AI-generated code is imperative to maintain academic integrity and ensure fairness in hiring processes. In this study, we introduce MultiAIGCD, a dataset for AI-generated code detection for Python, Java, and Go. From the CodeNet dataset's problem definitions and human-authored code, we generate several code samples in Java, Python, and Go with six different LLMs and three different prompts. This generation process covers three key usage scenarios: (i) generating code from problem descriptions, (ii) fixing runtime errors in human-written code, and (iii) correcting incorrect outputs. Overall, MultiAIGCD consists of 121,271…
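
To make the shape of the generation setup concrete, the following is a minimal sketch that enumerates the configuration space the abstract describes: three languages, three usage scenarios, six LLMs, and three prompts. The model and prompt names are hypothetical placeholders, since this summary does not name them, and the full Cartesian product is an assumption; the abstract does not say that every combination was actually used.

```python
from itertools import product

# Languages and scenarios as listed in the abstract.
LANGUAGES = ["Python", "Java", "Go"]
SCENARIOS = [
    "generate_from_description",
    "fix_runtime_error",
    "correct_incorrect_output",
]

# Hypothetical placeholders: the summary gives only the counts
# (six LLMs, three prompts), not their identities.
MODELS = [f"llm_{i}" for i in range(1, 7)]
PROMPTS = [f"prompt_{i}" for i in range(1, 4)]


def generation_grid():
    """Return every (language, scenario, model, prompt) combination.

    Assumption: a full Cartesian product. The actual dataset may
    restrict some prompts or models to particular scenarios.
    """
    return [
        {"language": lang, "scenario": scen, "model": model, "prompt": prompt}
        for lang, scen, model, prompt in product(
            LANGUAGES, SCENARIOS, MODELS, PROMPTS
        )
    ]


grid = generation_grid()
print(len(grid))  # 3 * 3 * 6 * 3 = 162 configurations per problem
```

Under these assumptions, each CodeNet problem would yield up to 162 generation configurations, which gives a rough sense of how a dataset of this size can be assembled from a limited set of problems.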
