AICD Bench: A Challenging Benchmark for AI-Generated Code Detection

Daniil Orel; Dilshod Azizov; Indraneil Paul; Yuxia Wang; Iryna Gurevych; Preslav Nakov

arXiv:2602.02079 · cs.LG · February 3, 2026

TL;DR

AICD Bench is a comprehensive, large-scale benchmark for evaluating AI-generated code detection across multiple models, languages, and realistic scenarios. It reveals the limitations of current detectors, especially under challenging conditions.

Contribution

Introduces AICD Bench, the largest and most diverse benchmark for AI-generated code detection, including new tasks and extensive evaluation of existing detectors.

Findings

01. Detection performance remains well below what practical use requires.

02. Detectors struggle under distribution shifts and on hybrid or adversarial code.

03. Current detectors are not robust enough for real-world application.

Abstract

Large language models (LLMs) are increasingly capable of generating functional source code, raising concerns about authorship, accountability, and security. While detecting AI-generated code is critical, existing datasets and benchmarks are narrow, typically limited to binary human-machine classification under in-distribution settings. To bridge this gap, we introduce AICD Bench, the most comprehensive benchmark for AI-generated code detection. It spans 2M examples, 77 models across 11 families, and 9 programming languages, including recent reasoning models. Beyond scale, AICD Bench introduces three realistic detection tasks: (i) Robust Binary Classification under distribution shifts in language and domain, (ii) Model Family Attribution, grouping generators by architectural lineage, and…
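
To make the idea of evaluating under a language shift concrete, here is a minimal sketch in Python. It is not the benchmark's evaluation code: the record fields (`language`, `label`, `code`), the label names (`"ai"`, `"human"`), the choice of macro-F1 as the metric, and the `toy_detector` heuristic are all assumptions made for illustration. The sketch scores one detector separately on languages it is assumed to have seen and on held-out languages, so the gap between the two becomes visible.

```python
from collections import defaultdict


def macro_f1(y_true, y_pred):
    """Macro-averaged F1 over every label present in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def score_by_split(records, predict, seen_languages):
    """Score a detector separately on seen and held-out languages."""
    buckets = defaultdict(lambda: ([], []))
    for rec in records:
        split = "in_distribution" if rec["language"] in seen_languages else "shifted"
        y_true, y_pred = buckets[split]
        y_true.append(rec["label"])
        y_pred.append(predict(rec["code"]))
    return {split: macro_f1(t, p) for split, (t, p) in buckets.items()}


def toy_detector(code):
    # Hypothetical heuristic, for illustration only: treat commented code as AI-written.
    return "ai" if "#" in code else "human"


# Toy records (invented for this sketch, not taken from the benchmark).
records = [
    {"language": "python", "label": "ai", "code": "x = 1  # set x"},
    {"language": "python", "label": "human", "code": "x=1"},
    {"language": "rust", "label": "ai", "code": "let x = 1;"},
    {"language": "rust", "label": "human", "code": "let y = 2;"},
]

scores = score_by_split(records, toy_detector, seen_languages={"python"})
print(scores)
```

On this toy data the heuristic is perfect on the seen language and collapses on the held-out one, because the cue it relies on (a `#` comment) does not transfer. The same `macro_f1` helper works unchanged for a multi-class setting such as family attribution, since it averages over whatever labels appear.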

Taxonomy

Topics: Authorship Attribution and Profiling · Topic Modeling · Adversarial Robustness in Machine Learning