AICD Bench: A Challenging Benchmark for AI-Generated Code Detection
Daniil Orel, Dilshod Azizov, Indraneil Paul, Yuxia Wang, Iryna Gurevych, Preslav Nakov

TL;DR
AICD Bench is a comprehensive, large-scale benchmark for evaluating AI-generated code detection across multiple models, languages, and realistic scenarios. It shows that current detectors fall short, especially under distribution shift and on hybrid or adversarial code.
Contribution
Introduces AICD Bench, the largest and most diverse benchmark for AI-generated code detection, including new tasks and extensive evaluation of existing detectors.
Findings
Detection performance is significantly below practical usability.
Models struggle under distribution shifts and with hybrid or adversarial code.
Current detectors are insufficient for robust real-world application.
Abstract
Large language models (LLMs) are increasingly capable of generating functional source code, raising concerns about authorship, accountability, and security. While detecting AI-generated code is critical, existing datasets and benchmarks are narrow, typically limited to binary human-machine classification under in-distribution settings. To bridge this gap, we introduce AICD Bench, the most comprehensive benchmark for AI-generated code detection. It spans multiple programming languages and generator models, including recent reasoning models. Beyond scale, AICD Bench introduces three realistic detection tasks: (i) detection under distribution shifts in language and domain, (ii) attribution of generators grouped by architectural lineage, and…
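To make the first task concrete, the following is a minimal sketch of how a detector could be scored under a shift in programming language: hold out one language entirely, then score predictions on it. This is not the benchmark's actual evaluation code. The sample fields (`language`, `label`), the helper names `split_by_language` and `macro_f1`, the toy records, and the choice of macro-averaged F1 as the metric are all assumptions made for illustration.

```python
def split_by_language(samples, held_out):
    """Split samples so the held-out languages never appear in training.

    `samples` is a list of dicts with hypothetical keys "language" and "label".
    """
    train = [s for s in samples if s["language"] not in held_out]
    test = [s for s in samples if s["language"] in held_out]
    return train, test


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy records, invented purely for illustration (not benchmark data).
toy_samples = [
    {"language": "python", "label": "human"},
    {"language": "python", "label": "machine"},
    {"language": "go", "label": "human"},
    {"language": "go", "label": "machine"},
]

# Out-of-distribution setting: "go" is unseen at training time.
train_set, test_set = split_by_language(toy_samples, held_out={"go"})
```

Comparing the score on `test_set` with the score on a split drawn from the training languages gives a rough measure of how much a detector degrades under language shift.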
Taxonomy
Topics: Authorship Attribution and Profiling · Topic Modeling · Adversarial Robustness in Machine Learning
