Whitespaces Don't Lie: Feature-Driven and Embedding-Based Approaches for Detecting Machine-Generated Code
Syed Mehedi Hasan Nirob, Shamim Ehsan, Moqsadur Rahman, Summit Haque

TL;DR
This paper compares feature-based and embedding-based methods for detecting whether code is human-written or AI-generated, finding both approaches highly effective with different interpretability and generalization trade-offs.
Contribution
It introduces and evaluates two complementary detection approaches, feature-based and embedding-based, on a benchmark of 600k human-written and AI-generated code samples, highlighting the importance of whitespace features and embedding semantics.
Findings
Feature-based models achieve ROC-AUC 0.995, PR-AUC 0.995, and F1 0.971.
Embedding-based models with CodeBERT achieve ROC-AUC 0.994.
Whitespace and indentation features are highly discriminative.
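To make the whitespace finding concrete, here is a minimal sketch of how whitespace and indentation statistics could be extracted from a code sample. The function name `whitespace_features` and the six feature names are illustrative assumptions, not the paper's actual feature set, which this summary does not list.

```python
from statistics import mean, pstdev


def whitespace_features(source: str) -> dict:
    """Compute simple whitespace/indentation statistics for one code sample.

    Hypothetical feature set for illustration only; the paper's real
    features are not specified in this summary.
    """
    lines = source.splitlines()
    n_lines = len(lines)
    # A line counts as blank if it is empty or contains only whitespace.
    nonblank = [ln for ln in lines if ln.strip()]
    # Leading indentation width in characters (spaces and tabs each count 1).
    indents = [len(ln) - len(ln.lstrip(" \t")) for ln in nonblank]

    return {
        "blank_line_ratio": (n_lines - len(nonblank)) / n_lines if n_lines else 0.0,
        "tab_indented_ratio": (
            sum(ln.startswith("\t") for ln in nonblank) / len(nonblank)
            if nonblank else 0.0
        ),
        "trailing_ws_ratio": (
            sum(ln != ln.rstrip() for ln in lines) / n_lines if n_lines else 0.0
        ),
        "mean_indent": mean(indents) if indents else 0.0,
        "indent_std": pstdev(indents) if indents else 0.0,
        "space_char_ratio": source.count(" ") / len(source) if source else 0.0,
    }


if __name__ == "__main__":
    sample = "def f(x):\n    return x + 1  \n\n\tprint(x)\n"
    for name, value in whitespace_features(sample).items():
        print(f"{name}: {value:.3f}")
```

A vector like this could be fed to any standard classifier. Because each entry has a direct reading (for example, the share of lines with trailing whitespace), it illustrates the interpretability side of the trade-off the paper describes for feature-based detectors.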
Abstract
Large language models (LLMs) have made it remarkably easy to synthesize plausible source code from natural language prompts. While this accelerates software development and supports learning, it also raises new risks for academic integrity, authorship attribution, and responsible AI use. This paper investigates the problem of distinguishing human-written from machine-generated code by comparing two complementary approaches: feature-based detectors built from lightweight, interpretable stylometric and structural properties of code, and embedding-based detectors leveraging pretrained code encoders. Using a recent large-scale benchmark dataset of 600k human-written and AI-generated code samples, we find that feature-based models achieve strong performance (ROC-AUC 0.995, PR-AUC 0.995, F1 0.971), while embedding-based models with CodeBERT embeddings are also very competitive (ROC-AUC 0.994,…
Taxonomy
Topics: Software Engineering Research · Authorship Attribution and Profiling · Advanced Malware Detection Techniques
