Is This You, LLM? Recognizing AI-written Programs with Multilingual Code Stylometry

Andrea Gurioli (DISI, UNIBO), Maurizio Gabbrielli (DISI, UNIBO), Stefano Zacchiroli (IP Paris, LTCI, ACES, INFRES)

arXiv:2412.14611·cs.SE·December 20, 2024

Open Access

TL;DR

This paper presents a transformer-based encoder classifier that detects AI-generated code across 10 programming languages with a single model, reaching 84.1% ± 3.8% average accuracy. It is supported by a new open dataset, H-AIRosettaMP, and a fully reproducible experimental pipeline.

Contribution

Introduces a multilingual AI code stylometry classifier and an open dataset, enabling detection of AI-written code across multiple languages with high accuracy.

Findings

01. Achieved 84.1% average accuracy across 10 languages
02. Developed a fully reproducible pipeline for AI code detection
03. Relied solely on open LLMs for experiments
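
The headline figure above is an average over languages. As a minimal sketch of how such a figure could be computed, the snippet below derives per-language accuracy from (language, true label, predicted label) records and then summarizes it as a mean and a spread. The record layout, the function names, and the toy data are hypothetical, and treating the spread as a sample standard deviation is an assumption; the source does not say how the paper computes its ± value.

```python
from collections import defaultdict
from statistics import mean, stdev


def per_language_accuracy(records):
    """Return {language: accuracy} from (language, truth, predicted) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for language, truth, predicted in records:
        total[language] += 1
        correct[language] += int(truth == predicted)
    return {language: correct[language] / total[language] for language in total}


def summarize(accuracies):
    """Return (mean, sample standard deviation) over the per-language values.

    Assumption: the spread is a sample standard deviation across languages.
    """
    values = list(accuracies.values())
    spread = stdev(values) if len(values) > 1 else 0.0
    return mean(values), spread


# Hypothetical toy predictions for illustration only, not data from the paper.
records = [
    ("python", "ai", "ai"),
    ("python", "human", "human"),
    ("python", "ai", "human"),
    ("python", "human", "human"),
    ("java", "ai", "ai"),
    ("java", "human", "human"),
    ("java", "human", "human"),
    ("java", "ai", "ai"),
]

accuracies = per_language_accuracy(records)
average, spread = summarize(accuracies)
print(f"{average:.1%} ± {spread:.1%}")
```

Averaging per-language accuracies, rather than pooling all snippets, gives each language equal weight regardless of how many snippets it contributes.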

Abstract

With the increasing popularity of LLM-based code completers, like GitHub Copilot, the interest in automatically detecting AI-generated code is also increasing, in particular in contexts where the use of LLMs to program is forbidden by policy due to security, intellectual property, or ethical concerns. We introduce a novel technique for AI code stylometry, i.e., the ability to distinguish code generated by LLMs from code written by humans, based on a transformer-based encoder classifier. Unlike previous work, our classifier is capable of detecting AI-written code across 10 different programming languages with a single machine learning model, maintaining high average accuracy across all languages (84.1% ± 3.8%). Together with the classifier we also release H-AIRosettaMP, a novel open dataset for AI code stylometry tasks, consisting of 121 247 code snippets in 10 popular…

Taxonomy

Topics: Software Engineering Research