FMI_SU_Yotkova_Kastreva at SemEval-2026 Task 13: Lightweight Detection of LLM-Generated Code via Stylometric Signals

Elitsa Yotkova; Violeta Kastreva; Dimitar Dimitrov; Ivan Koychev; Preslav Nakov

arXiv:2605.04157·cs.CL·May 7, 2026

TL;DR

This paper presents a lightweight, efficient method for detecting machine-generated code across languages using stylometric features, parsing, and simple classifiers, achieving fast inference with minimal resources.

Contribution

It introduces a resource-efficient approach that combines ratio-based features, parsing, and heuristic rules to detect machine-generated code across programming languages, with faster inference than large models.

Findings

01 Achieves near-instant inference and requires only CPU resources for training.

02 Uses ratio-based features that are less sensitive to code snippet length.

03 Combines parsing, classifiers, and heuristic rules for detection.
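
To illustrate why ratio-based features are less sensitive to snippet length, here is a minimal Python sketch. The feature names, the `ratio_features` helper, and the comment prefixes are assumptions made for illustration; they are not the feature set used in the paper.

```python
from typing import Dict

# Hypothetical: only two common single-line comment styles are recognized.
COMMENT_PREFIXES = ("#", "//")


def ratio_features(code: str) -> Dict[str, float]:
    """Compute a few illustrative ratio features for a code snippet.

    Every value is a count divided by a total, so repeating the same
    snippet several times leaves the features unchanged.
    """
    lines = code.splitlines()
    total_lines = max(len(lines), 1)  # guard against empty input
    total_chars = max(len(code), 1)

    stripped = [ln.strip() for ln in lines]
    blank = sum(1 for ln in stripped if not ln)
    comment = sum(1 for ln in stripped if ln.startswith(COMMENT_PREFIXES))
    whitespace = sum(1 for ch in code if ch.isspace())

    return {
        "blank_line_ratio": blank / total_lines,
        "comment_line_ratio": comment / total_lines,
        "whitespace_char_ratio": whitespace / total_chars,
    }


snippet = "# add two numbers\ndef add(a, b):\n    return a + b\n\n"
short = ratio_features(snippet)
longer = ratio_features(snippet * 5)  # same style, five times the length
print(short == longer)
```

Raw counts (number of comments, number of blank lines) would grow fivefold for the longer input, whereas the ratios stay the same, which is the property the findings refer to.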

Abstract

SemEval-2026 Task 13 investigates machine-generated code detection across multiple programming languages and application scenarios, asking participating systems to generalize to unseen languages and domains. This paper describes our participation in Subtask A (binary classification) and explores both pretrained code encoders and lightweight feature-based methods. We design ratio-based features that are less sensitive to snippet length. To support the extraction of descriptiveness-related signals, we use parsing engines and a programming-language classifier. Additionally, we train a separate code-vs-text line classifier to identify raw natural language segments embedded within samples. We combine a shallow decision tree with heuristic rules derived from data analysis to produce the final predictions. Our approach is computationally efficient, requires only CPU resources for training, and…
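
The combination of a shallow decision tree with heuristic rules described in the abstract could look roughly like the Python sketch below. Everything specific here is an assumption for illustration: the feature names, the thresholds, the direction of each decision, and the `text_line_ratio` rule are invented and do not come from the paper.

```python
from typing import Dict

HUMAN, MACHINE = 0, 1


def shallow_tree(features: Dict[str, float]) -> int:
    """A hand-written depth-2 decision tree.

    Hypothetical: in practice the splits would be learned from data;
    these thresholds and labels are placeholders.
    """
    if features["comment_line_ratio"] > 0.30:
        return MACHINE
    if features["blank_line_ratio"] < 0.05:
        return HUMAN
    return MACHINE


def predict(features: Dict[str, float]) -> int:
    """Apply a heuristic rule first, then fall back to the tree.

    Hypothetical rule: if most lines were flagged as natural language by a
    code-vs-text line classifier (summarized here as `text_line_ratio`),
    label the sample MACHINE without consulting the tree.
    """
    if features.get("text_line_ratio", 0.0) > 0.50:
        return MACHINE
    return shallow_tree(features)


sample = {"comment_line_ratio": 0.10, "blank_line_ratio": 0.01, "text_line_ratio": 0.0}
tree_only = predict(sample)
with_rule = predict({**sample, "text_line_ratio": 0.90})
```

Both steps are plain comparisons on precomputed features, which is consistent with the CPU-only, fast-inference setting the abstract describes.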
