Code Fingerprints: Disentangled Attribution of LLM-Generated Code
Jiaxun Guo, Ziyuan Yang, Mengyu Sun, Hui Wang, Jingfeng Lu, and Yi Zhang

TL;DR
This paper introduces DCAN, a novel neural network approach that disentangles semantic and stylistic features to accurately attribute code snippets to their source LLMs across multiple languages.
Contribution
The paper presents the first large-scale benchmark dataset for LLM code attribution and proposes DCAN, a contrastive learning model that effectively identifies the source LLM of generated code.
Findings
DCAN achieves high attribution accuracy across models and languages.
The benchmark dataset enables systematic evaluation of code attribution methods.
Disentangling semantic and stylistic features improves attribution reliability.
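To make the disentangling idea concrete, here is a minimal sketch of a supervised contrastive loss applied only to the stylistic part of a code embedding. It is an illustration under stated assumptions, not DCAN's actual architecture or loss: the fixed split of an embedding into a semantic half and a style half, the function names `split_embedding` and `supervised_contrastive_loss`, the temperature value, and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def split_embedding(z, semantic_dim):
    """Hypothetical disentangling step: treat the first `semantic_dim`
    entries as the semantic part and the remainder as the style part."""
    return z[:semantic_dim], z[semantic_dim:]


def supervised_contrastive_loss(style_vecs, labels, temperature=0.1):
    """Supervised contrastive loss over style vectors.

    Snippets sharing a label (assumed here to be the source LLM) are pulled
    together; snippets with different labels are pushed apart.
    """
    n = len(style_vecs)
    total, counted = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no same-label partner contributes nothing
        logits = {
            j: cosine(style_vecs[i], style_vecs[j]) / temperature
            for j in range(n)
            if j != i
        }
        log_denom = math.log(sum(math.exp(v) for v in logits.values()))
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0


# Toy embeddings: the first two entries (semantic) are nearly identical across
# snippets, while the last two (style) differ by the assumed source model.
embeddings = [
    [0.5, 0.5, 1.0, 0.0],
    [0.4, 0.6, 0.9, 0.1],
    [0.5, 0.5, 0.0, 1.0],
    [0.4, 0.6, 0.1, 0.9],
]
styles = [split_embedding(z, 2)[1] for z in embeddings]

loss_matching = supervised_contrastive_loss(styles, ["A", "A", "B", "B"])
loss_shuffled = supervised_contrastive_loss(styles, ["A", "B", "A", "B"])
```

In this toy setup the loss is lower when the labels agree with the style clusters than when they are shuffled, which is the signal a contrastive attribution model would train on. Computing the loss on the style part alone keeps shared task semantics from dominating the comparison.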
Abstract
The rapid adoption of Large Language Models (LLMs) has transformed modern software development by enabling automated code generation at scale. While these systems improve productivity, they introduce new challenges for software governance, accountability, and compliance. Existing research primarily focuses on distinguishing machine-generated code from human-written code; however, many practical scenarios, such as vulnerability triage, incident investigation, and licensing audits, require identifying which LLM produced a given code snippet. In this paper, we study the problem of model-level code attribution, which aims to determine the source LLM responsible for generated code. Although attribution is challenging, differences in training data, architectures, alignment strategies, and decoding mechanisms introduce model-dependent stylistic and structural variations that serve as…
Taxonomy
Topics: Software Engineering Research · Scientific Computing and Data Management · Machine Learning in Materials Science
