Code Fingerprints: Disentangled Attribution of LLM-Generated Code

Jiaxun Guo, Ziyuan Yang, Mengyu Sun, Hui Wang, Jingfeng Lu, and Yi Zhang

arXiv:2603.04212·cs.SE·March 5, 2026

Open Access

TL;DR

This paper introduces DCAN, a novel neural network approach that disentangles semantic and stylistic features to accurately attribute code snippets to their source LLMs across multiple languages.

Contribution

The paper presents the first large-scale benchmark dataset for LLM code attribution and proposes DCAN, a contrastive learning model that effectively identifies the source LLM of generated code.

Findings

01. DCAN achieves high attribution accuracy across models and languages.

02. The benchmark dataset enables systematic evaluation of code attribution methods.

03. Disentangling semantic and stylistic features improves attribution reliability.
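
The summary describes DCAN as a contrastive learning model that separates stylistic from semantic features, but it does not give the loss itself. As a rough illustration only, the sketch below implements a generic supervised contrastive loss over hypothetical "style" embeddings, where snippets from the same source LLM count as positives. The function names, the toy embeddings, the labels `model_a` and `model_b`, and the temperature value are all assumptions made for this example and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss (illustrative, not DCAN's objective).

    embeddings: list of equal-length float vectors (hypothetical style features).
    labels: source-model label for each embedding.
    Returns the mean loss over anchors that have at least one positive,
    or 0.0 if no anchor has a positive.
    """
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    total, counted = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        # Temperature-scaled cosine similarity to every other sample.
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Log-sum-exp with max subtraction for numerical stability.
        max_sim = max(sims.values())
        log_denom = max_sim + math.log(
            sum(math.exp(s - max_sim) for s in sims.values())
        )
        # Average negative log-probability of picking a same-model sample.
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0


# Toy usage: two tight clusters of 2-D "style" vectors.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned = supervised_contrastive_loss(
    toy_embeddings, ["model_a", "model_a", "model_b", "model_b"]
)
misaligned = supervised_contrastive_loss(
    toy_embeddings, ["model_a", "model_b", "model_a", "model_b"]
)
print(aligned < misaligned)
```

The loss is small when embeddings from the same model sit close together and large when the labels cut across the clusters, which is the property an attribution model would try to induce in its style representation. How DCAN actually separates style from semantics is not specified in this summary.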

Abstract

The rapid adoption of Large Language Models (LLMs) has transformed modern software development by enabling automated code generation at scale. While these systems improve productivity, they introduce new challenges for software governance, accountability, and compliance. Existing research primarily focuses on distinguishing machine-generated code from human-written code; however, many practical scenarios (such as vulnerability triage, incident investigation, and licensing audits) require identifying which LLM produced a given code snippet. In this paper, we study the problem of model-level code attribution, which aims to determine the source LLM responsible for generated code. Although attribution is challenging, differences in training data, architectures, alignment strategies, and decoding mechanisms introduce model-dependent stylistic and structural variations that serve as…

Taxonomy

Topics: Software Engineering Research · Scientific Computing and Data Management · Machine Learning in Materials Science