Investigating Training Data Detection in AI Coders

Tianlin Li; Yunxiang Wei; Zhiming Li; Aishan Liu; Qing Guo; Xianglong Liu; Dongning Sun; Yang Liu

arXiv:2507.17389·cs.SE·July 24, 2025

Open Access

TL;DR

This paper conducts a comprehensive empirical evaluation of training data detection methods for code large language models, introducing a new benchmark dataset and mutation strategies to assess robustness and guide future improvements.

Contribution

It introduces CodeSnitch, a function-level benchmark dataset for code data detection, and systematically evaluates seven state-of-the-art TDD methods across multiple models and mutation scenarios.

Findings

01. Current TDD methods have varying effectiveness on code data.

02. Mutation strategies reveal robustness challenges in existing TDD techniques.

03. The study offers insights for developing more reliable code data detection methods.
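To make the idea of a code mutation concrete, the following is a minimal sketch of one semantics-preserving transformation: renaming parameters and local variables. It is an illustration only. This summary does not say which mutation strategies the paper uses, so the choice of identifier renaming, the helper name `mutate_rename`, and the `var_N` naming scheme are all assumptions made for the example.

```python
import ast


def mutate_rename(source: str) -> str:
    """Rename parameters and assigned variables to var_0, var_1, ...

    Hypothetical helper for illustration; not taken from the paper.
    Known limitations: calls that pass keyword arguments, `global` or
    `nonlocal` declarations, and code that already uses var_N names
    are not handled.
    """
    tree = ast.parse(source)

    # Pass 1: collect every parameter name and every assigned name.
    names = []
    for node in ast.walk(tree):
        if isinstance(node, ast.arg):
            candidate = node.arg
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            candidate = node.id
        else:
            continue
        if candidate not in names:
            names.append(candidate)
    mapping = {old: f"var_{i}" for i, old in enumerate(names)}

    # Pass 2: rewrite every occurrence of the collected names.
    # Function names are stored as plain strings on FunctionDef,
    # so they are left untouched.
    for node in ast.walk(tree):
        if isinstance(node, ast.arg):
            node.arg = mapping[node.arg]
        elif isinstance(node, ast.Name) and node.id in mapping:
            node.id = mapping[node.id]

    # ast.unparse requires Python 3.9 or newer.
    return ast.unparse(tree)


if __name__ == "__main__":
    original = (
        "def add_all(xs, start):\n"
        "    total = start\n"
        "    for x in xs:\n"
        "        total += x\n"
        "    return total\n"
    )
    print(mutate_rename(original))
```

The mutated function behaves the same as the original but shares fewer surface tokens with it, which is the kind of change a robustness evaluation could use to check whether a detector's decision survives a rewrite that leaves behavior intact.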

Abstract

Recent advances in code large language models (CodeLLMs) have made them indispensable tools in modern software engineering. However, these models occasionally produce outputs that contain proprietary or sensitive code snippets, raising concerns about potential non-compliant use of training data, and posing risks to privacy and intellectual property. To ensure responsible and compliant deployment of CodeLLMs, training data detection (TDD) has become a critical task. While recent TDD methods have shown promise in natural language settings, their effectiveness on code data remains largely underexplored. This gap is particularly important given code's structured syntax and distinct similarity criteria compared to natural language. To address this, we conduct a comprehensive empirical study of seven state-of-the-art TDD methods on source code data, evaluating their performance across eight…

Taxonomy

Topics: Imbalanced Data Classification Techniques