Investigating Training Data Detection in AI Coders
Tianlin Li, Yunxiang Wei, Zhiming Li, Aishan Liu, Qing Guo, Xianglong Liu, Dongning Sun, Yang Liu

TL;DR
This paper conducts a comprehensive empirical evaluation of training data detection methods for code large language models, introducing a new benchmark dataset and mutation strategies to assess robustness and guide future improvements.
Contribution
The paper introduces CodeSnitch, a function-level benchmark dataset for code data detection, and systematically evaluates seven state-of-the-art training data detection (TDD) methods across multiple models and mutation scenarios.
Findings
Current TDD methods have varying effectiveness on code data.
Mutation strategies reveal robustness challenges in existing TDD techniques.
The study offers insights for developing more reliable code data detection methods.
Abstract
Recent advances in code large language models (CodeLLMs) have made them indispensable tools in modern software engineering. However, these models occasionally produce outputs that contain proprietary or sensitive code snippets, raising concerns about potential non-compliant use of training data, and posing risks to privacy and intellectual property. To ensure responsible and compliant deployment of CodeLLMs, training data detection (TDD) has become a critical task. While recent TDD methods have shown promise in natural language settings, their effectiveness on code data remains largely underexplored. This gap is particularly important given code's structured syntax and distinct similarity criteria compared to natural language. To address this, we conduct a comprehensive empirical study of seven state-of-the-art TDD methods on source code data, evaluating their performance across eight…
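To make the TDD task concrete, the following is a minimal sketch of how a likelihood-based detector can score a code snippet and how such scores can be evaluated. It is an illustration under stated assumptions, not the paper's implementation: the scoring rule (mean of the lowest-probability tokens, in the spirit of Min-K% Prob) is not confirmed by this summary to be one of the seven evaluated methods, and the function names `min_k_score` and `auc` and all log-probability values are hypothetical.

```python
import math


def min_k_score(token_logprobs, k=0.2):
    """Mean of the lowest k fraction of token log-probabilities.

    Assumption: a higher (less negative) score suggests the snippet was
    more likely seen during training.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    n = max(1, math.ceil(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n


def auc(member_scores, nonmember_scores):
    """Fraction of (member, non-member) pairs ranked correctly; ties count 0.5."""
    if not member_scores or not nonmember_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


# Hypothetical per-token log-probabilities for four functions.
seen = [[-0.1, -0.3, -0.2, -0.9], [-0.2, -0.1, -0.4, -0.6]]
unseen = [[-0.5, -2.5, -1.0, -3.5], [-0.3, -1.8, -4.0, -0.7]]

member_scores = [min_k_score(lp, k=0.5) for lp in seen]
nonmember_scores = [min_k_score(lp, k=0.5) for lp in unseen]
print(auc(member_scores, nonmember_scores))
```

In a robustness check of the kind the abstract describes, the same scoring would be repeated on mutated versions of the member functions to see how far the separation degrades.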
Taxonomy
Topics: Imbalanced Data Classification Techniques
