Trained Without My Consent: Detecting Code Inclusion In Language Models Trained on Code
Vahid Majdinasab, Amin Nikanjam, Foutse Khomh

TL;DR
This paper introduces TraWiC, a novel, model-agnostic method for detecting whether specific code snippets were included in an LLM's training data, achieving higher accuracy than existing clone detection tools.
Contribution
The paper presents TraWiC, a new interpretable approach based on membership inference that effectively detects code inclusion in LLM training datasets, addressing privacy and copyright concerns.
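To make the membership-inference idea concrete, here is a minimal sketch and not TraWiC's actual pipeline, whose details this summary does not give. It assumes a hypothetical masking scheme (hide one identifier at a time), a hypothetical `fill_mask` callable standing in for the model under audit, and an arbitrary decision threshold of 0.8. The toy "model" simply memorizes its training corpus, which is the behavior such a check tries to expose.

```python
import re
from typing import Callable, Iterator, List, Tuple

IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
MASK = "<MASK>"                      # hypothetical mask token
KEYWORDS = frozenset({"def", "return"})  # tiny illustrative keyword list


def masked_variants(code: str) -> Iterator[Tuple[str, str]]:
    """Yield (masked_code, hidden_identifier), one pair per identifier occurrence."""
    for m in IDENT.finditer(code):
        if m.group() in KEYWORDS:
            continue
        yield code[:m.start()] + MASK + code[m.end():], m.group()


def hit_rate(code: str, fill_mask: Callable[[str], str]) -> float:
    """Fraction of masked identifiers that the model reproduces exactly."""
    pairs = list(masked_variants(code))
    if not pairs:
        return 0.0
    hits = sum(fill_mask(masked) == hidden for masked, hidden in pairs)
    return hits / len(pairs)


def looks_like_member(code: str, fill_mask: Callable[[str], str],
                      threshold: float = 0.8) -> bool:
    """Flag a snippet as likely training data (threshold is an assumption)."""
    return hit_rate(code, fill_mask) >= threshold


def make_memorizing_model(corpus: List[str]) -> Callable[[str], str]:
    """Toy stand-in for an LLM: recalls identifiers only for code it has seen."""
    table = {}
    for code in corpus:
        for masked, hidden in masked_variants(code):
            table[masked] = hidden
    return lambda masked: table.get(masked, "")


seen = "def add(a, b):\n    return a + b"
unseen = "def mul(x, y):\n    return x * y"
model = make_memorizing_model([seen])
```

A real model would not memorize this cleanly, so in practice the hit rates would feed a trained classifier rather than a fixed threshold.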
Findings
TraWiC detects 83.87% of the code snippets that were used to train an LLM.
By comparison, the NiCad clone detection tool detects only 47.64%.
TraWiC has low resource overhead.
Abstract
Code auditing ensures that developed code adheres to standards, regulations, and copyright protection by verifying that it does not contain code from protected sources. The recent advent of Large Language Models (LLMs) as coding assistants in the software development process poses new challenges for code auditing. The datasets used to train these models are mainly collected from publicly available sources. This raises the issue of intellectual property infringement, as developers' code is already included in these datasets. Auditing code developed using LLMs is therefore challenging: without access to the training datasets of these models, it is difficult to reliably assert whether an LLM used during development has been trained on specific copyrighted code. Given the non-disclosure of the training datasets, traditional approaches such as code clone detection are…
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification · Speech and dialogue systems
