RepoMark: A Data-Usage Auditing Framework for Code Large Language Models
Wenjie Qu, Yuguang Zhou, Bo Wang, Yuexin Li, Lionel Z. Wang, Jinyuan Jia, Jiaheng Zhang

TL;DR
RepoMark is a novel framework that enables effective auditing of code data used in training large language models, ensuring transparency and legal compliance with high detection accuracy and theoretical guarantees.
Contribution
It introduces a data marking and detection method for code LLMs that guarantees semantic preservation, imperceptibility, and false detection rate bounds, improving data auditing efficiency and accuracy.
Findings
Achieves over 90% detection success rate on small code repositories.
Outperforms prior techniques, all of which achieve accuracy below 55% under identical settings.
Provides a theoretically sound auditing method with strict FDR guarantees.
| Method | Category | FDR guarantee | DSR (%) at 1% FDR | DSR (%) at 2% FDR | DSR (%) at 5% FDR | DSR (%) at 10% FDR | DSR (%) at 20% FDR |
|---|---|---|---|---|---|---|---|
| Loss attack [39] | Membership inference | ✗ | 0.97 | 1.14 | 2.53 | 7.98 | 19.14 |
| min-k [40] (ICLR’24) | Membership inference | ✗ | 3.50 | 5.25 | 8.37 | 15.95 | 34.63 |
| zlib [41] | Membership inference | ✗ | 1.53 | 3.33 | 8.56 | 15.68 | 30.58 |
| Dataset inference [23] (NeurIPS’24) | Membership inference | ✗ | 8.56 | 15.12 | 18.68 | 29.79 | 54.67 |
| CodeMark [15] (FSE’23) | Backdoor marking | ✗ | 0.32 | 0.86 | 1.28 | 3.45 | 9.15 |
| Huang et al. [5] (CCS’24) | Contrastive marking | ✓ | 39.39 | 43.31 | 53.79 | 63.63 | 77.22 |
| RepoMark | Contrastive marking | ✓ | 84.94 | 86.39 | 92.44 | 96.75 | 97.07 |

| | Edit distance | PPL | CodeBLEU |
|---|---|---|---|
| Unmarked | 0 | 1.04 | 1 |
| RepoMark | 3.61 | 1.11 | 0.9633 |

| | Edit distance | PPL | CodeBLEU |
|---|---|---|---|
| Impact of | 6.74 | 1.20 | 0.8059 |
| | 3.61 | 1.11 | 0.9633 |
| | 1.89 | 1.10 | 0.9842 |
| | 0.97 | 1.09 | 0.9992 |

| Metric | Without mark | With mark |
|---|---|---|
| Pass@1 | 3.44% | 3.38% |
| Pass@10 | 7.80% | 7.72% |
| Pass@100 | 15.24% | 15.11% |

Abstract
The rapid development of Large Language Models (LLMs) for code generation has transformed software development by automating coding tasks with unprecedented efficiency. However, the training of these models on open-source code repositories (e.g., from GitHub) raises critical ethical and legal concerns, particularly regarding data authorization and open-source license compliance. Developers are increasingly questioning whether model trainers have obtained proper authorization to use repositories for training, especially given the lack of transparency in data collection.
To address these concerns, we propose a new data marking framework RepoMark to audit the data usage of code LLMs. Our method enables auditors to verify whether their code has been used in training, while ensuring semantic preservation, imperceptibility, and a theoretical guarantee on false detection rate (FDR). By generating multiple semantically equivalent code variants, RepoMark introduces data marks into the code files, and during detection, RepoMark leverages a new ranking-based hypothesis test to detect model behavior difference on trained data. Compared to prior data auditing approaches, RepoMark significantly enhances data efficiency, allowing effective auditing even when the user’s repository possesses only a small number of code files.
Experiments demonstrate that RepoMark achieves a detection success rate over 90% on small code repositories under a strict FDR guarantee of 5%. This represents a significant advancement over existing data marking techniques, all of which only achieve accuracy below 55% under identical settings. These results validate RepoMark as a robust, theoretically sound, and promising solution for enhancing transparency in code LLM training, which can safeguard the rights of code authors.
1 Introduction
The rapid advancement of Large Language Models (LLMs) in code generation has revolutionized software development, enabling developers to automate coding tasks with unprecedented efficiency. Open-source communities such as GitHub, which maintain a large set of code repositories, have become the main source of training data for code LLMs [1, 2, 3]. However, the use of publicly available code repositories to train these models has raised critical ethical and legal questions.
A primary concern is data authorization—specifically, whether model developers have obtained proper consent from code authors to use their repositories for training purposes. While source code itself may not always constitute personal data, developers’ rights regarding the usage and distribution of their code align with broader data protection principles established by regulations such as the General Data Protection Regulation (GDPR) [4] in Europe, which grant data owners the right to know how their data is used. Nevertheless, the training of machine learning models often suffers from a pervasive lack of transparency. Model trainers rarely disclose the details of the origin of their training data [5], and the opaque data collection processes employed in training code LLMs make it difficult to audit whether proper authorization was obtained from code authors for model training. This lack of transparency is especially concerning given the emergence of commercial products such as GitHub Copilot [6], Amazon CodeWhisperer [1], Cursor [3], Tabnine [7], and Codeium [8]. Exploiting source code without authorization to train models for commercial purposes directly violates the rights and contributions of original authors and undermines the spirit of the open-source community.
Given these ethical and legal concerns, tracing the data-usage of code LLMs has become a pressing challenge. The most prevalent approach to address this challenge is data-usage auditing [5], which enables auditors to verify whether their data has been used to train a machine learning (ML) model. Existing data-usage auditing methods can be categorized into two classes: membership inference and data marking. Membership inference [9, 10, 11] infers if a data sample is a member of an ML model’s training set without modifying the training data. In contrast, data marking [12, 13, 5, 14] modifies the data samples prior to publication, enabling detection methods to exploit statistical signals introduced during marking to detect data-usage in training. In general, data marking leverages more information and often achieves higher detection accuracy [5] than membership inference. Therefore, we design an auditing method for code LLMs based on the data marking paradigm in this paper.
To effectively audit code LLMs, our auditing method should satisfy four properties: (1) Preservation of Semantics: The modifications introduced by the marking method must retain the original code’s semantics. (2) Imperceptibility: The code marks should be difficult for model trainers to identify, as they are incentivized to remove the marked code files to avoid being caught. (3) Data Efficiency: In the real world, many individual repositories only have 10–50 code files. The method should maintain sufficient accuracy in detecting model training on such small-scale repositories. (4) FDR Guarantee: The method should provide a guarantee on the false detection rate (FDR). FDR is defined as the probability of incorrectly flagging a target model as having been trained on a repository when no such training occurred. A theoretical FDR guarantee is imperative, as false accusations against the model trainer could lead to significant reputational harm and ethical breaches.
However, no existing data marking scheme satisfies all four properties simultaneously. First, many methods—e.g., CodeMark [15] and Huang et al. [5] (CCS’24)—are not sufficiently data efficient and deliver suboptimal accuracy when auditing small individual repositories, as shown in our experiments. Second, as noted by [5], most auditing approaches do not provide a provable guarantee on FDR.
To simultaneously achieve the four desired properties, we first introduce a new code auditing framework that provides enhanced data efficiency and a provable FDR guarantee, thereby addressing the latter two properties. Building upon this framework, we design a concrete marking algorithm specifically tailored to the code domain. Together, the framework and the marking algorithm satisfy all four desired properties.
Code auditing framework. In our code auditing framework, we consider three parties: the code author, the model trainer, and the auditor (e.g., a court). The code author owns a repository composed of multiple source files. Before releasing it publicly, the author embeds marks into the code to enable future auditing. To insert these marks, the author generates semantically equivalent variants for each code file and randomly selects one version to publish while keeping the remaining variants private. Later, the model trainer collects large amounts of online code without authorization and uses it to train a code LLM, which may include the marked repository. During the detection phase, the auditor computes the code LLM-based loss of the published version and all its variants, ranks the published version within each set, and performs a hypothesis test over the aggregated ranks across all files to determine whether the model was trained on the marked repository.
The core idea of our framework is to capture and aggregate the statistical behavior differences between the trained and untrained cases for each code file. In particular, due to the randomness in the selection of the published version, if the code LLM is not trained on a code file, the rank of the published version will be uniformly distributed among $\{1, \dots, m\}$, where $m$ is the number of versions generated for that file. In contrast, if the code LLM is trained on a code file, then the rank is likely to be biased towards $1$. Our detection algorithm leverages the summation of the ranks across different code files to amplify this bias for detection. Due to the uniform property of the rank distribution under the untrained case, the FDR of our method is theoretically upper-bounded.
Code marking algorithm. The remaining challenge is how to design a dedicated marking algorithm and its corresponding detection algorithm for code that fits into our framework. The critical property our marking algorithm has to satisfy is that the variants of code it generates all preserve the program semantics, while the difference between the variants and the original code should be imperceptible to the model trainer.
To preserve the program semantics of marked code, our marking algorithm focuses on renaming local variables. For each code file, we select one variable to rename. A similar variable set of size $m$ (the number of versions per file) is generated based on token likelihoods computed by an oracle code LLM. The selected variable is then renamed to a variable randomly chosen from this set. The different variable names chosen from this set correspond to the $m$ versions of the code file required by our code auditing framework. The integration of code variable renaming into our data marking framework yields RepoMark, which satisfies all four desired properties.
We further improve the data efficiency of RepoMark by injecting multiple marks into long code files. When the model is not trained on the marked file, the randomness in publication selection ensures that, during detection, the ranks of different injected marks within the same file are independent of each other and uniformly distributed over $\{1, \dots, m\}$, where $m$ is the number of versions per mark. This property guarantees the correctness of the detection of multiple marks per file and greatly improves the data efficiency of our framework.
RepoMark consistently achieves high detection accuracy across different code LLMs and datasets. Under a 5% FDR guarantee, it correctly identifies over 90% of the repositories used for training, whereas the best baseline detects only about 54%. Moreover, RepoMark maintains strong imperceptibility—the marked code has an average perplexity of 1.11, closely matching the unmarked code’s 1.04.
Our contributions are as follows:
- We propose a general code auditing framework with a theoretical FDR guarantee and better data efficiency.
- We propose a marking algorithm for code that can generate variants for a single code file while preserving program semantics and remaining imperceptible to the model trainer. Incorporating the marking algorithm into our general framework, we introduce RepoMark, a new data-usage auditing framework for code LLMs that simultaneously achieves the four desired properties.
- We validate the effectiveness of RepoMark through extensive experiments. Experiments show that it achieves high accuracy in detecting the training data-usage of code LLMs and significantly outperforms all previous baselines.
2 Background and related work
2.1 Code large language models
Modern large language models (LLMs) typically utilize the Transformer architecture [16]. The LLM’s input context consists of a sequence of tokens, denoted as $x_1, \dots, x_t$. These LLMs predict subsequent tokens in an autoregressive manner. We denote the vocabulary of an LLM as $\mathcal{V}$. The LLM, denoted $f$, predicts the next token by first mapping the token sequence to a logit vector $\mathbf{z} \in \mathbb{R}^{|\mathcal{V}|}$:
$$\mathbf{z} = f(x_1, \dots, x_t).$$
Then the model uses a decoding strategy to decide the next token based on the logit vector. Commonly adopted strategies include top-$k$ sampling [17] and nucleus sampling [18]. The newly generated token is then appended to the model input context to generate the next token. This autoregressive process repeats until a stop condition is met, typically when a special end-of-sequence token is generated or when a maximum sequence length is reached.
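The two decoding strategies named above can be illustrated with a minimal sketch in plain Python. The function names (`softmax`, `top_k_filter`, `nucleus_filter`, `sample`) and the toy logits are illustrative assumptions, not part of any particular LLM library.

```python
import math
import random

def softmax(logits):
    """Convert a logit vector into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable token ids and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def nucleus_filter(probs, p):
    """Keep the smallest top set whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def sample(dist, rng):
    """Draw one token id from a {token_id: probability} mapping."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]

# Toy vocabulary of four tokens (hypothetical logits).
probs = softmax([2.0, 1.0, 0.1, -1.0])
next_token = sample(nucleus_filter(probs, 0.9), random.Random(0))
```

Both filters truncate the distribution before sampling; they differ only in whether the cut-off is a fixed count or a cumulative probability mass.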
Code LLMs are large language models designed to assist developers in software engineering tasks such as code completion, debugging, and documentation. They share the same architecture as general-purpose LLMs but are trained on a large number of code repositories.
2.2 Membership inference attack
Membership inference attack (MIA) is a type of confidentiality attack in machine learning, which aims to infer whether a particular data sample has been used to train a target ML model [9, 10, 11, 19]. Existing MIA methods can be classified into loss-based attacks [10, 11, 20, 21] and shadow model-based attacks [9, 19].
MIA can serve as a passive data-usage auditing method, as it does not require any prior modification or marking of the training data. Most existing MIAs operate at the instance level, aiming to identify whether a single example was part of the training data. A few works [22, 23, 24] have generalized MIA to the dataset level, detecting whether a given dataset was used to train a model by aggregating instance-level MIA results across data points. This line of work is more closely related to our scenario, where an auditor seeks to detect whether a code LLM has been trained on a given code repository. However, applying MIA to audit the data-usage of code LLMs faces a critical limitation: although thresholds for MIA methods can be calibrated under experimental conditions by setting an empirical FDR using known member and non-member labels, in real-world auditing scenarios, where such labels are unavailable, there is no guarantee on the FDR of MIA methods.
2.3 Data marking
Data marking is a type of proactive technique that allows a data owner to audit the use of their data in a target ML model [5, 13, 25, 26, 12, 14]. Such methods usually consist of a marking algorithm that embeds marks into data and a detection algorithm that tests whether the data has been used to train a given model. Prevalent data marking methods can be divided into two categories: backdoor marking and contrastive marking.
2.3.1 Backdoor marking
Backdoor marking has been extensively studied in prior work [13, 25, 26]. In this paradigm, the data owner poisons a small fraction of the training set by injecting a trigger pattern and relabeling these samples to a chosen target class—mirroring standard backdoor training [27, 28, 29, 30]. Then, the data owner releases the marked dataset. During detection, the owner adds the trigger to new data points and examines whether the model outputs the chosen target label.
To the best of our knowledge, CodeMark [15] is the only work that designs a backdoor marking scheme specifically for code LLMs. It embeds triggers as co-occurrence patterns within the source code. Detection is then performed by testing on new code files whether these patterns frequently appear in the model’s outputs.
Applying backdoor marking to code faces an inherent data efficiency challenge: in practical auditing, the target repository may contain very few files, yielding an extremely low poisoning ratio and, consequently, weak detection power. Empirically, as shown in Table I, CodeMark [15] detects only a small fraction of marked repositories. Fundamentally, backdoor marking methods rely on learning a strong trigger signal that can generalize across samples, which is hard to satisfy under very low poisoning rates.
2.3.2 Contrastive marking
To improve data efficiency, another line of research explores contrastive marking [14, 5, 31, 32]. In this paradigm, the data owner generates multiple semantically equivalent variants for each data point and releases only one of them publicly. During detection, the owner compares the target model’s behavior (e.g., loss) on the published version against its unpublished counterparts. If the published version was not used during training, the losses across all variants should be similar; otherwise, the published version typically exhibits a smaller loss.
Compared to backdoor methods that rely on learning strong trigger signals, contrastive marking exploits subtle statistical differences in the model’s behavior between seen and unseen data. Consequently, when the data owner controls very few data samples (e.g., a small code repository), contrastive marking achieves substantially higher detection accuracy, as demonstrated in our experiments (Table I).
Next, we briefly introduce two contrastive marking methods [14, 5] that support auditing LLMs. Wei et al. [14] create multiple variants for each data file by appending randomly generated character sequences. During detection, the owner feeds each original (unmodified) data file into the target model and measures the number of matched tokens for each corresponding public and private version. They then empirically construct the distribution of matched token lengths under the null hypothesis—i.e., assuming the model was not trained on the published data—using the matched token lengths of the unpublished sequences. The key intuition is that if the model has not seen the published version during training, its output probabilities for different random sequences should be nearly uniform; conversely, if it has been trained on the published sequence, it will exhibit a higher likelihood of reproducing that sequence. A hypothesis test is subsequently performed to determine whether the matched token length on the published sequences significantly exceeds the null-hypothesis distribution, thereby indicating that the model has likely been trained on them. However, this method is difficult to apply to code LLMs, since appending random character sequences (e.g., in code comments) produces conspicuous patterns that can be easily detected and automatically removed by model trainers.
More recently, Huang et al. [5] proposed a general contrastive marking framework that supports both image classifiers and language models. To the best of our knowledge, it is the only data-usage auditing framework that provides a rigorous FDR guarantee. Before data publication, the data owner generates two slightly modified versions of each raw data file, publishes one version chosen uniformly at random, and keeps the other private. During detection, the owner computes the loss for both versions and counts the number of published files whose loss is smaller than that of their private counterparts. If the published data were not used for training, this count follows a Binomial distribution with mean $n/2$, where $n$ is the number of published files. A hypothesis test on this statistic then provides a provable FDR bound for detecting data-usage. However, this method is primarily designed for dataset-level auditing. When applied to individual repositories with only a few code files, it results in suboptimal detection accuracy, as demonstrated in our experiments (Table I).
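To make the counting test concrete, the following is a simplified sketch of a one-sided binomial test on paired losses. It is an illustration under stated assumptions, not the exact procedure of [5]: the function names `binomial_tail` and `detect_pairwise`, the strict-inequality tie handling, and the toy losses are all hypothetical.

```python
from math import comb

def binomial_tail(n, c):
    """P[X >= c] for X ~ Binomial(n, 1/2)."""
    return sum(comb(n, i) for i in range(c, n + 1)) / 2 ** n

def detect_pairwise(published_losses, private_losses, alpha):
    """Flag training if the published version wins unusually often.

    A 'win' means the published version has strictly smaller loss than
    its private counterpart (assumption: ties count against detection).
    Returns (flagged, p_value).
    """
    n = len(published_losses)
    wins = sum(pub < priv for pub, priv in zip(published_losses, private_losses))
    p_value = binomial_tail(n, wins)
    return p_value <= alpha, p_value

# Toy example: ten files, the published version always has the lower loss.
flagged, p_value = detect_pairwise([0.1] * 10, [0.2] * 10, alpha=0.05)
```

Under this simplified test, a repository with only four files can never be flagged at the 5% level, because the smallest attainable p-value is $2^{-4} = 0.0625$. This is one way to see why a two-version scheme loses power on small repositories.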
3 Problem formulation
3.1 Threat model
In our code auditing framework, we consider three parties: the code author, the model trainer, and the auditor (e.g., a third-party authority). The code author owns a code repository $D$ consisting of $n$ code files that will be released online on a platform like GitHub. The model trainer aims to train a code model with strong coding capabilities on a large set of code data. The model trainer assembles a training dataset $D_{\text{train}}$ by collecting code online from many repositories without authorization. As such, after publication, the code author’s data constitutes a subset of the model trainer’s collected dataset (i.e., the repository $D \subseteq D_{\text{train}}$). The code LLM trainer trains a model on $D_{\text{train}}$ using a learning algorithm. Then the model trainer deploys the trained code LLM to provide services to consumers (e.g., to monetize the trained code LLM).
The auditor is responsible for determining whether the deployed model has been trained on the code author’s marked repository. During the detection phase, the auditor computes the loss based on the code LLM for the published version and all its private variants, ranks the published version within each set, and performs a hypothesis test over the aggregated ranks across all files to determine whether the model has been trained on the marked repository.
Following the threat model adopted in prior data-usage auditing works [5, 14, 12], we assume that the auditor has access to the code LLM’s output logits during inference, but has no access to its internal architecture or parameters. The auditor has full access to the marked repository, including all code files, the corresponding renaming positions, and the generated variants.
In closed-source code LLM auditing scenarios (e.g., commercial models such as Cursor [3]), the auditor can be a third-party authority, such as a court. Although modern commercial models typically do not expose logits to end users, it is reasonable in copyright litigation (e.g., The New York Times v. OpenAI [33]) for courts to request model providers to disclose such runtime information for auditing purposes. In contrast, in open-source code LLM auditing scenarios, the auditor can simply be the code author themselves, as the logits can be directly computed from the publicly available model.
3.2 Design goals
As mentioned in Section 1, an effective code LLM data-usage auditing method should achieve the following properties:
- Preservation of semantics: All modifications should preserve the original program’s semantics.
- Imperceptibility: The modifications to the code should be difficult for the trainer of the code LLM to detect.
- Data Efficiency: The method should accurately detect training data-usage even when the repository contains only a small number of code files (e.g., 10–50).
- FDR Guarantee: The method should provide a statistical upper bound on its FDR, which is the probability of incorrectly identifying the code LLM as having been trained on the target repository when it was not.
An FDR guarantee of $\alpha$ limits the probability of wrongly accusing a model of training on a protected repository to below $\alpha$, which is critical for preventing false claims and associated ethical risks. Under a given FDR constraint, the effectiveness of auditing is measured by the detection success rate (DSR), defined as the probability that the method flags the model as trained on a repository when the model was in fact trained on it. A high DSR indicates that the method can detect most training repositories, which is essential for effective auditing.
Although probabilistic evidence is not as definitive as deterministic proof, it is often sufficient and widely accepted in legal practice. A classic example is DNA matching: it does not claim that “this DNA definitely belongs to the suspect,” but rather that “the suspect’s DNA matches, and the probability of a random individual also matching is 1 in 10,000.” Moreover, in copyright infringement cases, evidence only needs to meet the lower “preponderance of the evidence” standard [34] under US civil law, which requires only that a claim be more likely true than not. All these examples demonstrate the practical potential of data-usage auditing methods that offer strong FDR guarantees when providing deterministic results is not feasible.
4 Methodology
4.1 New auditing framework
We first introduce a new, data-efficient contrastive marking framework with a provable FDR guarantee, designed to detect whether a code LLM has been trained on the marked code repository $D$. This framework serves as the foundation of RepoMark. The core idea is to generate multiple semantically equivalent versions for each code file, randomly select one version to publish, and use the sum of the loss ranks of the published versions to perform hypothesis testing for detection.
At the marking phase, the code author creates $m$ different equivalent versions of each code file $x_i$, $i \in \{1, \dots, n\}$, denoted as $x_i^{(1)}, \dots, x_i^{(m)}$. For each $x_i$, the code author chooses an index $j_i \in \{1, \dots, m\}$ uniformly at random, and publishes $x_i^{(j_i)}$ while keeping all of the other versions of $x_i$ private. During the detection phase, for each code file $x_i$, we query the target code LLM with all of its versions and compute the corresponding losses, forming the set $\{\ell_i^{(1)}, \dots, \ell_i^{(m)}\}$. Our detection algorithm relies only on the rank of $\ell_i^{(j_i)}$ among the losses of the variants, denoted $r_i$, where rank $1$ corresponds to the smallest loss. Using ranks enables rigorous probabilistic guarantee analysis, which is difficult to achieve by analyzing loss values directly, as the latter requires assumptions about the distribution of the loss values across the variants.
Our key theoretical insight is as follows: if the model is not trained on $x_i^{(j_i)}$, then $r_i$, the rank of $\ell_i^{(j_i)}$, is uniformly distributed among $\{1, \dots, m\}$. The uniform randomness stems from the random sampling of $j_i$, which removes the potential bias in different versions of $x_i$. However, if the model is trained on $x_i^{(j_i)}$, since the model is not trained on the other versions, $\ell_i^{(j_i)}$ is more likely to have a relatively small rank (e.g., less than $(m+1)/2$).
Now, we can leverage hypothesis testing to build a detection algorithm with a rigorous FDR guarantee. The statistical quantity we consider for hypothesis testing is the sum of the ranks of all published data, formally defined as $S = \sum_{i=1}^{n} r_i$.
We set the rank sum threshold as $T$. If the model is not trained on the repository $D$, then each $r_i$ is uniformly distributed among $\{1, \dots, m\}$ due to the randomness of $j_i$. By the central limit theorem, $S$ will be close to its expectation $n(m+1)/2$ with high probability. Thus, for thresholds $T$ less than $n(m+1)/2$ with sufficient margin, if the model is not trained on the repository $D$, the event $S > T$ occurs with high probability. In contrast, if the model is trained on the repository $D$, for each sample $x_i$, $r_i$ is likely to be smaller than $(m+1)/2$, and $S$ would be much more likely smaller than $T$. As such, the detection problem of our scheme can be formulated as testing the following hypotheses:
- $H_0$: the model is not trained on $D$, so each $r_i$ is uniformly distributed over $\{1, \dots, m\}$.
- $H_1$: the model is trained on $D$, so each $r_i$ is biased towards smaller values.
Under $H_0$, $S$ follows the distribution of the sum of $n$ i.i.d. variables, each uniformly distributed among $\{1, \dots, m\}$. As such, during detection, the auditor can reject or accept $H_0$ according to the value of $S$. The FDR of this testing procedure is provably upper-bounded, as the bound corresponds to the probability of rejecting $H_0$ when it is actually true. Given the upper-bounded FDR, rejecting $H_0$ implies that, with high probability, the target model has been trained on the published samples. The FDR guarantee could be adjusted by altering $T$.
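The rank-sum test described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the reference implementation: the names `rank_sum_pmf`, `threshold_for_fdr`, `loss_rank`, and `audit` are hypothetical, the null distribution is computed exactly by convolution instead of a normal approximation, and ties in loss are broken pessimistically (towards a larger rank) so that they cannot inflate detections.

```python
from fractions import Fraction

def rank_sum_pmf(n, m):
    """Exact null distribution of the sum of n i.i.d. ranks uniform on 1..m."""
    pmf = {0: Fraction(1)}
    step = Fraction(1, m)
    for _ in range(n):
        nxt = {}
        for total, prob in pmf.items():
            for rank in range(1, m + 1):
                nxt[total + rank] = nxt.get(total + rank, 0) + prob * step
        pmf = nxt
    return pmf

def threshold_for_fdr(n, m, alpha):
    """Largest T with P[S <= T] <= alpha under the null, or None if none exists."""
    pmf = rank_sum_pmf(n, m)
    cumulative, best = Fraction(0), None
    for total in range(n, n * m + 1):
        cumulative += pmf[total]
        if cumulative > alpha:
            break
        best = total
    return best

def loss_rank(losses, published_index):
    """Rank of the published version's loss (1 = smallest).

    Counts every version whose loss is <= the published loss, so ties
    push the rank up rather than down.
    """
    target = losses[published_index]
    return sum(loss <= target for loss in losses)

def audit(all_losses, published_indices, m, alpha):
    """Return (detected, rank_sum, threshold) for one repository."""
    ranks = [loss_rank(l, j) for l, j in zip(all_losses, published_indices)]
    rank_sum = sum(ranks)
    threshold = threshold_for_fdr(len(ranks), m, alpha)
    detected = threshold is not None and rank_sum <= threshold
    return detected, rank_sum, threshold

# Toy example: 10 files, 10 versions each, published version always has lowest loss.
toy_losses = [[0.1] + [1.0] * 9 for _ in range(10)]
detected, rank_sum, threshold = audit(toy_losses, [0] * 10, m=10, alpha=0.05)
```

Because the smallest possible rank sum has null probability $m^{-n}$, increasing the number of versions $m$ lets the test reach a given significance level with far fewer files than a two-version comparison.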
4.2 Semantic preservation and imperceptible code marking
Under our code auditing framework, a core component is the design of an algorithm that generates semantically equivalent variants of each code file, while preserving both functionality and readability. In this section, we focus on the design of this algorithm, which creates variants for each code file based on a single marking position. We later extend this method to support multiple marking positions within a single file in Section 4.4, which further improves the data efficiency of our approach.
The preservation of code semantics is a critical property our marking algorithm has to achieve, because disruptions to code semantics would negatively impact the repository’s readability and functionality, hindering our scheme’s real-world deployment. Imperceptibility is also important for a practical marking algorithm, as the malicious model trainers are incentivized to remove the marks to avoid tracing.
To achieve both code semantics preservation and imperceptibility simultaneously, we propose a natural strategy of only renaming variables. In our marking algorithm, we only rename variables that are local and consist of a single token. To apply a mark to a code file, we first select a variable to rename. Once a variable is selected, we use an oracle model (another code LLM) to propose alternative variable names, forming a similar variable name set of size $m$, the number of versions per file. Each of the $m$ versions of the code file is generated by renaming the selected variable with one of these alternatives.
A strawman approach is to randomly select some variables and construct the alternative list using their synonyms. However, this approach is limited by the typically small number of synonyms available for each variable. Since higher values of $m$ lead to better detection accuracy, we instead focus on variables that admit a large set of alternative names. Crucially, these alternatives must have similar predicted likelihoods under the oracle model. This requires the original variable name to have a relatively low predicted likelihood, so that more alternative names fall within a similar likelihood range.
This leads to a key insight: we should prefer variables whose names have a low likelihood when predicted by the oracle model. However, a challenge arises because variables typically appear multiple times within a file, and their predicted likelihood varies across occurrences. To address this, we determine a variable’s likelihood based on its first occurrence. This choice is motivated by an important observation: code LLMs find it significantly harder to predict a variable’s name at its first appearance compared to later ones. This is intuitive—once a variable has been introduced, subsequent references are easier to guess, both for models and humans. Predicting the first appearance is much harder as it demands reasoning from local code patterns and broader coding conventions, rather than relying on repetition.
In conclusion, we focus on single-token variables whose first occurrence has a relatively high logits rank (that is, a low predicted likelihood) under the oracle model. Let $q$ denote the logits rank of the token corresponding to the variable name. We require that $q \geq \tau$, where $\tau$ is a predefined threshold, and we ensure that $\tau$ is large relative to the number of versions $m$ to allow enough similar candidates. For each selected variable, we form its similar variable name set $\mathcal{C}$ by collecting tokens whose logits ranks fall in a range close to $q$. Finally, to guarantee that renaming does not alter program semantics, we parse the code file using a lightweight static analysis tool to build its abstract syntax tree (AST). For each token in the set $\mathcal{C}$, we rename all occurrences of the target variable in the AST to produce each marked version of the code file.
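A minimal sketch of this selection step follows, assuming the oracle's logits at the variable's first occurrence are available as a plain dictionary from token string to logit. The function name `candidate_names`, the parameters `m` and `tau`, and in particular the choice of a window of `m` consecutive ranks around the original name are assumptions made for illustration; the exact rank range is not fixed by this sketch.

```python
import keyword

def candidate_names(logits, original, m, tau):
    """Collect alternative names whose oracle logit rank is close to the original's.

    logits:   dict mapping token string -> oracle logit at the first occurrence
    original: the variable's current (single-token) name
    m:        desired number of versions (size of the candidate window)
    tau:      minimum 1-based logit rank for a variable to be eligible
    Returns an empty list if the original name is too predictable.
    """
    ranked = sorted(logits, key=logits.get, reverse=True)  # rank 1 = highest logit
    q = ranked.index(original) + 1
    if q < tau:
        return []
    # Assumed window: m consecutive ranks roughly centered on the original name.
    low = max(0, q - 1 - m // 2)
    window = ranked[low:low + m]
    # Keep only tokens that are legal, non-keyword identifiers.
    return [t for t in window if t.isidentifier() and not keyword.iskeyword(t)]

# Toy oracle output over a hypothetical 100-token vocabulary.
toy_logits = {f"tok{i}": -float(i) for i in range(100)}
names = candidate_names(toy_logits, "tok50", m=10, tau=20)
```

In practice the candidate list would also need to exclude names already in scope; that collision check belongs to the renaming step and is omitted here.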
Intuitively, our marking method can be viewed as a heuristic that renames variables whose original names “surprise” the oracle code LLM, replacing them with alternative names that have similar oracle rankings. We choose these “surprising” variables because, if we instead chose variables with small oracle ranks, imperceptibility would force us to pick other tokens that also have small ranks, resulting in a small search space of equivalent variants (i.e., a small $m$). Renaming “surprising” variables allows for a much larger $m$, thereby increasing the information contained in the rank and improving detection accuracy.
The imperceptibility property of this strategy can be explained from the perspective of the perplexity of the oracle model. The perplexity depends on the logits of each token. The different versions have similar logit ranks under the oracle model, and LLM logits typically follow a long-tail distribution. Since the selected ranks lie in this long tail, the logits of all versions are likely to be very similar to each other. This indicates that the change to the code caused by renaming variables is imperceptible as measured by the perplexity of the oracle model.
The complete marking procedure for a single code file is formalized in Algorithm 1. During repository-level auditing, this algorithm is applied iteratively to each file.
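The selection rule described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not Algorithm 1 itself: the function name `select_alternatives`, the toy `logits` dictionary, and in particular the choice of a window of `m` consecutive ranks centred on the original name's rank are hypothetical, since the exact rank range is not given here.

```python
import keyword

def select_alternatives(original, logits, min_rank, m):
    """Pick candidate names whose oracle rank is close to the original's.

    `logits` maps each candidate token to its oracle logit at the position
    of the variable's first occurrence (toy stand-in for a real model).
    Returns None when the original name is too predictable to be marked.
    """
    # Rank 1 is the most likely token; ties are broken alphabetically.
    ordered = sorted(logits, key=lambda tok: (-logits[tok], tok))
    rank = ordered.index(original) + 1
    if rank < min_rank:
        return None
    # Assumed window: m consecutive ranks roughly centred on the original.
    lo = max(0, rank - 1 - m // 2)
    window = ordered[lo:lo + m]
    # Keep only tokens that are legal, non-keyword identifiers.
    return [t for t in window if t.isidentifier() and not keyword.iskeyword(t)]

toy_logits = {"x": 9.0, "i": 8.0, "data": 3.0, "buf": 2.9,
              "tmp": 2.8, "acc": 2.7, "for": 2.6, "val": 2.5}
print(select_alternatives("tmp", toy_logits, min_rank=3, m=4))
```

The published version would then be drawn uniformly at random from the returned list, which is the property the detection analysis relies on.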
4.3 Detection with FDR guarantee
In this section, we provide a detailed description and analysis of RepoMark’s detection algorithm, illustrated in the detection figure (fig:detection). Its high-level idea is to capture and aggregate the statistical behavior differences between the trained and untrained cases for each code file. In particular, the auditor computes the loss ranks of the published version in different code files, and tests whether the rank sum deviates significantly from what is expected under the null hypothesis $H_0$ (the untrained case).
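We first analyze the rank distribution of each code file under the null hypothesis $H_0$ (untrained case). We denote the code LLM weight as $\theta$, and it follows an unknown distribution $\mathcal{P}$. The target model with weight $\theta$ accepts the token sequence $c$ as input and predicts the probability $p_\theta(v \mid c)$ of variable name $v$ being the next token after $c$. For simplicity, we denote the corresponding cross-entropy loss as $\ell_\theta(v)$. We denote the $m$ versions of variable renamings as $v_1, \dots, v_m$, and the published version’s ID as $t$, which is selected uniformly at random from $\{1, \dots, m\}$. For the rank $R$ of the published version’s loss $\ell_\theta(v_t)$ among $\{\ell_\theta(v_1), \dots, \ell_\theta(v_m)\}$, we have the following theorem:
Theorem 1.
Under $H_0$, $R$ is uniformly distributed among $\{1, \dots, m\}$.
Proof.
Under $H_0$, $\theta$ is independent of $t$. For all $r \in \{1, \dots, m\}$, we have:
$$\Pr[R = r] = \mathbb{E}_{\theta \sim \mathcal{P}}\big[\Pr[R = r \mid \theta]\big]. \tag{1}$$
Given $\theta$, $\{\ell_\theta(v_1), \dots, \ell_\theta(v_m)\}$ is a deterministic set, with only one element in it having rank $r$ (ties are broken by a fixed rule). Since $t$ is uniform over $\{1, \dots, m\}$ and independent of $\theta$, for all $r \in \{1, \dots, m\}$, we have:
$$\Pr[R = r \mid \theta] = \frac{1}{m}.$$
Substituting the above equation into Equation 1, we have:
$$\Pr[R = r] = \mathbb{E}_{\theta \sim \mathcal{P}}\left[\frac{1}{m}\right] = \frac{1}{m}.$$
∎
The above theorem shows that when the model $\theta$ is not trained on the published version $v_t$, the rank of the published version’s loss among the losses of all $m$ versions is uniformly distributed over $\{1, \dots, m\}$. This uniform randomness essentially stems from the uniform selection of $t$ from $\{1, \dots, m\}$.
When $\theta$ is trained on the published version $v_t$, $\theta$ becomes dependent on $t$, and the rank $R$ is no longer uniformly distributed over $\{1, \dots, m\}$. While the complex behavior of LLMs makes the exact rank distribution difficult to characterize, training should bias ranks toward smaller values. In particular, the untrained mean rank is $(m+1)/2$, whereas the trained mean should be significantly smaller.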
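The counting argument behind the uniformity claim can be checked directly: for any fixed list of losses, enumerating every possible published version hits each rank exactly once, so a uniformly chosen published version has a uniformly distributed rank. The sketch below illustrates this; the helper names and the tie-breaking rule (by index) are assumptions made for the example.

```python
import random
from collections import Counter

def published_rank(losses, t):
    """1-based rank of losses[t] among all losses (smallest loss = rank 1)."""
    # Ties are broken by index so that every version gets a distinct rank.
    order = sorted(range(len(losses)), key=lambda i: (losses[i], i))
    return order.index(t) + 1

def rank_histogram(losses):
    """Count how often each rank occurs when t ranges over all versions."""
    return Counter(published_rank(losses, t) for t in range(len(losses)))

rng = random.Random(0)
losses = [rng.random() for _ in range(8)]  # stands in for a fixed, untrained model
print(rank_histogram(losses))  # every rank from 1 to 8 appears exactly once
```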
There are multiple code files in one repository. Therefore, we can leverage the rank sum to aggregate the message delivered by the mark in each code file. With more injected marks (larger $n$), the distribution of the rank sum of the marks under the untrained case is more concentrated around its mean $n(m+1)/2$, as we can see in Figure 3.
We denote the rank of the published version at the $i$-th mark position as $R_i$, the rank sum over the $n$ mark positions as $S = \sum_{i=1}^{n} R_i$, and the hyperparameter of rank sum threshold as $\tau$. If the model is not trained on the target repository, by the central limit theorem, $S$ will be close to its expectation $n(m+1)/2$ with high probability. Therefore, if the model is not trained on the target repository, for $\tau < n(m+1)/2$ with a sufficient gap, $\Pr[S \le \tau]$ will be very small. In contrast, if the model is trained on the target repository, each $R_i$ is much more likely to have a value smaller than $(m+1)/2$, and consequently $S$ is likely to be smaller than $\tau$. As such, the detection problem of our scheme can be formulated as the following hypothesis test:
- $H_0$: each $R_i$ is uniformly distributed over $\{1, \dots, m\}$
- $H_1$: each $R_i$ is biased towards smaller values
Under $H_0$, $S$ follows the distribution of the sum of $n$ i.i.d. random variables uniformly distributed among $\{1, \dots, m\}$. The auditor can reject or accept $H_0$ based on whether $S \le \tau$. In other words, the auditor detects whether their repository was used to train the target code LLM according to whether $S \le \tau$. The FDR of this test is provably upper-bounded by the Type I error (the probability that $H_0$ is incorrectly rejected), which corresponds to the probability that the sum of these $n$ uniform i.i.d. random variables is less than or equal to $\tau$. Thus, the FDR guarantee can be controlled by altering the threshold $\tau$.
To determine the appropriate threshold, we can first compute the cumulative distribution function (CDF) of $S$ under $H_0$, and then perform a binary search on the CDF table. The CDF of the above distribution can be computed using generating functions [35, 36, 37]. The generating function for a single uniform discrete variable over $\{1, \dots, m\}$ is $G(x) = \frac{1}{m}\sum_{j=1}^{m} x^{j}$. For the sum of $n$ i.i.d. discrete uniform random variables over $\{1, \dots, m\}$, the generating function is $G(x)^{n}$.
To extract the probability mass function (PMF), we expand $G(x)^{n}$ into its sequence of polynomial coefficients. We then extract the coefficient of $x^{s}$, which gives $\Pr[S = s]$: $\Pr[S = s] = [x^{s}]\,G(x)^{n}$. The polynomial coefficients of $G(x)^{n}$ can be efficiently computed using Fast Fourier transform [38]. Then we can compute the CDF table leveraging the computed PMF, and use binary search on the CDF array to find the largest $\tau$ such that $\Pr[S \le \tau] \le \alpha$, where $\alpha$ is the target FDR bound. The detection procedure of RepoMark is presented in Algorithm 2.
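The threshold computation can be sketched as follows. This is an illustrative sketch rather than Algorithm 2: it expands the generating function with plain integer convolution instead of the Fast Fourier transform, and scans the CDF linearly instead of using binary search, so that all arithmetic stays exact. The names `n` (number of marks), `m` (versions per mark), `alpha` (FDR bound), and `tau` (threshold) are assumed for the example.

```python
from fractions import Fraction

def rank_sum_counts(n, m):
    """counts[s] = number of ways n ranks in {1..m} sum to s,
    i.e. the coefficients of (x + x^2 + ... + x^m)^n."""
    counts = [1]  # the empty sum equals 0 in exactly one way
    for _ in range(n):
        nxt = [0] * (len(counts) + m)
        for s, ways in enumerate(counts):
            if ways:
                for r in range(1, m + 1):
                    nxt[s + r] += ways
        counts = nxt
    return counts

def detection_threshold(n, m, alpha):
    """Largest tau with Pr[S <= tau | H0] <= alpha, or None if none exists."""
    counts = rank_sum_counts(n, m)
    total = m ** n  # number of equally likely rank tuples under H0
    tau, cumulative = None, 0
    for s in range(n, n * m + 1):
        cumulative += counts[s]
        if Fraction(cumulative, total) > alpha:
            break
        tau = s
    return tau

def detect(ranks, m, alpha):
    """Reject H0 (report training usage) when the rank sum is at most tau."""
    tau = detection_threshold(len(ranks), m, alpha)
    return tau is not None and sum(ranks) <= tau

# Two marks with six versions each behave like two dice under H0.
print(detection_threshold(2, 6, Fraction(1, 12)))
```

When `alpha` is too small for the given `n` and `m`, no threshold satisfies the bound and the sketch never reports a detection, which reflects why more marks (or more versions) are needed for strict FDR levels.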
4.4 Further improving the data efficiency of RepoMark
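A remaining problem is that, for a lengthy code file, our current design cannot effectively utilize its information redundancies, as it only injects one mark for each file. To better utilize the information redundancy in lengthy code files, we further derive the case of renaming multiple different variables in a single file. We follow the notations used in Section 4.3, writing $R_j$ for the rank of the published version at the $j$-th mark, $t_j$ for its published version ID, and $c_j$ for the token sequence preceding it. Under $H_0$, the core difference when injecting multiple marks is that $c_j$ is no longer a deterministic token sequence, but a random sequence, whose randomness is introduced by the random selection of previously renamed variables in the same file. Under $H_0$, $\theta$ is independent of $t_1, \dots, t_j$, and hence of $c_j$. Denote the possible set of context strings as $\mathcal{C}_j$. We have:
$$\Pr[R_j = r] = \mathbb{E}_{\theta}\Big[\sum_{c \in \mathcal{C}_j} \Pr[c_j = c]\,\Pr[R_j = r \mid \theta, c_j = c]\Big]. \tag{2}$$
Given $\theta$, $j$, and $c_j = c$, the losses of the $m$ versions form a deterministic set, while $t_j$ is still selected uniformly and independently of $\theta$ and $c_j$. Therefore,
$$\Pr[R_j = r \mid \theta, c_j = c] = \frac{1}{m}.$$
Combining this with Equation 2, we have:
$$\Pr[R_j = r] = \mathbb{E}_{\theta}\Big[\sum_{c \in \mathcal{C}_j} \Pr[c_j = c] \cdot \frac{1}{m}\Big] = \frac{1}{m}.$$
The same argument holds conditionally on $t_1, \dots, t_{j-1}$, which together with $\theta$ determine the earlier ranks, so $R_j$ is also independent of $R_1, \dots, R_{j-1}$. The above equation therefore shows that if the model weights are not trained on the file, then when multiple marks are injected into the same code file, the ranks of the published versions for each injected mark are still i.i.d. variables, uniformly distributed over $\{1, \dots, m\}$. This property ensures that RepoMark can support injecting multiple marks (i.e., renaming multiple variables) within a single file, making our method highly scalable to long code files.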
We use the mark sparsity threshold $K$ to control how many marks we should inject into one code file. Given mark sparsity threshold $K$, at most one mark can be injected per $K$ lines of code. In other words, for a code file with $L$ lines of code, we add at most $L/K$ marks. Following the previous construction, we only rename local variables whose first occurrence has an oracle model rank greater than or equal to the rank threshold. If there are more than $L/K$ candidate variables that satisfy this property, we randomly select $L/K$ variables among them. By choosing an appropriate $K$ value, we can effectively balance imperceptibility and detection accuracy. When a smaller $K$ is chosen, the detection is more accurate because the number of injected marks is larger. However, it also becomes easier for the model trainer to detect the mark.
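The mark budget rule can be illustrated with a short sketch. The function name `choose_mark_positions` is hypothetical, and rounding the budget down with integer division is an assumption, since the rounding convention is not stated here.

```python
import random

def choose_mark_positions(candidates, num_lines, K, rng=None):
    """Select which candidate variables to mark in one file.

    `candidates` are variables that already passed the rank threshold.
    At most one mark is allowed per K lines of code.
    """
    if rng is None:
        rng = random.Random()
    budget = num_lines // K  # assumed rounding: floor of L / K
    candidates = list(candidates)
    if len(candidates) <= budget:
        return candidates
    # More candidates than budget: pick uniformly at random.
    return rng.sample(candidates, budget)

picked = choose_mark_positions(["acc", "buf", "idx", "tmp"], num_lines=250, K=100,
                               rng=random.Random(0))
print(picked)  # two of the four candidates
```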
5 Experiments
5.1 Experimental setup
Models. We conduct experiments on data marking using Qwen2.5-Coder-1.5B [42], StarCoder2-3B [43], and InCoder-6B [44] as target models. Among them, Qwen2.5-Coder-1.5B is part of the latest Qwen2.5-Coder series, a family of large language models specialized for code understanding and generation. Models in this family achieve state-of-the-art performance among open-source code LLMs. StarCoder2-3B, released in 2024, is a popular open-source code LLM that employs advanced attention mechanisms and a large context window. InCoder-6B, developed by Facebook, stands out for its ability to perform both standard left-to-right code generation and code infilling, making it a versatile tool for various coding tasks. For the oracle LLM leveraged by RepoMark, we use another code LLM Yi-Coder-1.5B [45].
Datasets. Following prior work [46, 47, 48], we primarily carry out our experiments on three commonly used code datasets: CodeParrot [49], CodeSearchNet [50], and CodeNet [51]. We focus on Python code, though our techniques are also applicable to other languages such as C++, Java, and Rust. The CodeParrot, CodeSearchNet, and CodeNet datasets contain 410,210, 71,246, and 93,570 repositories, respectively. The number of files per repository ranges from 1 to 7,343, with an average of 12.28 files per repository across the three datasets.
In our experiments, we consider the training-from-scratch setting. We randomly select 1% of the repositories in the dataset as the target repositories to be protected. We inject marks into these repositories, while leaving the remaining repositories in the dataset unchanged. After training, we compute the DSR as the proportion of marked repositories that are successfully detected. Unlike dataset-level auditing methods [5, 15] that output a single binary decision for the entire corpus, our marking and detection operate independently for each repository, yielding an individual decision for each. Consequently, our detection performance does not depend on the proportion of repositories selected for protection.
Metrics. We evaluate RepoMark along two dimensions: its detection accuracy, as well as its impact on code quality.
For detection accuracy, we measure it using the DSR metric, which captures the fraction of marked repositories in the training set that are successfully identified. Following Huang et al. [5], we measure the DSR under different FDR. For data mark methods that have a theoretical FDR guarantee, we measure the DSR at different specified FDR guarantees; for methods that cannot provide a guarantee on FDR (e.g. membership inference methods), we measure the DSR at different specified empirical FDRs. It is worth noting that although under experimental conditions, we can compute empirical FDR for no-FDR-guarantee methods using member and non-member labels of all samples, such labels are not available in real-world auditing. Therefore, the empirical FDR values for these methods are not computable outside of the experimental setting.
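For baselines without a theoretical guarantee, measuring the DSR at a specified empirical FDR requires calibrating a score threshold on labelled non-members. The sketch below shows one way to do this; it assumes that the empirical FDR is the fraction of untrained (non-member) repositories that get flagged and that a higher score means "more likely used in training". The function name and this exact definition are assumptions for illustration.

```python
import math

def dsr_at_empirical_fdr(member_scores, nonmember_scores, target_fdr):
    """DSR of a score-based detector calibrated to an empirical FDR."""
    ranked = sorted(nonmember_scores, reverse=True)
    # Number of non-members we may flag without exceeding the target FDR.
    allowed = math.floor(target_fdr * len(ranked) + 1e-9)
    # Flag only scores strictly above the (allowed + 1)-th largest non-member.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    detected = sum(score > threshold for score in member_scores)
    return detected / len(member_scores)

nonmembers = [i / 10 for i in range(1, 11)]  # 0.1, 0.2, ..., 1.0
members = [0.85, 0.95, 0.5, 0.7]
print(dsr_at_empirical_fdr(members, nonmembers, 0.2))
```

The calibration step needs the non-member labels, which is exactly why such empirical FDR values cannot be computed in a real audit.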
For code quality, we mainly measure the extent to which the data marking method alters the original code. We adopt three metrics: CodeBLEU, edit distance, and perplexity. CodeBLEU [52] is a classical metric to measure the quality of code adopted in previous works [53, 54, 55]. It extends the traditional BLEU metric by incorporating syntax and semantics specific to programming languages, allowing for a more nuanced assessment of the code generated. It measures the similarity between the original code and the marked code, using a weighted combination of n-gram match, weighted n-gram match, AST match, and data-flow match scores. Edit distance [56, 57] is a common metric used to measure the difference between two strings. It quantifies the minimum number of insertion, deletion, and substitution operations required to transform one string into another. In our paper, to quantify the quality of marked code, we measure the edit distance between marked code and original code on a per-token basis. We also adopt the perplexity (PPL) metric, which is widely adopted in the NLP community, following previous works [58, 59, 60, 61]. In our experiments, the PPL scores are computed using the Qwen2.5-Coder-32B [42] model.
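As an illustration of the per-token edit distance, the following sketch implements the standard Levenshtein dynamic program over token lists. Splitting on whitespace is an assumed stand-in for a real code tokenizer.

```python
def token_edit_distance(a, b):
    """Minimum number of token insertions, deletions, and substitutions
    needed to turn token list `a` into token list `b`."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

original = "def f ( tmp ) : return tmp".split()
marked = "def f ( buf ) : return buf".split()
print(token_edit_distance(original, marked))  # 2: one per renamed occurrence
```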
Hyperparameters. In our experiments, we set the theoretical FDR guarantee to and the number of mark versions to $m = 100$. The mark sparsity parameter is set to $K = 100$ (i.e., there is at most one mark per 100 lines of code), and the mark position rank threshold is set to . For model training, the learning rate is initialized to . To avoid overfitting, following [62], we set the number of training epochs to 2; in other words, the model sees each training code snippet only twice.
AST Parser. We use the tree-sitter library [63, 64] to parse the code and perform transformations on AST. We first use tree-sitter to identify the function parameter names and variable names to inject our marks. After locating the marking positions, we utilize tree-sitter to perform variable renamings on the AST.
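To illustrate AST-based renaming, the sketch below uses Python's standard `ast` module in place of tree-sitter, so it is a stand-in for the idea rather than the pipeline described above. The class and function names are hypothetical. It renames every matching name and parameter in the file without scope analysis and does not rewrite keyword arguments at call sites, so it is only safe for simple local variables.

```python
import ast

class _Renamer(ast.NodeTransformer):
    """Rename every occurrence of one identifier (names and parameters)."""

    def __init__(self, old, new):
        self.old, self.new = old, new

    def visit_Name(self, node):
        if node.id == self.old:
            node.id = self.new
        return node

    def visit_arg(self, node):
        self.generic_visit(node)  # also visit the parameter's annotation
        if node.arg == self.old:
            node.arg = self.new
        return node

def rename_variable(source, old, new):
    """Return `source` with identifier `old` renamed to `new`."""
    tree = ast.parse(source)
    used = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
    used |= {n.arg for n in ast.walk(tree) if isinstance(n, ast.arg)}
    if new in used:
        raise ValueError(f"{new!r} is already used in this file")
    return ast.unparse(_Renamer(old, new).visit(tree))  # Python 3.9+

src = "def total(values):\n    acc = 0\n    for v in values:\n        acc += v\n    return acc\n"
renamed = rename_variable(src, "acc", "buf")
print(renamed)
```

Note that `ast.unparse` normalizes formatting and drops comments, which a real marking tool would need to avoid in order to keep the change imperceptible.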
5.2 Main results
RepoMark performs well on different models and datasets. We carry out the experiments with 3 different code models, Qwen2.5-Coder-1.5B, StarCoder2-3B, and InCoder-6B, and 3 different datasets, CodeParrot, CodeSearchNet, and CodeNet. We measure the effectiveness of the data-usage detection of RepoMark method under different FDR guarantees 1%, 2%, 5%, 10%, 20%, and the results are shown in Figure 4. We use default hyperparameter settings for these experiments. On the CodeParrot dataset, with FDR guarantee 5%, RepoMark achieves 92.44%, 98.64%, and 93.58% DSR on Qwen2.5-Coder-1.5B, StarCoder2-3B, and InCoder-6B, respectively. These results show that we can detect more than 90% code training usage if these models are trained on our marked code. The performance of RepoMark is robust across three different datasets. Even under the worst case, it still achieves a DSR higher than under FDR guarantee 5%. These results demonstrate the strong generalizability of RepoMark under different models and datasets, indicating the potential value of deploying RepoMark as a tool in real-world data auditing.
RepoMark outperforms existing baselines. We also compare RepoMark with existing data-usage auditing methods, which can be categorized into two types: (1) membership inference and (2) data marking. For membership inference-based methods, we compare against several prevalent approaches for LLMs, including loss-based membership inference [39], min-k inference [40], zlib-based membership inference [41], and dataset inference [23]. For data marking-based methods, we compare against CodeMark [15] (FSE’23) and Huang et al. [5]. Since Huang et al. [5] do not provide a marking algorithm for the code domain, we adapt their framework by integrating the marking algorithm of RepoMark into their pipeline.
Since previous results demonstrate that our method performs consistently across different datasets and models, we conduct all baseline comparisons on the Qwen2.5-Coder-1.5B model and the CodeParrot dataset due to resource constraints. The comparison results are shown in Table I. RepoMark consistently outperforms all baselines under different configurations. Under 5% FDR guarantee, RepoMark achieves the highest DSR of 92.44%, significantly surpassing all membership inference-based methods and data marking methods. Under the same FDR, the best data marking method except ours can only achieve a DSR of 53.79%, while the best membership inference-based method only achieves a DSR of 18.68%. These methods fail to identify a large proportion of the trained repositories, whereas RepoMark misses fewer than 8% of the repositories used for training.
RepoMark achieves good imperceptibility. We measure RepoMark’s influence on code quality using CodeBLEU, edit distance, and PPL on CodeParrot dataset. According to Table II, RepoMark effectively preserves code quality with minimal changes, offering good marking imperceptibility. It achieves a high CodeBLEU score of 0.963. The change to the original code by RepoMark is minimal, measured by edit distance. On average, only 3.61 tokens are modified per 100 lines of code. The PPL of our marked code is also very close to the PPL of unmarked code. This demonstrates that RepoMark’s modifications are unlikely to be noticed by human inspectors and remain mostly imperceptible to the model trainer.
RepoMark does not impact code LLM training. In addition, we observe that the impact of RepoMark on code LLM training is negligible. We employ the HumanEval dataset [65] to evaluate the performance of Qwen2.5-Coder-1.5B trained on (1) the original CodeParrot dataset and (2) the CodeParrot dataset marked by RepoMark. Following HumanEval [65], we measure the model performance with the Pass@$k$ metric [65], which measures the proportion of problems for which at least one of $k$ independently generated outputs passes the predefined unit tests. According to Table IV, training on the dataset marked by RepoMark has only a negligible influence on model performance as measured by Pass@$k$ at each tested value of $k$. Therefore, the model trainer is very unlikely to observe the existence of our mark via the code model’s capability.
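For reference, Pass@$k$ is commonly computed with the unbiased estimator introduced alongside HumanEval [65]: generate $n \ge k$ outputs per problem, count the $c$ that pass, and estimate $1 - \binom{n-c}{k} / \binom{n}{k}$. The sketch below implements that estimator; whether the experiments here used exactly this estimator is an assumption.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate for one problem.

    n: outputs generated, c: outputs passing the unit tests, k: budget.
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing output
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(results, k):
    """Average over problems; `results` is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

print(mean_pass_at_k([(10, 0), (10, 10)], k=1))  # 0.5
```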
5.3 Ablation study
In this section, we discuss the key factors that influence the detection performance of RepoMark, including training parameters, dataset structure, and hyperparameters of our method.
Impact of $m$. We first study the impact of the number of marking versions $m$ on DSR. As illustrated in the corresponding figure (fig:impact_m), the DSR of our detection method improves as $m$ increases. Under the 5% FDR guarantee, the DSR increases steadily from 53.79% at the smallest tested $m$ to 92.44% at $m = 100$. This is not surprising because when the number of versions increases, each position’s rank delivers more nuanced information, enabling more accurate detection. Another observation is that the effect of increasing $m$ on the detection accuracy gradually diminishes: under the 5% FDR guarantee, the accuracy rises only from 84.44% to 92.44% when $m$ increases from 20 to 100. This indicates that when $m$ is sufficiently large, the information gain from generating more code variants (i.e., larger $m$) is less significant.
Impact of $K$. We further evaluate how the mark sparsity $K$ influences the DSR of RepoMark; the experimental results can be found in the corresponding figure (fig:impact_K). As $K$ increases, the DSR exhibits a gradual decline while RepoMark still remains usable. At the smallest tested $K$, our detection method achieves 100% detection accuracy under all FDR guarantees. With $K$ increasing to 200, the DSR under 5% FDR guarantee is still higher than 70%, indicating that our detection method remains robust across the tested values of $K$. It is worth noting that while using a smaller $K$ can improve detection performance, it also leads to deterioration in code quality. As shown in Table III, $K$ affects edit distance in a nearly proportional way. When $K = 100$, both PPL and CodeBLEU remain quite close to the unmarked case, indicating that code quality is not significantly impacted. In contrast, when $K$ is reduced to 50, the CodeBLEU drops to 0.8059, which indicates a noticeable degradation in code quality. In conclusion, the default $K = 100$ is a proper choice to balance the trade-off between code quality and detection performance.
Impact of repository size. We also evaluate the impact of the repository file number $N$ on the detection performance of RepoMark (in this experiment, we restrict the audited repositories by file number). The results are shown in the corresponding figure (fig:impact_N). The detection accuracy increases with $N$. When the repository file number increases to 40, DSR increases to over 90% even under 1% FDR guarantee. At the largest tested file number, the DSR is 100%, which indicates that RepoMark successfully identifies all repositories used in training under all tested FDR guarantees. This trend aligns with our expectations because a larger repository file number enables more marking positions for detection, thus leading to a higher detection accuracy. These results highlight the effectiveness of RepoMark in auditing commercial-scale code repositories. At the same time, RepoMark also demonstrates strong capabilities in handling repositories with small sizes. Notably, RepoMark can achieve 73.79% DSR for an extremely small repository size under a 5% FDR guarantee, demonstrating its strong capability in protecting the copyright of ordinary users’ code.
5.4 Deployment overhead
In this section, we discuss and explore the deployment overhead of our algorithm.
Marking overhead. Our marking algorithm is highly efficient because the main computational overhead of marking each file lies in a single forward pass of each file through the oracle code LLM, during which full vocabulary logits are computed for all positions. We measured that marking one repository with 20 files takes 15.6 seconds on average.
Storage overhead. The storage overhead of our algorithm is also minimal. For detection purposes, we only need to store the original token (prior to marking) and its position within the file. This information is sufficient to reconstruct the list of alternative tokens at each marked position. Under our default setting of —i.e., at most one mark per 100 lines of code—the average number of marked positions per file is about 3. Each marked position requires 4 bytes of storage: 2 bytes for the token ID and 2 bytes for its position. Therefore, for a large repository with 100 files, the total storage overhead is approximately 1.2KB, which is negligible.
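The 4-byte record can be illustrated with Python's `struct` module. The little-endian layout and the helper names are assumptions for the example; note that two unsigned bytes per field cap both the token ID and the position at 65,535.

```python
import struct

# Assumed layout: 2-byte unsigned token ID followed by 2-byte unsigned position.
RECORD = struct.Struct("<HH")

def pack_marks(marks):
    """Serialize (token_id, position) pairs into 4 bytes each."""
    return b"".join(RECORD.pack(token_id, pos) for token_id, pos in marks)

def unpack_marks(blob):
    """Inverse of pack_marks."""
    return [RECORD.unpack_from(blob, off) for off in range(0, len(blob), RECORD.size)]

blob = pack_marks([(1234, 17), (42, 60000)])
print(len(blob), unpack_marks(blob))

# 100 files with about 3 marks each, 4 bytes per mark.
print(len(pack_marks([(1, 1)] * (100 * 3))))  # 1200 bytes, i.e. about 1.2KB
```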
Detection overhead. Our detection process primarily involves computing the relative logit ranks of the token variants at each marking position. The logits vector at each marking position can be obtained through a single forward pass of the code LLM. Consequently, the overall detection cost for a repository scales linearly with both the total number of tokens in the target repository and the number of marking positions in each code file.
For instance, a large commercial repository such as the React library [67] contains approximately 4 million tokens. Assuming each code file has, on average, three marking positions, the total cost corresponds to forwarding around 12 million tokens. Given that state-of-the-art LLM APIs (e.g., GPT-5) charge no more than $10^{-5}$ USD per token [68], the total detection cost for a repository of this scale is estimated to be under 120 USD, demonstrating that RepoMark remains both computationally and economically practical even for large projects.
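The cost estimate is a simple product, sketched below. The per-token price of $10^{-5}$ USD is an assumption implied by the 12 million token and 120 USD figures, and the function name is hypothetical.

```python
def detection_cost_usd(repo_tokens, marks_per_file, usd_per_token):
    """Upper-bound cost estimate: one forward pass over the repository's
    tokens for each marking position per file."""
    forwarded_tokens = repo_tokens * marks_per_file
    return forwarded_tokens * usd_per_token

# About 4 million tokens, three marks per file, assumed 1e-5 USD per token.
print(round(detection_cost_usd(4_000_000, 3, 1e-5), 2))
```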
Version control strategy. Since code files in real-world repositories are frequently updated, a version control strategy can be adopted to avoid reinjecting marks after every change. As long as the marked tokens remain unmodified, we simply update the stored marking positions to reflect the changes. New mark injection is only required when previously marked tokens are deleted, or when the user adds a substantial amount of new code that creates additional capacity for marking. In real-world auditing, to reduce detection overhead, we can choose to run the detection algorithm only on each “major version” of the repository—i.e., versions that differ substantially from the previous ones. Our detection algorithm’s FDR guarantee still holds for each individual detection, although the results across different detections may be statistically dependent.
5.5 Potential countermeasures of the model trainer
In this section, we explore several potential countermeasures against our marking algorithm.
Dataset filtering [69, 70, 71, 72] is a class of techniques that were initially proposed to identify and remove backdoored data from the training corpus. Following CodeMark [15], we evaluate two typical dataset filtering strategies, activation clustering [69] and spectral signature [70]. The model trainer uses these methods to remove the code files that are identified as marked from the training dataset. To evaluate the effectiveness of the dataset filtering methods, we compute both the proportion of unmarked code removed from the unmarked dataset and the proportion of marked code removed from the marked dataset. The results are shown in Table V. Both activation clustering and spectral signature are ineffective in removing our mark. For both methods, the removal ratio on marked code is similar to that on unmarked code, indicating that they essentially perform random guessing on which code files to remove from the dataset. Moreover, RepoMark still achieves 78.4% and 86.6% DSR under the two removal strategies, respectively.
We also evaluate variable renaming as an adaptive attack strategy against RepoMark, where the model trainer (attacker) randomly renames a certain proportion of variables to neutralize the embedded marks. The core idea is to aggressively rename variables in the training data to “cover” those renamed by RepoMark. To enhance the attack, the adaptive trainer can leverage an oracle code LLM to select variables for renaming that fall within the same logits rank range targeted by our marking algorithm. Importantly, we assume the attacker does not know which oracle model each user employs, as the training dataset consists of code written by a large number of users, each potentially using a different oracle code LLM.
We evaluate this strategy under renaming ratios of 25%, 50%, 75%, and 100%, with detection results presented in Table VI. When the renaming ratio is at or below 75%, the impact on RepoMark’s detection performance remains minimal—across different oracle code LLMs, our system continues to achieve DSR under 5% FDR guarantee. Even when the attacker renames all tokens proposed by the oracle code LLM (i.e., 100% renaming), RepoMark still successfully identifies over 65% of the training repositories. This robustness stems from the variation in logits rankings across different oracle models, making it difficult for an attacker using oracle A to accurately obscure the marks generated by oracle B.
Notably, such aggressive renaming alters code context and harms code quality and readability. To evaluate the utility loss, we trained the Qwen2.5-Coder-1.5B model with a 100% renaming ratio. The model’s Pass@100 score dropped from 15.24% (no renaming) to 12.78%, indicating that excessive variable renaming impairs model training. This poses a significant challenge for attackers: without knowledge of the exact oracle used, it is difficult to suppress RepoMark effectively and maintain model performance simultaneously.
6 Discussion and Limitation
Auditing existing code LLMs. To the best of our knowledge, membership inference is the only existing data-auditing approach that can be applied to already-trained models. However, membership inference achieves substantially lower detection accuracy than data marking methods such as RepoMark, due to a lack of controlled injection before training. Enhancing the performance of membership inference remains an important but orthogonal research direction to ours.
7 Conclusion
The widespread deployment of code LLMs has raised pressing ethical and legal concerns regarding the unauthorized use of open-source repositories. In this paper, we present RepoMark, a proactive and theoretically grounded data-auditing framework tailored to the code domain. Our framework simultaneously achieves semantic preservation, imperceptibility, data efficiency, and a provable FDR guarantee—four properties that have not been satisfied together by any prior work. By generating multiple semantically equivalent code variants and employing a rank-based hypothesis test over model responses, RepoMark can reliably detect whether a deployed code LLM has been trained on a given repository. Comprehensive experiments across diverse datasets and model architectures demonstrate that RepoMark consistently delivers high detection accuracy, substantially outperforming state-of-the-art data-usage auditing baselines, even when the target repository contains only a small number of code files. Moreover, its imperceptible variable-renaming strategy ensures practical robustness against model trainers attempting to remove the marks. By enabling code authors to reliably verify the usage of their repositories, RepoMark contributes to building a more transparent and accountable ecosystem for code LLM development.
References
- [1] “Amazon codewhisperer documentation,” 2025. [Online]. Available: https://docs.aws.amazon.com/codewhisperer/
- [2] “Github copilot emits gpl.” 2025. [Online]. Available: https://codeium.com/blog/copilot-trains-on-gpl-codeium-does-not
- [3] “Cursor,” 2025. [Online]. Available: https://www.cursor.com/
- [4] P. Regulation, “General data protection regulation,” Intouch, vol. 25, pp. 1–5, 2018.
- [5] Z. Huang, N. Z. Gong, and M. K. Reiter, “A general framework for data-use auditing of ml models,” in Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, 2024, pp. 1300–1314.
- [6] G. Copilot, “Subscription plans for github copilot,” 2025. [Online]. Available: https://docs.github.com/en/copilot/about-github-copilot/subscription-plans-for-github-copilot
- [7] “Tabnine ai code assistant,” 2025. [Online]. Available: https://www.tabnine.com/
- [8] “Codeium,” 2025. [Online]. Available: https://codeium.com/
