Does Your Neural Code Completion Model Use My Code? A Membership Inference Approach
Yao Wan, Guanghua Wan, Shijie Zhang, Hongyu Zhang, Pan Zhou, Hai Jin, Lichao Sun

TL;DR
This paper introduces CodeMI, a membership inference method adapted to neural code completion models. It shows that some models leak whether a given code sample was part of their training data, which raises legal and ethical concerns.
Contribution
The paper develops a membership inference approach for black-box code completion models and evaluates its effectiveness across several model architectures.
Findings
LSTM-based and CodeGPT models are vulnerable to membership inference.
Large models like CodeGen and StarCoder show lower membership leakage.
The study links model memorization to membership inference vulnerability.
Abstract
Recent years have witnessed significant progress in developing deep learning-based models for automated code completion. Although using source code in GitHub has been a common practice for training deep-learning-based models for code completion, it may induce some legal and ethical issues such as copyright infringement. In this paper, we investigate the legal and ethical issues of current neural code completion models by answering the following question: Is my code used to train your neural code completion model? To this end, we tailor a membership inference approach (termed CodeMI) that was originally crafted for classification tasks to a more challenging task of code completion. In particular, since the target code completion models perform as opaque black boxes, preventing access to their training data and parameters, we opt to train multiple shadow models to mimic their behavior.…
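The shadow-model idea described in the abstract can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the CodeMI implementation: the bigram "completion model", the random token sequences, the top-1 hit-rate feature, and the threshold-based attack (`train_bigram`, `hit_rate`, `fit_threshold`, `infer_member`) are all hypothetical stand-ins chosen so the example runs with the standard library only. The attacker trains a shadow model on data it controls, learns how model behavior differs between samples that were in the training set and samples that were not, and then applies that rule to the target model's outputs.

```python
import random
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Toy stand-in for a completion model: next-token counts per previous token."""
    table = defaultdict(Counter)
    for tokens in corpus:
        for prev, nxt in zip(tokens, tokens[1:]):
            table[prev][nxt] += 1
    return table

def hit_rate(model, tokens):
    """Fraction of positions where the model's top-1 completion equals the true next token."""
    pairs = list(zip(tokens, tokens[1:]))
    hits = 0
    for prev, nxt in pairs:
        preds = model.get(prev)  # .get avoids inserting new keys into the defaultdict
        if preds and preds.most_common(1)[0][0] == nxt:
            hits += 1
    return hits / len(pairs)

def make_samples(rng, n, length=20, vocab=500):
    """Synthetic 'code files': random token-id sequences (an assumption for this sketch)."""
    return [[rng.randrange(vocab) for _ in range(length)] for _ in range(n)]

def fit_threshold(member_scores, nonmember_scores):
    """Simple attack model: the score threshold that best separates members from non-members."""
    best_t, best_acc = 0.0, 0.0
    total = len(member_scores) + len(nonmember_scores)
    for t in sorted(set(member_scores + nonmember_scores)):
        correct = sum(s >= t for s in member_scores) + sum(s < t for s in nonmember_scores)
        if correct / total > best_acc:
            best_t, best_acc = t, correct / total
    return best_t

def infer_member(model, tokens, threshold):
    """Predict 'member' when the model completes the sample unusually well."""
    return hit_rate(model, tokens) >= threshold

rng = random.Random(0)

# Step 1: the attacker trains a shadow model on data whose membership it knows.
shadow_in, shadow_out = make_samples(rng, 40), make_samples(rng, 40)
shadow = train_bigram(shadow_in)

# Step 2: fit the attack rule on the shadow model's behavior.
threshold = fit_threshold(
    [hit_rate(shadow, s) for s in shadow_in],
    [hit_rate(shadow, s) for s in shadow_out],
)

# Step 3: apply the rule to the target model, which is queried only as a black box.
target_in, target_out = make_samples(rng, 40), make_samples(rng, 40)
target = train_bigram(target_in)
correct = sum(infer_member(target, s, threshold) for s in target_in)
correct += sum(not infer_member(target, s, threshold) for s in target_out)
accuracy = correct / (len(target_in) + len(target_out))
print(f"attack accuracy on target: {accuracy:.2f}")
```

The toy model memorizes its training sequences heavily, which is why a single threshold separates members from non-members so easily here. Real code completion models generalize far more, so an attack on them would need richer features and a learned classifier rather than one cut-off.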
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: CodeGen · OPT
