Does Your Neural Code Completion Model Use My Code? A Membership Inference Approach

Yao Wan; Guanghua Wan; Shijie Zhang; Hongyu Zhang; Pan Zhou; Hai Jin; Lichao Sun

arXiv:2404.14296 · cs.SE · September 10, 2024 · 1 citation

TL;DR

This paper introduces CodeMI, a membership inference method adapted for neural code completion models, revealing that some models leak training data membership, raising legal and ethical concerns.

Contribution

It develops a novel membership inference approach for black-box code completion models and evaluates its effectiveness across various architectures.

Findings

01. LSTM-based and CodeGPT models are vulnerable to membership inference.

02. Large models like CodeGen and StarCoder show lower membership leakage.

03. The study links model memorization to membership inference vulnerability.

Abstract

Recent years have witnessed significant progress in developing deep learning-based models for automated code completion. Although using source code from GitHub has been a common practice for training deep learning-based models for code completion, it may raise legal and ethical issues such as copyright infringement. In this paper, we investigate the legal and ethical issues of current neural code completion models by answering the following question: Is my code used to train your neural code completion model? To this end, we tailor a membership inference approach (termed CodeMI), originally crafted for classification tasks, to the more challenging task of code completion. In particular, since the target code completion models perform as opaque black boxes, preventing access to their training data and parameters, we opt to train multiple shadow models to mimic their behavior.…
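The shadow-model idea in the abstract can be illustrated with a toy sketch. The abstract is truncated here, so everything below is an assumption made for illustration rather than CodeMI itself: the "completion model" is a hypothetical bigram counter, the membership feature is its top-1 hit rate on a snippet, and the attack classifier is a simple midpoint threshold fitted on shadow-model outputs. Names such as `train_bigram`, `hit_rate`, and `run_demo` are invented for this example.

```python
import random
from collections import Counter, defaultdict
from statistics import mean

# Toy sizes, chosen only for illustration (not taken from the paper).
VOCAB, SEQ_LEN, N_TRAIN = 500, 20, 30

def sample_corpus(rng, n):
    """Draw n random token sequences standing in for code snippets."""
    return [[rng.randrange(VOCAB) for _ in range(SEQ_LEN)] for _ in range(n)]

def train_bigram(corpus):
    """Hypothetical stand-in for a completion model: next-token counts per previous token."""
    table = defaultdict(Counter)
    for seq in corpus:
        for prev, nxt in zip(seq, seq[1:]):
            table[prev][nxt] += 1
    return table

def hit_rate(model, seq):
    """Assumed membership feature: share of positions where the top-1 completion is correct."""
    hits = 0
    for prev, nxt in zip(seq, seq[1:]):
        counts = model.get(prev)
        if counts and counts.most_common(1)[0][0] == nxt:
            hits += 1
    return hits / (len(seq) - 1)

def run_demo(seed=0, n_shadow=4):
    rng = random.Random(seed)
    # 1) Train shadow models on data whose membership labels we control.
    member_feats, nonmember_feats = [], []
    for _ in range(n_shadow):
        train = sample_corpus(rng, N_TRAIN)
        held_out = sample_corpus(rng, N_TRAIN)
        shadow = train_bigram(train)
        member_feats += [hit_rate(shadow, s) for s in train]
        nonmember_feats += [hit_rate(shadow, s) for s in held_out]
    # 2) Fit the attack "classifier": a midpoint threshold (an assumption of this sketch).
    threshold = (mean(member_feats) + mean(nonmember_feats)) / 2
    # 3) Query the target model as a black box and guess membership from its outputs.
    target_train = sample_corpus(rng, N_TRAIN)
    target_out = sample_corpus(rng, N_TRAIN)
    target = train_bigram(target_train)
    correct = sum(hit_rate(target, s) > threshold for s in target_train)
    correct += sum(hit_rate(target, s) <= threshold for s in target_out)
    return threshold, correct / (2 * N_TRAIN)

if __name__ == "__main__":
    threshold, accuracy = run_demo()
    print(f"threshold={threshold:.3f} attack accuracy={accuracy:.2f}")
```

The toy model memorizes its training sequences heavily, so members and non-members separate easily; a real attack on a neural code completion model would likely need richer features and a learned classifier.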

Code & Models

Repositories

CGCL-codes/naturalcc (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: CodeGen · OPT