How Do Semantically Equivalent Code Transformations Impact Membership Inference on LLMs for Code?
Hua Yang, Alejandro Velasco, Thanh Le-Cong, Md Nazmul Haque, Bowen Xu, Denys Poshyvanyk

TL;DR
This paper investigates how semantically equivalent code transformations can undermine membership inference techniques used to detect unauthorized code usage in large language models for code, revealing a significant loophole in license compliance enforcement.
Contribution
It systematically analyzes the impact of code transformations on MI effectiveness, identifying variable renaming as a key method to evade detection and highlighting limitations of current MI defenses.
Findings
Model accuracy drops by only 1.5% in the worst case for each transformation rule.
Variable renaming reduces MI success by 10.19%.
Combining transformations does not further weaken MI detection.
Abstract
The success of large language models for code relies on vast amounts of code data, including public open-source repositories, such as GitHub, and private, confidential code from companies. This raises concerns about intellectual property compliance and the potential unauthorized use of license-restricted code. While membership inference (MI) techniques have been proposed to detect such unauthorized usage, their effectiveness can be undermined by semantically equivalent code transformation techniques, which modify code syntax while preserving semantics. In this work, we systematically investigate whether semantically equivalent code transformation rules might be leveraged to evade MI detection. The results reveal that model accuracy drops by only 1.5% in the worst case for each rule, demonstrating that transformed datasets can effectively serve as substitutes for fine-tuning.…
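To make the idea of a semantically equivalent transformation concrete, the following is a minimal sketch of variable renaming using Python's standard `ast` module. It is an illustration only, not the authors' tooling: the function name `rename_variables`, the `var_N` naming scheme, and the sample `mean` function are all assumptions introduced here, and the paper's target language and transformation rules may differ. The sketch handles simple functions only; keyword arguments at call sites, `global`/`nonlocal` declarations, and attribute names are not rewritten.

```python
import ast


class _Renamer(ast.NodeTransformer):
    """Apply a fixed old-name -> new-name mapping to parameters and variables."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_arg(self, node):
        # Function parameters are stored as ast.arg nodes.
        node.arg = self.mapping.get(node.arg, node.arg)
        return node

    def visit_Name(self, node):
        # Variable reads and writes are ast.Name nodes; names that are
        # never assigned (e.g. builtins such as len) are left untouched.
        node.id = self.mapping.get(node.id, node.id)
        return node


def rename_variables(source: str) -> str:
    """Return source with parameters and assigned names replaced by var_N.

    Hypothetical helper for illustration; requires Python 3.9+ for ast.unparse.
    """
    tree = ast.parse(source)

    # Pass 1: collect every parameter and every name that is assigned to.
    mapping = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.arg):
            mapping.setdefault(node.arg, f"var_{len(mapping)}")
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            mapping.setdefault(node.id, f"var_{len(mapping)}")

    # Pass 2: rewrite all occurrences consistently.
    new_tree = _Renamer(mapping).visit(tree)
    return ast.unparse(new_tree)


ORIGINAL = '''
def mean(values):
    total = 0
    for item in values:
        total = total + item
    return total / len(values)
'''

if __name__ == "__main__":
    print(rename_variables(ORIGINAL))
```

The transformed function computes the same result as the original, but its token sequence no longer matches the source text, which is the property the abstract describes as a potential way to evade MI detection.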
Taxonomy
Topics: Software Engineering Research · Advanced Malware Detection Techniques · Intellectual Property and Patents
