Towards Tracing Code Provenance with Code Watermarking
Wei Li, Borui Yang, Yujie Sun, Suyu Chen, Ziyun Song, Liyao Xiang, Xinbing Wang, Chenghu Zhou

TL;DR
This paper introduces CodeMark, a watermarking system for source code that embeds identifiable bit-string patterns into variables while preserving code semantics, making the origin of code easier to trace, especially in the context of large language models.
Contribution
The paper presents a contextual watermarking scheme that uses graph neural networks and a pre-trained code model to improve the naturalness and operational integrity of watermarked code.
Findings
Outperforms state-of-the-art watermarking systems in accuracy and capacity.
Maintains code semantics while embedding watermarks effectively.
Achieves a better balance among watermarking requirements in experiments.
Abstract
Recent advances in large language models have raised wide concern in generating abundant plausible source code without scrutiny, and thus tracing the provenance of code emerges as a critical issue. To solve the issue, we propose CodeMark, a watermarking system that hides bit strings into variables respecting the natural and operational semantics of the code. For naturalness, we novelly introduce a contextual watermarking scheme to generate watermarked variables more coherent in the context atop graph neural networks. Each variable is treated as a node on the graph and the node feature gathers neighborhood (context) information through learning. Watermarks embedded into the features are thus reflected not only by the variables but also by the local contexts. We further introduce a pre-trained model on source code as a teacher to guide more natural variable generation. Throughout the…
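To make the idea of hiding bit strings in variables concrete, here is a minimal toy sketch in Python. It is not CodeMark's method: the paper describes a learned, context-aware scheme built on graph neural networks, whereas this sketch swaps in a fixed, rule-based codebook of name pairs. The names `CODEBOOK`, `embed`, and `extract` are hypothetical and chosen only for illustration. Each pair encodes one bit by which of its two names appears in the code, and renaming is done on the syntax tree so behaviour is unchanged for simple local variables.

```python
import ast

# Hypothetical codebook (an assumption for this sketch, not from the paper):
# each pair is (name encoding bit 0, name encoding bit 1).
CODEBOOK = [("total", "result"), ("index", "position"), ("item", "element")]


def _names(tree):
    """Collect every variable and parameter name used in the tree."""
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            names.add(node.id)
        elif isinstance(node, ast.arg):
            names.add(node.arg)
    return names


def _usable_pairs(tree):
    """A pair can carry a bit only if exactly one of its two names occurs."""
    present = _names(tree)
    return [p for p in CODEBOOK if (p[0] in present) != (p[1] in present)]


class _Renamer(ast.NodeTransformer):
    """Rename variables and parameters according to a fixed mapping."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        self.generic_visit(node)
        node.arg = self.mapping.get(node.arg, node.arg)
        return node


def embed(source, bits):
    """Rewrite usable codebook variables so that they spell out `bits`."""
    tree = ast.parse(source)
    mapping = {}
    for pair, bit in zip(_usable_pairs(tree), bits):
        mapping[pair[0]] = pair[bit]
        mapping[pair[1]] = pair[bit]
    # ast.unparse requires Python 3.9 or newer.
    return ast.unparse(_Renamer(mapping).visit(tree))


def extract(source):
    """Read one bit per usable codebook pair, in codebook order."""
    tree = ast.parse(source)
    present = _names(tree)
    return [int(pair[1] in present) for pair in _usable_pairs(tree)]


SAMPLE = (
    "def f(xs):\n"
    "    total = 0\n"
    "    for item in xs:\n"
    "        total += item\n"
    "    return total\n"
)

marked = embed(SAMPLE, [1, 0])
print(marked)
print(extract(marked))
```

A fixed codebook like this is easy to detect and to strip, and it ignores keyword arguments, attributes, and scoping, which is part of why a context-aware, learned approach is attractive. The sketch only illustrates the basic contract: embedding changes variable names, extraction recovers the bits, and the program computes the same result.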
Taxonomy
Topics: Advanced Malware Detection Techniques · Software Engineering Research · Scientific Computing and Data Management
