PuzzleMark: Implicit Jigsaw Learning for Robust Code Dataset Watermarking in Neural Code Completion Models

Haocheng Huang; Yuchen Chen; Weisong Sun; Peizhuo Lv; Yuan Xiao; Chunrong Fang; Yang Liu; Xiaofang Zhang

arXiv:2604.27677 · cs.SE · May 1, 2026

TL;DR

PuzzleMark is a watermarking method for code datasets, designed to resist watermark detection and removal attacks while staying inconspicuous, so that dataset owners can authenticate copyright against neural code completion models with a reported 100% verification success rate.

Contribution

PuzzleMark introduces a carrier selection strategy that uses code complexity to pick suitable snippets, reducing the risk of watermark exposure, and a new concatenation pattern for embedding watermarks, which improves robustness.
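A complexity-based carrier selection step could look roughly like the sketch below. This is an illustration only, not the paper's method: the listing does not say which complexity metric PuzzleMark uses or whether more or less complex snippets count as more suitable. The names `complexity_score` and `select_carriers` are hypothetical, the metric is a crude branch-count proxy for cyclomatic complexity, and ranking in descending order is an arbitrary assumption.

```python
import ast

# Node types treated as decision points (a rough cyclomatic-complexity proxy).
# ASSUMPTION: the paper's real complexity metric is not given in this listing.
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.BoolOp,
                 ast.IfExp, ast.comprehension, ast.ExceptHandler)


def complexity_score(source: str) -> int:
    """Return 1 + the number of branching nodes, or -1 if the code does not parse."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return -1
    return 1 + sum(isinstance(node, _BRANCH_NODES) for node in ast.walk(tree))


def select_carriers(snippets, k):
    """Return indices of the k highest-scoring parsable snippets.

    ASSUMPTION: higher complexity means a better carrier; flip the sort
    order if the opposite turns out to hold.
    """
    scored = [(complexity_score(s), i) for i, s in enumerate(snippets)]
    scored = [(c, i) for c, i in scored if c >= 0]   # drop unparsable snippets
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # ties broken by position
    return [i for _, i in scored[:k]]


snippets = [
    "x = 1",
    "def f(a):\n    if a:\n        return 1\n    return 0",
    "for i in range(3):\n    if i and i > 1:\n        pass",
    "def broken(:",
]
carriers = select_carriers(snippets, k=2)
```

Here `carriers` holds the positions of the two snippets with the most branching constructs; the unparsable snippet is skipped rather than scored.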

Findings

01 Achieves a 100% verification success rate and a 0% false positive rate.

02 Exhibits strong imperceptibility, with a suspicious rate ≤ 0.24.

03 Maintains negligible impact on model performance.
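For readers unfamiliar with the first two metrics, the sketch below shows how a verification success rate and a false positive rate are commonly computed from per-model verdicts. The definitions are assumptions, since this listing does not give the paper's exact evaluation protocol: the success rate is taken as the share of models trained on watermarked data that the verifier flags, and the false positive rate as the share of clean models it flags. The function name `verification_rates` is hypothetical.

```python
def verification_rates(watermarked_verdicts, clean_verdicts):
    """Return (verification_success_rate, false_positive_rate).

    Each argument is a sequence of booleans, one per model, where True means
    the verifier reported that the watermark is present.
    ASSUMPTION: these are the conventional definitions, not necessarily the
    exact ones used in the paper.
    """
    if not watermarked_verdicts or not clean_verdicts:
        raise ValueError("both verdict lists must be non-empty")
    vsr = sum(watermarked_verdicts) / len(watermarked_verdicts)
    fpr = sum(clean_verdicts) / len(clean_verdicts)
    return vsr, fpr


# Made-up verdicts for illustration only; these are not results from the paper.
vsr, fpr = verification_rates([True, True, True, False], [False, False, True, False])
```

With these made-up verdicts, three of four watermarked models are detected and one of four clean models is wrongly flagged.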

Abstract

Constructing and curating high-quality code datasets requires significant resources, making them valuable intellectual property. Unfortunately, these datasets currently face severe risks of unauthorized use. Although digital watermarking offers a post hoc mechanism for copyright authentication, existing methods are predominantly based on the co-occurrence pattern, which is not robust and is susceptible to watermark detection and removal attacks. In this paper, we propose PuzzleMark, a robust watermarking method for code datasets. To reduce the risk of watermark exposure, PuzzleMark introduces a carrier selection strategy that leverages code complexity to evaluate the suitability of code snippets as watermark carriers, and selects those with high suitability for watermarking. To enhance the robustness of the watermark, PuzzleMark proposes a novel concatenation pattern to replace the…
