DuCodeMark: Dual-Purpose Code Dataset Watermarking via Style-Aware Watermark-Poison Design
Yuchen Chen, Yuan Xiao, Chunrong Fang, Zhenyu Chen, Baowen Xu

TL;DR
DuCodeMark is a novel, stealthy, and robust dual-purpose watermarking method for code datasets that works across source-code and decompilation tasks, enhancing ownership verification and resisting removal attacks.
Contribution
It introduces a style-aware, dual-purpose watermarking approach that generalizes across code tasks and languages, with a verification method based on statistical testing.
Findings
Achieves strong verifiability with p < 0.05
Maintains high stealthiness with suspicion rate ≤ 0.36
Demonstrates robustness: recall ≤ 0.57, with a performance drop when the watermark is removed
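The summary says verification rests on statistical testing and reports p < 0.05, but it does not name the test. As a minimal sketch only, the following assumes a one-sided Fisher exact test that compares how often a suspect model emits the target style on triggered inputs versus clean inputs. The function name, the choice of test, and the counts in the usage lines are hypothetical illustrations, not the paper's procedure or results.

```python
from math import comb


def fisher_one_sided(hits_trig: int, n_trig: int, hits_clean: int, n_clean: int) -> float:
    """One-sided Fisher exact p-value (hypothetical verification sketch).

    Null hypothesis: the target style is equally likely on triggered and
    clean inputs. Returns P(X >= hits_trig) under the hypergeometric null.
    """
    total_hits = hits_trig + hits_clean
    total = n_trig + n_clean
    denom = comb(total, n_trig)
    upper = min(total_hits, n_trig)
    # Sum the hypergeometric tail from the observed count upward.
    tail = sum(
        comb(total_hits, k) * comb(total - total_hits, n_trig - k)
        for k in range(hits_trig, upper + 1)
    )
    return tail / denom


# Toy counts, invented for illustration only (not results from the paper).
p_watermarked = fisher_one_sided(hits_trig=18, n_trig=20, hits_clean=3, n_clean=20)
p_clean_model = fisher_one_sided(hits_trig=5, n_trig=20, hits_clean=5, n_clean=20)
print(p_watermarked < 0.05, p_clean_model < 0.05)
```

Under this assumption, a dataset owner would claim ownership only when the p-value falls below the chosen significance level, here 0.05 to match the threshold reported above.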
Abstract
The proliferation of large language models for code (CodeLMs) and open-source contributions has heightened concerns over unauthorized use of source code datasets. While watermarking provides a viable protection mechanism by embedding ownership signals, existing methods rely on detectable trigger-target patterns and are limited to source-code tasks, overlooking other scenarios such as decompilation tasks. In this paper, we propose DuCodeMark, a stealthy and robust dual-purpose watermarking method for code datasets that generalizes across both source-code tasks and decompilation tasks. DuCodeMark parses each code sample into an abstract syntax tree (AST), applies language-specific style transformations to construct stealthy trigger-target pairs, and injects repressible poisoned features into a subset of return-typed samples to enhance robustness against watermark removal or evasion. These…
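The abstract describes parsing each sample into an AST and applying language-specific style transformations, without listing the transformations in this excerpt. The sketch below shows what one such rule could look like for Python, using the standard `ast` module (Python 3.9+ for `ast.unparse`). The rule itself (rewriting `x += y` as `x = x + y`) and the names `ExpandAugAssign` and `apply_style` are assumptions chosen for illustration, not the paper's actual trigger or target styles.

```python
import ast


class ExpandAugAssign(ast.NodeTransformer):
    """Hypothetical style rule: rewrite `x += y` as `x = x + y`."""

    def visit_AugAssign(self, node: ast.AugAssign) -> ast.AST:
        self.generic_visit(node)
        # Only rewrite plain names; attribute or subscript targets could
        # evaluate side effects twice if expanded.
        if not isinstance(node.target, ast.Name):
            return node
        new_node = ast.Assign(
            targets=[ast.Name(id=node.target.id, ctx=ast.Store())],
            value=ast.BinOp(
                left=ast.Name(id=node.target.id, ctx=ast.Load()),
                op=node.op,
                right=node.value,
            ),
            type_comment=None,
        )
        return ast.copy_location(new_node, node)


def apply_style(source: str) -> str:
    """Parse source, apply the style rule, and return regenerated code."""
    tree = ExpandAugAssign().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


sample = "def f(x):\n    x += 1\n    return x\n"
print(apply_style(sample))
```

Note that this particular rewrite preserves behavior for immutable values such as integers, but `+=` mutates lists in place while `x = x + y` rebinds the name, so a real transformation would need a check that the rule is semantics-preserving for the samples it touches.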
