DuCodeMark: Dual-Purpose Code Dataset Watermarking via Style-Aware Watermark-Poison Design

Yuchen Chen; Yuan Xiao; Chunrong Fang; Zhenyu Chen; Baowen Xu

arXiv:2604.10611 · cs.CR · April 21, 2026

TL;DR

DuCodeMark is a stealthy, robust dual-purpose watermarking method for code datasets. It works across both source-code and decompilation tasks, supports ownership verification, and resists watermark-removal attacks.

Contribution

The paper introduces a style-aware, dual-purpose watermarking approach that generalizes across code tasks and languages, together with an ownership verification method based on statistical testing.
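To make the idea of verification by statistical testing concrete, here is a minimal sketch under stated assumptions. This listing does not say which test the paper uses, so a one-sided permutation test is swapped in as a stand-in. The function name `permutation_p_value` and the 0/1 "target observed" scores are hypothetical: each score is assumed to record whether a suspect model produced the target style for one input, with or without the trigger.

```python
import random


def permutation_p_value(with_trigger, without_trigger, n_perm=10_000, seed=0):
    """One-sided permutation test (illustrative stand-in, not the paper's test).

    Null hypothesis: the trigger has no effect, so the two groups of scores
    are exchangeable. A small p-value suggests the model responds to the
    trigger, which would be consistent with training on watermarked data.
    """
    rng = random.Random(seed)
    n = len(with_trigger)
    m = len(without_trigger)
    observed = sum(with_trigger) / n - sum(without_trigger) / m

    pooled = list(with_trigger) + list(without_trigger)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = sum(pooled[:n]) / n - sum(pooled[n:]) / m
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical scores: 1 means the target style appeared in the model output.
suspect_with = [1] * 18 + [0] * 2
suspect_without = [1] * 3 + [0] * 17
p = permutation_p_value(suspect_with, suspect_without)
print("ownership signal detected" if p < 0.05 else "no evidence of watermark")
```

The 0.05 threshold mirrors the significance level reported in the findings; the scores above are made-up illustration data, not results from the paper.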

Findings

01. Verifiability: ownership verification reaches statistical significance (p < 0.05).

02. Stealthiness: the suspicion rate stays at or below 0.36.

03. Robustness: recall stays at or below 0.57, and removing the watermark causes a performance drop.

Abstract

The proliferation of large language models for code (CodeLMs) and open-source contributions has heightened concerns over unauthorized use of source code datasets. While watermarking provides a viable protection mechanism by embedding ownership signals, existing methods rely on detectable trigger-target patterns and are limited to source-code tasks, overlooking other scenarios such as decompilation tasks. In this paper, we propose DuCodeMark, a stealthy and robust dual-purpose watermarking method for code datasets that generalizes across both source-code tasks and decompilation tasks. DuCodeMark parses each code sample into an abstract syntax tree (AST), applies language-specific style transformations to construct stealthy trigger-target pairs, and injects repressible poisoned features into a subset of return-typed samples to enhance robustness against watermark removal or evasion. These…
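The AST-based style transformation described in the abstract can be illustrated with a small sketch. The specific rewrite here is an assumption chosen for illustration, not one confirmed by the paper: it flips `if not c: A else: B` into `if c: B else: A`, which changes the code's style without changing its behavior. The names `FlipNegatedIf` and `apply_style` are hypothetical, and the sketch uses Python's standard `ast` module (`ast.unparse` requires Python 3.9 or later).

```python
import ast


class FlipNegatedIf(ast.NodeTransformer):
    """Rewrite `if not c: A else: B` as `if c: B else: A` (hypothetical style rule)."""

    def visit_If(self, node):
        self.generic_visit(node)  # transform nested statements first
        test = node.test
        is_negated = isinstance(test, ast.UnaryOp) and isinstance(test.op, ast.Not)
        if is_negated and node.orelse:
            node.test = test.operand  # drop the `not`
            node.body, node.orelse = node.orelse, node.body  # swap the branches
        return node


def apply_style(source):
    """Parse source into an AST, apply the style rule, and emit code again."""
    tree = FlipNegatedIf().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


original = (
    "def is_empty(xs):\n"
    "    if not xs:\n"
    "        return True\n"
    "    else:\n"
    "        return False\n"
)
styled = apply_style(original)
print(styled)
```

In a watermarking setting, one style choice of this kind could serve as a trigger and another as the paired target; how DuCodeMark selects and pairs its transformations is not specified in this excerpt.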
