SWaRL: Safeguard Code Watermarking via Reinforcement Learning
Neusha Javidnia, Ruisi Zhang, Ashish Kundu, Farinaz Koushanfar

TL;DR
SWaRL introduces a reinforcement learning-based framework for embedding robust, verifiable watermarks into code generated by large language models, ensuring integrity, detectability, and resilience against attacks.
Contribution
The paper presents an RL-based co-training approach that uses compiler feedback to preserve functional correctness and a jointly trained confidential verifier to keep the watermark detectable, improving watermark robustness and transferability in code LLMs.
Findings
Achieves high watermark detection accuracy
Maintains full code functionality after watermarking
Resilient against refactoring and adversarial attacks
Abstract
We present SWaRL, a robust and fidelity-preserving watermarking framework designed to protect the intellectual property of code LLMs by embedding unique and verifiable signatures in the generated program. Existing watermarking approaches either rely on handcrafted code transformations or manipulate token generation probabilities at inference time, making them vulnerable to removal attacks or prone to breaking functional correctness. To address these challenges, SWaRL employs a reinforcement learning-based co-training framework that uses compiler feedback for functional correctness and a jointly trained confidential verifier as a reward signal to maintain watermark detectability. Furthermore, SWaRL employs low-rank adaptation (LoRA) during fine-tuning, enabling efficient integration of watermarking behavior and transferability across model updates. Extensive experiments show that SWaRL…
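The reward structure described in the abstract, compiler feedback for correctness plus a verifier score for detectability, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the equal weighting, the use of Python's built-in `compile()` as a stand-in for compiler feedback, and the keyword-based `toy_verifier` are all assumptions made for illustration. In the paper the verifier is a jointly trained confidential model, not a string match.

```python
from typing import Callable


def compiles(source: str) -> bool:
    """Stand-in for compiler feedback: True if `source` parses as Python."""
    try:
        compile(source, "<generated>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False


def combined_reward(
    source: str,
    verifier: Callable[[str], float],
    w_compile: float = 0.5,  # assumed weight, not taken from the paper
    w_verify: float = 0.5,   # assumed weight, not taken from the paper
) -> float:
    """Weighted sum of a compile check and a verifier score clamped to [0, 1]."""
    compile_term = 1.0 if compiles(source) else 0.0
    verify_term = min(1.0, max(0.0, verifier(source)))
    return w_compile * compile_term + w_verify * verify_term


def toy_verifier(source: str) -> float:
    """Hypothetical placeholder: treats a marker identifier as the watermark."""
    return 1.0 if "tmp_acc" in source else 0.0


if __name__ == "__main__":
    for sample in ["tmp_acc = 1 + 2", "x = 1", "tmp_acc = (", "x = ("]:
        print(repr(sample), combined_reward(sample, toy_verifier))
```

In an RL fine-tuning loop, a scalar like this would be computed for each generated program and passed to the policy update. A sample that both compiles and carries the watermark earns the highest reward, so the policy is pushed toward satisfying both objectives at once.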
