Robust and Secure Code Watermarking for Large Language Models via ML/Crypto Codesign

Ruisi Zhang; Neusha Javidnia; Nojan Sheybani; Farinaz Koushanfar

arXiv:2502.02068·cs.CR·February 11, 2025

Open Access

TL;DR

RoSeMary is a novel ML/Crypto codesign framework that embeds secure, robust watermarks into LLM-generated code, ensuring intellectual property protection without compromising code functionality.

Contribution

It introduces an end-to-end trained watermarking system using CodeT5 and zero-knowledge proofs for secure, high-quality code watermarking in large language models.

Findings

01. High detection accuracy of watermarks
02. Preserves original code functionality
03. Robust against various attacks

Abstract

This paper introduces RoSeMary, the first-of-its-kind ML/Crypto codesign watermarking framework that regulates LLM-generated code to avoid intellectual property rights violations and inappropriate misuse in software development. High-quality watermarks adhering to the detectability-fidelity-robustness tri-objective are limited due to code's low-entropy nature. Moreover, watermark verification often needs to reveal the signature and requires re-encoding a new one for code reuse, potentially compromising the system's usability. To overcome these challenges, RoSeMary obtains high-quality watermarks by training the watermark insertion and extraction modules end-to-end to ensure (i) unaltered watermarked code functionality and (ii) enhanced detectability and robustness, leveraging pre-trained CodeT5 as the insertion backbone to enlarge the code syntactic and variable rename…
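
To make the idea of a functionality-preserving watermark concrete, here is a toy sketch that hides bits in variable names, one of the transformation types the abstract mentions. It is not RoSeMary's method: the paper trains insertion and extraction modules end-to-end on a CodeT5 backbone and adds zero-knowledge proofs, none of which is reproduced here. The codebook, the function names `embed`, `extract`, and `run`, and the sample function are all hypothetical, chosen only for illustration.

```python
import ast  # ast.unparse requires Python 3.9+

# Hypothetical codebook: original name -> (name encoding bit 0, name encoding bit 1).
CODEBOOK = {
    "total": ("total", "acc"),
    "item": ("item", "elem"),
}


class _Renamer(ast.NodeTransformer):
    """Rename variable references according to a fixed mapping."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        node.id = self.mapping.get(node.id, node.id)
        return node


def embed(source, bits):
    """Rename codebook variables so each chosen name encodes one bit."""
    keys = sorted(CODEBOOK)  # fixed order shared by embed and extract
    mapping = {k: CODEBOOK[k][b] for k, b in zip(keys, bits)}
    tree = _Renamer(mapping).visit(ast.parse(source))
    return ast.unparse(tree)


def extract(source):
    """Recover the bits by checking which codebook variant appears."""
    names = {n.id for n in ast.walk(ast.parse(source)) if isinstance(n, ast.Name)}
    bits = []
    for k in sorted(CODEBOOK):
        zero, one = CODEBOOK[k]
        if one in names:
            bits.append(1)
        elif zero in names:
            bits.append(0)
    return bits


ORIGINAL = """
def sum_list(values):
    total = 0
    for item in values:
        total += item
    return total
"""


def run(source):
    """Execute the snippet and call sum_list to check behaviour is unchanged."""
    env = {}
    exec(source, env)
    return env["sum_list"]([1, 2, 3])


watermarked = embed(ORIGINAL, [1, 0])
```

In this sketch, `extract(watermarked)` plays the role of detectability and comparing `run(ORIGINAL)` with `run(watermarked)` plays the role of fidelity. Robustness is exactly what the toy lacks: a single further rename erases the bits, which is the kind of weakness a learned, attack-aware scheme is meant to address.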

Taxonomy

Topics: Internet Traffic Analysis and Secure E-voting · Advanced Steganography and Watermarking Techniques · Spam and Phishing Detection

Methods: Gated Linear Unit · Attention Is All You Need · Byte Pair Encoding · Residual Connection · Dense Connections · Linear Layer · Inverse Square Root Schedule · Multi-Head Attention · Softmax · SentencePiece