Chainwash: Multi-Step Rewriting Attacks on Diffusion Language Model Watermarks
Mohd Ruhul Ameen, Akif Islam, Nadim Mahmud, Md. Ekramul Hamid

TL;DR
This paper demonstrates that multi-step rewriting significantly weakens the detectability of diffusion language model watermarks, posing a challenge for watermark robustness in practical scenarios.
Contribution
The paper presents a systematic study of chained rewriting attacks on diffusion language model watermarks and shows that detectability degrades as rewrites are repeated.
Findings
Watermark detection drops from 87.9% to 4.86% after five rewrites.
A single rewrite reduces detection to 14-41%, depending on the rewriter and style.
Repeated rewrites substantially weaken watermark detectability across models and styles.
Abstract
Statistical watermarking is a common approach for verifying whether text was written by a language model. Most existing schemes assume autoregressive generation, where tokens are produced left to right and contextual hashing is well defined. Diffusion language models generate text by denoising tokens in arbitrary order, so these schemes cannot be applied directly. A recent watermark by Gloaguen et al. addresses this gap for LLaDA 8B Instruct and reports true-positive detection above 99%. This paper studies what happens when watermarked text is rewritten not once but several times. Using the same watermark configuration, 1,605 watermarked completions of about 300 tokens each are produced across five WaterBench domains. Each completion is rewritten by four open-weight language models, from 1.5B to 8B parameters, none of which knows the watermark key. Five rewrite styles are tested:…
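The chained-rewrite evaluation described in the abstract can be sketched as a small harness that measures the detection rate after each hop. This is a minimal illustration under stated assumptions, not the authors' pipeline: `chained_detection_rates`, `toy_rewrite`, and `toy_detect` are hypothetical names, and the toy detector (a threshold on the fraction of marked tokens) merely stands in for the real watermark detector of Gloaguen et al., just as the toy rewriter stands in for an LLM paraphraser.

```python
from typing import Callable, List, Sequence


def chained_detection_rates(
    texts: Sequence[str],
    rewrite: Callable[[str], str],
    detect: Callable[[str], bool],
    hops: int,
) -> List[float]:
    """Return the detection rate at hop 0 (unrewritten) through hop `hops`."""
    if not texts:
        raise ValueError("texts must be non-empty")
    current = list(texts)
    rates = []
    for hop in range(hops + 1):
        flagged = sum(1 for t in current if detect(t))
        rates.append(flagged / len(current))
        if hop < hops:
            # Each hop rewrites the output of the previous hop (the "chain").
            current = [rewrite(t) for t in current]
    return rates


def toy_rewrite(text: str) -> str:
    """Hypothetical rewriter: replaces every second 'wm' token with 'plain'."""
    out, seen = [], 0
    for tok in text.split():
        if tok == "wm":
            out.append("wm" if seen % 2 == 0 else "plain")
            seen += 1
        else:
            out.append(tok)
    return " ".join(out)


def toy_detect(text: str, threshold: float = 0.5) -> bool:
    """Hypothetical detector: flags text whose 'wm' token fraction >= threshold."""
    toks = text.split()
    return bool(toks) and toks.count("wm") / len(toks) >= threshold


# Toy corpus: 'wm' marks a token carrying watermark signal (an assumption
# made only for this illustration).
corpus = ["wm wm wm wm", "wm wm plain plain", "plain plain plain plain"]
rates = chained_detection_rates(corpus, toy_rewrite, toy_detect, hops=2)
print(rates)
```

In a real replication, `rewrite` would call one of the rewriter models with a chosen style prompt and `detect` would run the watermark's statistical test with the secret key; the harness itself only tracks how the flagged fraction changes from hop to hop.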
