Watermarking Degrades Alignment in Language Models: Analysis and Mitigation
Apurv Verma, NhatHai Phan, Shubhendu Trivedi

TL;DR
Watermarking language models shifts their alignment, making them either less safe or overly cautious, but a simple sampling method called Alignment Resampling can mitigate these effects and restore model safety and helpfulness.
Contribution
This paper provides the first empirical analysis of how watermarking impacts model alignment and introduces Alignment Resampling as an effective mitigation technique.
Findings
Watermarking induces model-specific shifts in alignment.
Sampling multiple outputs improves alignment performance.
Alignment Resampling restores safety and helpfulness without harming watermark detection.
Abstract
Watermarking has become a practical tool for tracing language model outputs, but at inference time it modifies the token probabilities that alignment training carefully tuned. This creates a tension: how do watermark-induced shifts interact with the procedures intended to make models safe and useful? Experiments on several contemporary models and two representative watermarking schemes reveal that watermarking induces a nontrivial, patterned, yet model-specific shift in alignment. We see two failure modes: guard attenuation, where models become more helpful but less safe, and guard amplification, where refusals become overly conservative. These effects persist even after controlling for perplexity degradation, pointing to alignment-specific distortions, not just quality loss. We address this with Alignment Resampling (AR), a procedure that samples multiple watermarked outputs and…
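The abstract is cut off before it describes how AR chooses among the sampled outputs, so the following is only a minimal sketch of one plausible reading: draw n watermarked candidates and keep the one that an external scorer rates highest (a best-of-n rule). The selection rule and the names `best_of_n_watermarked`, `generate_watermarked`, `reward`, `make_toy_generator`, and `toy_reward` are all assumptions made for illustration. They are not taken from the paper.

```python
import itertools
from typing import Callable, Iterable


def best_of_n_watermarked(
    prompt: str,
    generate_watermarked: Callable[[str], str],
    reward: Callable[[str, str], float],
    n: int = 4,
) -> str:
    """Draw n watermarked candidates and return the highest-scoring one.

    Assumption: selection is a plain argmax over an external reward score.
    The source abstract is truncated before it states the actual rule.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # Every candidate comes from the (hypothetical) watermarked sampler.
    candidates = [generate_watermarked(prompt) for _ in range(n)]
    # max() returns the first candidate with the top score if there is a tie.
    return max(candidates, key=lambda text: reward(prompt, text))


def make_toy_generator(responses: Iterable[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a watermarked model: cycles fixed replies."""
    cycle = itertools.cycle(list(responses))
    return lambda prompt: next(cycle)


def toy_reward(prompt: str, response: str) -> float:
    """Hypothetical scorer: heavy penalty for flagged text, else length."""
    return -10.0 if "UNSAFE" in response else float(len(response))


if __name__ == "__main__":
    gen = make_toy_generator(["UNSAFE reply", "ok", "a safe, helpful reply"])
    print(best_of_n_watermarked("example prompt", gen, toy_reward, n=3))
```

In this sketch every candidate is produced by the watermarked sampler, so the selected text is itself a watermarked sample. With n = 1 the function reduces to ordinary watermarked generation.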
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Generative Adversarial Networks and Image Synthesis
