Watermarking Degrades Alignment in Language Models: Analysis and Mitigation

Apurv Verma; NhatHai Phan; Shubhendu Trivedi

arXiv:2506.04462 · cs.CL · February 25, 2026

TL;DR

Watermarking language model outputs can degrade alignment, but Alignment Resampling, a simple inference-time sampling method, mitigates these effects and restores model safety and helpfulness.

Contribution

This paper provides the first empirical analysis of how watermarking impacts model alignment and introduces Alignment Resampling as an effective mitigation technique.

Findings

01

Watermarking induces model-specific shifts in alignment.

02

Sampling multiple outputs improves alignment performance.

03

Alignment Resampling restores safety and helpfulness without harming watermark detection.

Abstract

Watermarking has become a practical tool for tracing language model outputs, but at inference time it modifies the token probabilities that alignment training carefully tuned. This creates a tension: how do watermark-induced shifts interact with the procedures intended to make models safe and useful? Experiments on several contemporary models and two representative watermarking schemes reveal that watermarking induces a nontrivial, patterned, yet model-specific shift in alignment. We identify two failure modes: guard attenuation, where models become more helpful but less safe, and guard amplification, where refusals become overly conservative. These effects persist even after controlling for perplexity degradation, pointing to alignment-specific distortions rather than mere quality loss. We address this with Alignment Resampling (AR), a procedure that samples multiple watermarked outputs and…
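The abstract is cut off before it states how AR chooses among its samples, so the following is only a rough sketch of one plausible reading: draw several watermarked completions and keep the one that an alignment scorer rates highest (a best-of-n rule). The selection rule, the function names `alignment_resample`, `toy_generate`, and `toy_score`, and the toy data are all assumptions made for illustration; none of them come from the paper or from the dapurv5/alignmark repository.

```python
import random
from typing import Callable, List, Tuple


def alignment_resample(
    prompt: str,
    generate_watermarked: Callable[[str, random.Random], str],
    alignment_score: Callable[[str, str], float],
    n: int = 4,
    seed: int = 0,
) -> Tuple[str, List[float]]:
    """Draw n watermarked samples and return the highest-scoring one.

    ASSUMPTION: best-of-n selection. The source abstract is truncated
    before it describes the actual selection step.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    # Every candidate comes from the (hypothetical) watermarked generator,
    # so whichever one is returned still carries the watermark.
    candidates = [generate_watermarked(prompt, rng) for _ in range(n)]
    scores = [alignment_score(prompt, c) for c in candidates]
    best_index = max(range(n), key=scores.__getitem__)
    return candidates[best_index], scores


# Toy stand-ins (hypothetical) for a watermarked model and a reward model.
TOY_SCORES = {
    "I cannot help with anything.": 0.2,        # overly conservative refusal
    "Sure, here is how to do that safely.": 0.9,
    "Sure, here is something unsafe.": 0.1,     # helpful but unsafe
}


def toy_generate(prompt: str, rng: random.Random) -> str:
    # Stands in for sampling one completion from a watermarked model.
    return rng.choice(list(TOY_SCORES))


def toy_score(prompt: str, response: str) -> float:
    # Stands in for an external alignment or reward scorer.
    return TOY_SCORES[response]


best, scores = alignment_resample(
    "How do I do X?", toy_generate, toy_score, n=4, seed=0
)
print(best, scores)
```

Under this reading, each candidate is produced by the watermarked sampler and only the choice among candidates changes, which is consistent with the finding that AR does not harm watermark detection. A larger `n` trades extra inference cost for a better chance of drawing a well-aligned sample.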

Code & Models

Repositories

dapurv5/alignmark
PyTorch · Official

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Generative Adversarial Networks and Image Synthesis