SAEMark: Steering Personalized Multilingual LLM Watermarks with Sparse Autoencoders
Zhuohao Yu, Xingru Jiang, Weizheng Gu, Yidong Wang, Qingsong Wen, Shikun Zhang, Wei Ye

TL;DR
SAEMark introduces a post-hoc, feature-based watermarking method for multilingual LLM text that preserves text quality, works without white-box model access or training, and achieves high detection accuracy across multiple datasets.
Contribution
It proposes a novel, inference-time watermarking framework using feature-based rejection sampling with Sparse Autoencoders, enabling scalable, multilingual, and high-quality watermarking without model modification.
Findings
Achieves 99.7% F1 score on English datasets.
Demonstrates effective multi-bit detection accuracy across 4 datasets.
Provides theoretical guarantees relating success probability and compute budget.
Abstract
Watermarking LLM-generated text is critical for content attribution and misinformation prevention. However, existing methods compromise text quality and require both white-box model access and logit manipulation. These limitations exclude API-based models and multilingual scenarios. We propose SAEMark, a general framework for post-hoc multi-bit watermarking that embeds personalized messages solely via inference-time, feature-based rejection sampling, without altering model logits or requiring training. Our approach operates on deterministic features extracted from generated text, selecting outputs whose feature statistics align with key-derived targets. This framework naturally generalizes across languages and domains while preserving text quality by sampling LLM outputs instead of modifying them. We provide theoretical guarantees relating watermark success probability and compute budget that…
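
The selection step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: `toy_feature` is a hypothetical stand-in for an SAE-derived feature statistic, `key_target` is an assumed way of deriving a target from a key, and the candidate strings stand in for independently sampled LLM outputs. The `success_probability` helper is the standard independent-trials formula, shown only to make the link between compute budget and success concrete; it is not claimed to be the paper's theorem.

```python
import hashlib


def toy_feature(text: str) -> float:
    """Hypothetical stand-in for an SAE feature statistic.

    Returns a deterministic score in [0, 1]: the fraction of words
    with an even number of characters.
    """
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(len(w) % 2 == 0 for w in words) / len(words)


def key_target(key: str, position: int) -> float:
    """Derive a pseudo-random target in [0, 1) from a key and a position (assumed scheme)."""
    digest = hashlib.sha256(f"{key}:{position}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def select_candidate(candidates: list[str], key: str, position: int) -> str:
    """Keep the candidate whose feature is closest to the key-derived target.

    The candidates are left unmodified; only the choice among them carries the signal.
    """
    target = key_target(key, position)
    return min(candidates, key=lambda c: abs(toy_feature(c) - target))


def success_probability(p: float, n: int) -> float:
    """Chance that at least one of n independent candidates hits the target,
    if each one hits with probability p (an idealized assumption)."""
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    samples = [
        "The model wrote a short reply.",
        "A brief answer came from the model.",
        "Here is what the system produced today.",
    ]
    chosen = select_candidate(samples, key="user-123", position=0)
    print(chosen)
    print(success_probability(0.5, 2))
```

Under the independence assumption, drawing more candidates per segment raises the chance of finding one near the target, which is the kind of trade-off between compute budget and success probability that the abstract refers to. A detector holding the same key could recompute `key_target` and check how closely the features of a text track it.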
Taxonomy
Topics: Advanced Steganography and Watermarking Techniques · Digital Media Forensic Detection · Handwritten Text Recognition Techniques
