SAEMark: Steering Personalized Multilingual LLM Watermarks with Sparse Autoencoders

Zhuohao Yu; Xingru Jiang; Weizheng Gu; Yidong Wang; Qingsong Wen; Shikun Zhang; Wei Ye

arXiv:2508.08211·cs.CL·January 13, 2026

TL;DR

SAEMark introduces a post-hoc, feature-based watermarking method for multilingual LLM output that preserves text quality, needs neither white-box model access nor training, and achieves high detection accuracy across multiple datasets.

Contribution

It proposes a novel, inference-time watermarking framework using feature-based rejection sampling with Sparse Autoencoders, enabling scalable, multilingual, and high-quality watermarking without model modification.

Findings

01. Achieves 99.7% F1 score on English datasets.

02. Demonstrates effective multi-bit detection accuracy across 4 datasets.

03. Provides theoretical guarantees relating success probability and compute budget.
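
To make the link between success probability and compute budget concrete, here is a generic best-of-N calculation. It is not the paper's theorem: it assumes (our simplification) that each of N sampled candidates independently matches the key-derived target with probability p, and the symbols p, N, and δ are introduced here for illustration only.

```latex
P_{\text{success}} = 1 - (1 - p)^{N},
\qquad
N \ge \frac{\ln \delta}{\ln(1 - p)}
\;\Longrightarrow\;
P_{\text{success}} \ge 1 - \delta .
```

Under this assumption, p = 0.1 and δ = 0.01 give N ≥ 44 candidates, since 0.9^44 ≈ 0.0097.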

Abstract

Watermarking LLM-generated text is critical for content attribution and misinformation prevention. However, existing methods compromise text quality and require white-box model access and logit manipulation. These limitations exclude API-based models and multilingual scenarios. We propose SAEMark, a general framework for post-hoc multi-bit watermarking that embeds personalized messages solely via inference-time, feature-based rejection sampling without altering model logits or requiring training. Our approach operates on deterministic features extracted from generated text, selecting outputs whose feature statistics align with key-derived targets. This framework naturally generalizes across languages and domains while preserving text quality by sampling LLM outputs rather than modifying them. We provide theoretical guarantees relating watermark success probability and compute budget that…
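
The selection step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_feature` is a stand-in for the Sparse Autoencoder features, the HMAC-SHA256 target derivation in `key_target` is our own choice, and every function name and parameter here (`select`, `watermark`, `deviation`, `tol`) is hypothetical.

```python
import hashlib
import hmac
from typing import Callable, List, Sequence


def toy_feature(text: str) -> float:
    """Stand-in deterministic feature in [0, 1]: fraction of vowels among letters.

    Assumption: the paper derives features from a Sparse Autoencoder;
    this toy statistic only keeps the sketch self-contained.
    """
    letters = [c for c in text.lower() if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c in "aeiou" for c in letters) / len(letters)


def key_target(key: bytes, index: int, lo: float = 0.25, hi: float = 0.55) -> float:
    """Derive a pseudo-random target in [lo, hi) for unit `index` from the key.

    Assumption: HMAC-SHA256 is our choice of keyed function, not the paper's.
    """
    digest = hmac.new(key, str(index).encode(), hashlib.sha256).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return lo + u * (hi - lo)


def select(candidates: Sequence[str], target: float, tol: float = 0.02,
           feature: Callable[[str], float] = toy_feature) -> str:
    """Rejection step: accept the first candidate within `tol` of the target.

    If every candidate is rejected, fall back to the closest one so the
    caller always gets some text back.
    """
    for cand in candidates:
        if abs(feature(cand) - target) <= tol:
            return cand
    return min(candidates, key=lambda c: abs(feature(c) - target))


def watermark(key: bytes, sample_candidates: Callable[[int], Sequence[str]],
              n_units: int) -> List[str]:
    """Build a watermarked text unit by unit from unmodified model samples.

    `sample_candidates(i)` is a hypothetical hook that would call an LLM API
    several times for unit i; no logits are touched.
    """
    return [select(sample_candidates(i), key_target(key, i)) for i in range(n_units)]


def deviation(key: bytes, units: Sequence[str],
              feature: Callable[[str], float] = toy_feature) -> float:
    """Detection statistic: mean |feature - target|; smaller suggests this key."""
    total = sum(abs(feature(u) - key_target(key, i)) for i, u in enumerate(units))
    return total / len(units)
```

A detector in this sketch would recompute `deviation` for each candidate key and flag the text when the value falls below a calibrated threshold; how that threshold is set, and how multi-bit messages map to keys, is left open here.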

Taxonomy

Topics: Advanced Steganography and Watermarking Techniques · Digital Media Forensic Detection · Handwritten Text Recognition Techniques