SLAM: Structural Linguistic Activation Marking for Language Models

Fabrice Harel-Canada; Amit Sahai

arXiv:2605.05443·cs.CL·May 12, 2026

TL;DR

SLAM introduces a white-box watermarking method for language models that embeds marks into structural linguistic features, achieving high detection accuracy with minimal impact on text quality.

Contribution

SLAM is a novel white-box watermarking scheme that writes the mark into linguistic structure (e.g., voice, tense, clause order) rather than token frequencies, keeping the watermark detectable while preserving text quality.

Findings

01

100% detection accuracy on Gemma-2 2B and 9B

02

Quality cost of only 1-2 reward points, compared to 7.5-11.5 for KGW, EWD, and Unigram

03

Resists word-level edits but is vulnerable to paraphrasing that restructures the text

Abstract

LLM watermarks must be detectable without compromising text quality, yet most existing schemes bias the next-token distribution and pay for detection with measurable quality loss. We present SLAM (Structural Linguistic Activation Marking), a novel white-box watermarking scheme that sidesteps this cost by writing the mark into structural geometry rather than token frequencies: sparse autoencoders identify residual-stream directions encoding linguistic structure (e.g., voice, tense, clause order), and we causally steer those directions at generation time, leaving lexical sampling and semantics unconstrained. On Gemma-2 2B and 9B, SLAM achieves 100% detection accuracy with a quality cost of only 1-2 reward points - compared to 7.5-11.5 for KGW, EWD, and Unigram - with naturalness and diversity preserved at near-unwatermarked levels across both models. The trade-off is a complementary…
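
The steering step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the random `direction` vector stands in for a direction that a sparse autoencoder would supply, the random `residual` matrix stands in for residual-stream activations, and the helper names `steer` and `projection_score` and the strength `alpha` are all hypothetical. The projection score only shows how a shift along a direction could be measured; it is not claimed to be the paper's detector.

```python
import numpy as np


def steer(residual, direction, alpha):
    """Add alpha times the unit-normalised direction to every position's vector."""
    unit = direction / np.linalg.norm(direction)
    return residual + alpha * unit


def projection_score(residual, direction):
    """Mean projection of the residual vectors onto the unit direction."""
    unit = direction / np.linalg.norm(direction)
    return float(np.mean(residual @ unit))


rng = np.random.default_rng(0)
d_model = 64                                  # assumed toy hidden size
direction = rng.normal(size=d_model)          # stand-in for an SAE-derived direction
residual = rng.normal(size=(16, d_model))     # stand-in for 16 token positions

steered = steer(residual, direction, alpha=4.0)
baseline_score = projection_score(residual, direction)
steered_score = projection_score(steered, direction)

# Steering along a unit vector shifts the mean projection by alpha.
print(f"shift along direction: {steered_score - baseline_score:.3f}")
```

Because the added vector has unit length, the mean projection moves by `alpha` while every component orthogonal to the direction is left unchanged, which is the sense in which steering a single direction can leave other properties of the activations alone.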
