Asking Back: Interaction-Layer Antidistillation Watermarks
Guang Yang, Amir Ghasemian, Fengchen Liu, Zhong Wang, Ninareh Mehrabi, Homa Hosseinmardi

TL;DR
This paper introduces interaction-layer antidistillation watermarks, which use a system prompt to induce behavioral markers in a teacher LLM's responses so that unauthorized knowledge distillation can be detected. The markers transfer at up to 88.9% fidelity and show retention above 66% under paraphrasing attacks.
Contribution
The paper proposes an interaction-layer watermarking method that moves the trace from the teacher's output tokens into its interaction behavior, enabling black-box detection of distillation.
Findings
Behavioral watermarks transfer at up to 88.9% fidelity.
Watermarks remain robust under paraphrasing attacks, with retention above 66%.
Explicit and implicit declarative variants transfer effectively across models.
Abstract
Detecting unauthorized knowledge distillation from a deployed LLM API is hard because the defender controls neither the attacker's training pipeline nor the next-token logits. Existing defenses operate on the teacher's output tokens -- biasing the next-token distribution (green-list watermarks, cryptographic schemes, antidistillation sampling) or rewriting outputs after generation. Recent work shows a paraphrasing attacker can strip these signals without losing the underlying knowledge. We propose interaction-layer antidistillation watermarks, which move the trace one layer higher, into the teacher's interaction behavior: the defender wraps the teacher with a system prompt that intermittently induces a behavioral marker -- an explicit follow-up question, a low-frequency variant, or a declarative restatement. An oblivious distiller inherits the behavior, and the defender audits via…
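The wrapping step described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption for illustration only: the instruction text in `MARKER_INSTRUCTION`, the `wrap_teacher` helper, the `toy_teacher` stand-in for a real model, and the "ends with a question mark" detector. The abstract is cut off before it describes the audit, so `marker_rate` is a simple stand-in for measuring how often a marker appears, not the paper's procedure.

```python
import random
from typing import Callable, List

# Hypothetical instruction; the paper's actual system prompt is not given here.
MARKER_INSTRUCTION = "End your answer with one short follow-up question."

Teacher = Callable[[str, str], str]  # (system_prompt, user_prompt) -> response


def wrap_teacher(teacher: Teacher, rate: float, rng: random.Random) -> Callable[[str], str]:
    """Wrap a teacher so the marker instruction is added on a fraction `rate` of calls."""
    def wrapped(user_prompt: str) -> str:
        # Intermittent induction: only some calls carry the marker instruction.
        system_prompt = MARKER_INSTRUCTION if rng.random() < rate else ""
        return teacher(system_prompt, user_prompt)
    return wrapped


def has_marker(response: str) -> bool:
    """Crude illustrative detector: the response ends with a question."""
    return response.rstrip().endswith("?")


def marker_rate(responses: List[str]) -> float:
    """Fraction of responses that carry the marker (0.0 for an empty list)."""
    if not responses:
        return 0.0
    return sum(has_marker(r) for r in responses) / len(responses)


def toy_teacher(system_prompt: str, user_prompt: str) -> str:
    """Stand-in for a real LLM call: obeys the marker instruction when present."""
    answer = f"Answer to: {user_prompt}."
    if system_prompt:
        answer += " Would you like an example?"
    return answer


if __name__ == "__main__":
    rng = random.Random(0)
    teacher_api = wrap_teacher(toy_teacher, rate=0.3, rng=rng)
    outputs = [teacher_api(f"question {i}") for i in range(200)]
    print(f"marker rate in teacher outputs: {marker_rate(outputs):.2f}")
```

In this toy setup, a student trained on `outputs` without filtering would be expected to reproduce follow-up questions at a similar rate, while an unrelated model would not. Comparing `marker_rate` on a suspect model's responses against a baseline is one plausible way to use such a marker, offered here only as an assumption.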
