Asking Back: Interaction-Layer Antidistillation Watermarks

Guang Yang; Amir Ghasemian; Fengchen Liu; Zhong Wang; Ninareh Mehrabi; Homa Hosseinmardi

arXiv:2605.16462·cs.CR·May 19, 2026

Asking Back: Interaction-Layer Antidistillation Watermarks

Guang Yang, Amir Ghasemian, Fengchen Liu, Zhong Wang, Ninareh Mehrabi, Homa Hosseinmardi

PDF

TL;DR

This paper introduces interaction-layer antidistillation watermarks that embed behavioral markers into LLMs to detect unauthorized knowledge distillation, demonstrating high transfer fidelity and robustness against paraphrasing attacks.

Contribution

It proposes a novel interaction-layer watermarking method that moves the trace into the teacher's behavior, enabling effective black-box detection of distillation.

Findings

01

Behavioral watermarks transfer at up to 88.9% fidelity.

02

Watermarks remain robust under paraphrasing attacks, with retention above 66%.

03

Explicit and implicit declarative variants transfer effectively across models.

Abstract

Detecting unauthorized knowledge distillation from a deployed LLM API is hard because the defender controls neither the attacker's training pipeline nor the next-token logits. Existing defenses operate on the teacher's output tokens -- biasing the next-token distribution (green-list watermarks, cryptographic schemes, antidistillation sampling) or rewriting outputs after generation. Recent work shows a paraphrasing attacker can strip these signals without losing the underlying knowledge. We propose interaction-layer antidistillation watermarks, which move the trace one layer higher, into the teacher's interaction behavior: the defender wraps the teacher with a system prompt that intermittently induces a behavioral marker -- an explicit follow-up question, a low-frequency variant, or a declarative restatement. An oblivious distiller inherits the behavior, and the defender audits via…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.