Hiding in Plain Sight: Detectability-Aware Antidistillation of Reasoning Models
Max Hartman, Vidhata Jayaraman, Moulik Choraria, Yash Savani, Lav R. Varshney

TL;DR
This paper introduces TraceGuard, a detectability-aware antidistillation method that poisons critical reasoning components called thought anchors to hinder model distillation while maintaining output trustworthiness.
Contribution
The paper formulates antidistillation as a Stackelberg game with explicit detectability constraints and identifies thought anchors as minimally detectable targets for poisoning.
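One generic way to write such a game is as a bilevel program, sketched below. This is an illustration only and not the paper's formulation: every symbol is an assumption introduced here. The teacher (leader) picks a perturbation policy $\pi$ over reasoning traces, the student (follower) best-responds by fitting parameters $\theta$ to the perturbed traces $\tilde r$, and the hypothetical budgets $\epsilon$, $\delta_{\text{sem}}$, $\delta_{\text{syn}}$ bound the teacher's utility loss and the semantic and syntactic detectability of the perturbation.

```latex
% Illustrative bilevel (Stackelberg) sketch; all symbols are assumptions.
\begin{aligned}
\max_{\pi} \quad & \mathcal{L}_{\text{eval}}\bigl(\theta^{*}(\pi)\bigr)
  && \text{(degrade the distilled student)} \\
\text{s.t.} \quad
  & \theta^{*}(\pi) \in \arg\min_{\theta}\;
    \mathbb{E}_{x,\; \tilde r \sim \pi(\cdot \mid x)}
    \bigl[\ell(\theta;\, x, \tilde r)\bigr]
  && \text{(student best response)} \\
  & U_{\text{teacher}}(\pi) \ge U_{0} - \epsilon
  && \text{(preserve teacher performance)} \\
  & D_{\text{sem}}(\pi) \le \delta_{\text{sem}}, \qquad
    D_{\text{syn}}(\pi) \le \delta_{\text{syn}}
  && \text{(detectability constraints)}
\end{aligned}
```

Under this reading, poisoning only a few sentences is one way to keep $D_{\text{sem}}$ and $D_{\text{syn}}$ small while still moving the leader's objective.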
Findings
TraceGuard effectively degrades student model distillation.
Poisoning thought anchors reduces detectability compared to full trace poisoning.
The method preserves the coherence of reasoning traces.
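The idea of targeting only high-influence sentences can be made concrete with a toy sketch. The abstract describes thought anchors as sentences with disproportionate counterfactual influence on model outputs; the code below approximates that by deleting one sentence at a time and measuring the change in a score. The deletion-based measure, the function names, and `toy_score` (a stand-in for a real model) are all assumptions made for illustration and are not the paper's procedure.

```python
from typing import Callable, List, Sequence


def counterfactual_influence(
    sentences: Sequence[str],
    score: Callable[[Sequence[str]], float],
) -> List[float]:
    """Influence of each sentence = |score(full trace) - score(trace without it)|."""
    base = score(sentences)
    influences = []
    for i in range(len(sentences)):
        ablated = list(sentences[:i]) + list(sentences[i + 1:])
        influences.append(abs(base - score(ablated)))
    return influences


def select_anchors(
    sentences: Sequence[str],
    score: Callable[[Sequence[str]], float],
    k: int = 1,
) -> List[int]:
    """Return the indices of the k most influential sentences, in trace order."""
    infl = counterfactual_influence(sentences, score)
    ranked = sorted(range(len(sentences)), key=lambda i: infl[i], reverse=True)
    return sorted(ranked[:k])


def toy_score(sentences: Sequence[str]) -> float:
    # Hypothetical stand-in for a model: 1.0 if the key derivation step
    # is present in the trace, 0.0 otherwise.
    return 1.0 if any("x = 4" in s for s in sentences) else 0.0


trace = [
    "Let me restate the problem.",
    "Solving 2x = 8 gives x = 4.",
    "So the answer is 4.",
]
anchors = select_anchors(trace, toy_score, k=1)
print(anchors)  # [1]
```

In this toy trace only the middle sentence changes the score when removed, so it is the single candidate a sparse perturbation would target, leaving the other sentences untouched.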
Abstract
Distillation via sampling reasoning traces exposes closed-source frontier models to adversarial third parties who can bypass their guardrails and misappropriate their capabilities. Antidistillation methods aim to address this by poisoning reasoning traces to hinder student model learning while preserving teacher performance. However, current methods overlook detectability, both semantic and syntactic, which erodes trust in the teacher's outputs and signals the defense's presence to adversaries. We address this gap by formulating antidistillation as a Stackelberg game whose constraint set explicitly encodes detectability, and show that perturbing sparingly offers an effective, less detectable alternative to poisoning the full trace. Drawing on mechanistic interpretability, we identify thought anchors, sentences with disproportionate counterfactual influence on model outputs, as a…
