TL;DR
This paper presents DOGe, a method that subtly modifies LLM outputs to prevent effective knowledge distillation, thereby protecting proprietary models from imitation while maintaining output quality for legitimate users.
Contribution
DOGe is an output-modification technique that actively defends LLMs against knowledge distillation by fine-tuning only the final layer (the LM head) with an adversarial loss.
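To make the idea of an adversarial loss on the LM head more concrete, here is a minimal sketch of one way such an objective could be combined from per-token losses. It is an illustration under stated assumptions, not the paper's actual formulation: the function name `defensive_loss`, the weight `lam`, and the specific form (teacher task loss minus a weighted proxy-student loss on reasoning tokens only) are all hypothetical. The only details taken from this page are that the loss is adversarial and that a reasoning-aware mask restricts the adversarial pressure to intermediate steps.

```python
from typing import Sequence


def defensive_loss(
    teacher_nll: Sequence[float],
    proxy_nll: Sequence[float],
    reasoning_mask: Sequence[int],
    lam: float = 0.5,
) -> float:
    """Hypothetical per-sequence objective for a defensive teacher.

    teacher_nll    : per-token negative log-likelihood of the teacher on the
                     reference output (keeps the teacher useful).
    proxy_nll      : per-token negative log-likelihood of a proxy student on
                     the same tokens (higher means harder to imitate).
    reasoning_mask : 1 for intermediate reasoning tokens, 0 for answer tokens.
    lam            : assumed trade-off weight between the two terms.
    """
    if not (len(teacher_nll) == len(proxy_nll) == len(reasoning_mask)):
        raise ValueError("all inputs must have the same length")
    if not teacher_nll:
        raise ValueError("inputs must be non-empty")

    # Task term: average teacher loss over every token.
    task_term = sum(teacher_nll) / len(teacher_nll)

    # Adversarial term: average proxy-student loss over reasoning tokens only.
    n_masked = sum(reasoning_mask)
    if n_masked == 0:
        return task_term
    adv_term = sum(p * m for p, m in zip(proxy_nll, reasoning_mask)) / n_masked

    # Minimizing this value lowers the teacher loss while raising the
    # proxy-student loss on the masked (reasoning) positions.
    return task_term - lam * adv_term
```

In this sketch, answer tokens never contribute to the adversarial term, which mirrors the reviewers' description of preserving final-answer correctness while perturbing reasoning paths.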
Findings
Student models distilled from defensive outputs perform significantly worse.
The method preserves the original model's performance for legitimate use.
DOGe is practical and effective in API-based access scenarios.
Abstract
Large Language Models (LLMs) represent substantial intellectual and economic investments, yet their effectiveness can inadvertently facilitate model imitation via knowledge distillation (KD). In practical scenarios, competitors can distill proprietary LLM capabilities by simply observing publicly accessible outputs, akin to reverse-engineering a complex performance by observation alone. Existing protective methods like watermarking only identify imitation post-hoc, while other defenses assume the student model mimics the teacher's internal logits, rendering them ineffective against distillation purely from observed output text. This paper confronts the challenge of actively protecting LLMs within the realistic constraints of API-based access. We introduce an effective and efficient Defensive Output Generation (DOGe) strategy that subtly modifies the output behavior of an LLM. Its…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The reasoning-aware mask is a novel mechanism that isolates adversarial pressure on intermediate steps. This preserves the final answer's correctness while disrupting the distillation of reasoning paths.
2. The parameter-efficient approach of tuning only the LM head is computationally inexpensive and allows for easy deployment, making the defense practical for real-world API-based services.
3. Experiments are comprehensive, testing on multiple teacher and student architectures. The evaluat
### About Method
1. The method for identifying "reasoning tokens" relies on specific output formats or regular expressions, which may limit the method's generalizability to language models that do not produce structured reasoning. The paper should further discuss the robustness of this masking strategy and how it could be applied to models without explicit "Answer:" markers or chain-of-thought structures.
2. The paper's core assumption that a small set of proxy student models can represent th
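The masking concern raised in the review above can be illustrated with a small sketch of a regex-based reasoning mask. This is an assumption-laden toy, not the authors' code: the function name `reasoning_mask`, the whitespace tokenization, and the fallback for outputs without a marker are all hypothetical choices. The only detail taken from the review is that reasoning tokens are identified through output-format cues such as an explicit "Answer:" marker.

```python
import re
from typing import List, Tuple

# Assumed marker pattern; the review only mentions explicit "Answer:" markers.
ANSWER_MARKER = re.compile(r"Answer\s*:", re.IGNORECASE)


def reasoning_mask(text: str) -> List[Tuple[str, bool]]:
    """Label whitespace-separated tokens as reasoning (True) or answer (False).

    Everything before the last "Answer:" marker is treated as reasoning.
    If no marker is found, nothing is labeled as reasoning. That fallback is
    a conservative assumption made for this sketch.
    """
    matches = list(ANSWER_MARKER.finditer(text))
    if not matches:
        return [(tok, False) for tok in text.split()]

    cut = matches[-1].start()
    reasoning_tokens = text[:cut].split()
    answer_tokens = text[cut:].split()
    return [(tok, True) for tok in reasoning_tokens] + [
        (tok, False) for tok in answer_tokens
    ]
```

The fallback branch shows the generalizability issue directly: an output with no recognizable marker yields an all-False mask, so a format-dependent defense would apply no adversarial pressure to it.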
- Clarity of Motivation and Problem Formulation: The paper motivates the growing threat of model extraction via KD, contextualizes limitations of existing watermarking/fingerprinting solutions, and explicitly frames the defense within realistic API-access scenarios (see Introduction, Figure 1).
- Targeted, Practical Defense: DOGe operates solely at the LM head, requiring minimal retraining, and is adaptable for real-world API-based LLM deployments. This circumvents the practical burden of full-
1. Proxy Student Representativeness and Attack Adaptation: The defense relies critically on the assumption (Assumption 4.1) that a small set of proxy students appropriately represent the learning dynamics of all plausible attackers. However, the robustness of DOGe to adaptive attackers—who deploy more diverse, larger, or structurally different student models—is not thoroughly empirically validated. The authors briefly consider students with different vocabularies (Page 6) but do not deeply study
● The problem is genuinely interesting and underexplored. The paper tackles an important but rarely studied question: how to prevent language models from being distilled purely via input-output APIs. This threat model is realistic in the era of widespread LLM deployment, making the problem timely and significant.
● Rather than hiding logits or obfuscating labels, the idea of injecting “reasoning-level” adversarial noise while preserving output usability is clever and practically motivated.
1. Potential degradation in user experience due to verbose or unnatural reasoning: While the final answers remain correct, the reasoning steps produced by DOGe can be overly verbose, logically indirect, or stylistically inconsistent. This may negatively impact the perceived quality and trustworthiness of the model's responses. The paper does not include a user study or human evaluation to assess whether such perturbations remain acceptable to users, nor does it quantify reasoning plausibility us
1. The paper addresses a timely and critical issue in the era of commercial LLMs. As more models provide step-by-step reasoning to enhance transparency and utility, protecting the immense investment behind these models from being easily replicated via model distillation becomes a paramount concern. The proposed research direction is of high practical value to the AI community and industry.
2. The method, as presented, achieves a compelling outcome. The experimental results show a strong "defens
While the proposed method is innovative and the results appear strong, I have two fundamental concerns regarding the methodology and its practical implications: 1. Overfitting Problem: The method's success hinges on whether the teacher model learns a truly generalizable "obfuscation strategy" or simply overfits to generating "confusing CoT + correct answer" pairs for the training data distribution. While the cross-domain results (training on GSM8K, defending on ARC/CSQA) are noted, this is insuf
Taxonomy
Methods: Knowledge Distillation · Linear Layer
