DistillGuard: Evaluating Defenses Against LLM Knowledge Distillation
Bo Jiang

TL;DR
This paper introduces DistillGuard, a framework for evaluating output-level defenses against knowledge distillation from large language models, revealing their limited effectiveness across tasks.
Contribution
We propose a systematic evaluation framework and taxonomy for defenses, providing comprehensive insights into their strengths and limitations.
Findings
Most output-level defenses are ineffective against a naive attacker in a same-family distillation setting
Chain-of-thought removal impairs mathematical reasoning significantly
Data poisoning mainly affects conversational fluency
Abstract
Knowledge distillation from proprietary LLM APIs poses a growing threat to model providers, yet defenses against this attack remain fragmented and unevaluated. We present DistillGuard, a framework for systematically evaluating output-level defenses against LLM knowledge distillation. We introduce a taxonomy of three defense categories -- output perturbation, data poisoning, and information throttling -- and evaluate nine defense configurations using a standardized pipeline with Qwen3-14B as teacher and Qwen2.5-7B-Instruct as student across three benchmarks (MATH-500, HumanEval+, MT-Bench). Our results reveal that, in a same-family distillation setting against a naive attacker, most output-level defenses are surprisingly ineffective: paraphrasing-based perturbation barely degrades distilled student quality, and data poisoning primarily impairs conversational fluency while leaving…
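As a rough illustration of the setup the abstract describes, the sketch below models the three defense categories and shows how a defense could sit between the teacher's raw output and the data a naive attacker collects. This is not the DistillGuard code: the names `Defense`, `truncate_words`, `build_distillation_set`, and `toy_teacher` are hypothetical, and treating word truncation as an information-throttling defense is an assumption made only for this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Tuple


class DefenseCategory(Enum):
    # The three categories named in the abstract.
    OUTPUT_PERTURBATION = "output perturbation"
    DATA_POISONING = "data poisoning"
    INFORMATION_THROTTLING = "information throttling"


@dataclass(frozen=True)
class Defense:
    # Hypothetical container: one defense configuration.
    name: str
    category: DefenseCategory
    apply: Callable[[str], str]  # rewrites a single teacher response


def truncate_words(limit: int) -> Callable[[str], str]:
    """Assumed throttling-style defense: keep only the first `limit` words."""
    def _apply(response: str) -> str:
        return " ".join(response.split()[:limit])
    return _apply


def build_distillation_set(
    prompts: List[str],
    teacher: Callable[[str], str],
    defense: Defense,
) -> List[Tuple[str, str]]:
    """Collect the (prompt, defended response) pairs an attacker would train on."""
    return [(p, defense.apply(teacher(p))) for p in prompts]


def toy_teacher(prompt: str) -> str:
    # Stand-in for a real teacher model API call.
    return f"Step one. Step two. Final answer for {prompt}."


if __name__ == "__main__":
    defense = Defense(
        name="truncate-3",
        category=DefenseCategory.INFORMATION_THROTTLING,
        apply=truncate_words(3),
    )
    for prompt, response in build_distillation_set(["q1", "q2"], toy_teacher, defense):
        print(prompt, "->", response)
```

In a full evaluation, the student would be fine-tuned on the defended pairs and scored on each benchmark; that step is omitted here because it depends on the training stack.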
Taxonomy
Topics: Teaching and Learning Programming · Intelligent Tutoring Systems and Adaptive Learning · Web Application Security Vulnerabilities
