DistillGuard: Evaluating Defenses Against LLM Knowledge Distillation

Bo Jiang

arXiv:2603.07835 · cs.CR · March 10, 2026

Open Access

TL;DR

This paper introduces DistillGuard, a framework for evaluating output-level defenses against knowledge distillation from large language models. In a same-family distillation setting against a naive attacker, most of the evaluated defenses prove ineffective.

Contribution

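We propose a systematic evaluation framework and a taxonomy of three defense categories (output perturbation, data poisoning, and information throttling), and use them to evaluate nine defense configurations on three benchmarks (MATH-500, HumanEval+, MT-Bench).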

Findings

01. Most output defenses are ineffective against naive attackers

02. Chain-of-thought removal impairs mathematical reasoning significantly

03. Data poisoning mainly affects conversational fluency

Abstract

Knowledge distillation from proprietary LLM APIs poses a growing threat to model providers, yet defenses against this attack remain fragmented and unevaluated. We present DistillGuard, a framework for systematically evaluating output-level defenses against LLM knowledge distillation. We introduce a taxonomy of three defense categories -- output perturbation, data poisoning, and information throttling -- and evaluate nine defense configurations using a standardized pipeline with Qwen3-14B as teacher and Qwen2.5-7B-Instruct as student across three benchmarks (MATH-500, HumanEval+, MT-Bench). Our results reveal that, in a same-family distillation setting against a naive attacker, most output-level defenses are surprisingly ineffective: paraphrasing-based perturbation barely degrades distilled student quality, and data poisoning primarily impairs conversational fluency while leaving…
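To make "output-level defense" concrete, the sketch below treats a defense as a function applied to each teacher response before it reaches the API caller, who then uses the defended responses as distillation data. It illustrates only one information-throttling defense, chain-of-thought removal. It is a minimal sketch and not the paper's implementation: the `<think>...</think>` delimiter, the `fake_teacher` stub, and the names `strip_chain_of_thought` and `build_distillation_set` are all assumptions introduced here for illustration.

```python
import re
from enum import Enum
from typing import Callable, Dict, List


class DefenseCategory(Enum):
    """The three defense categories named in the abstract."""
    OUTPUT_PERTURBATION = "output perturbation"
    DATA_POISONING = "data poisoning"
    INFORMATION_THROTTLING = "information throttling"


# Assumption: the teacher wraps its reasoning trace in <think>...</think> tags.
_THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def no_defense(response: str) -> str:
    """Baseline: return the teacher response unchanged."""
    return response


def strip_chain_of_thought(response: str) -> str:
    """Information throttling: drop the reasoning trace, keep the final answer."""
    return _THINK_BLOCK.sub("", response).strip()


def build_distillation_set(
    prompts: List[str],
    teacher: Callable[[str], str],
    defense: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Collect (prompt, defended response) pairs, as a naive attacker would."""
    return [{"prompt": p, "response": defense(teacher(p))} for p in prompts]


def fake_teacher(prompt: str) -> str:
    """Hypothetical stand-in for a call to the teacher model's API."""
    return "<think>2 + 2 = 4</think>\nThe answer is 4."


undefended = build_distillation_set(["What is 2 + 2?"], fake_teacher, no_defense)
defended = build_distillation_set(["What is 2 + 2?"], fake_teacher, strip_chain_of_thought)
```

With this stub, the undefended set keeps the full reasoning trace while the defended set contains only the final answer, so a student fine-tuned on the defended set never sees the intermediate steps.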

Taxonomy

Topics: Teaching and Learning Programming · Intelligent Tutoring Systems and Adaptive Learning · Web Application Security Vulnerabilities