Towards Distillation-Resistant Large Language Models: An Information-Theoretic Perspective

Hao Fang; Tianyi Zhang; Tianqu Zhuang; Jiawei Kong; Kuofeng Gao; Bin Chen; Leqi Zheng; Shu-Tao Xia; Ke Xu

arXiv:2602.03396 · cs.CL · May 7, 2026

TL;DR

This paper introduces an information-theoretic approach to defend large language models against knowledge distillation attacks by minimizing the distillation-relevant information in their outputs.

Contribution

It proposes a method based on conditional mutual information (CMI) that transforms model outputs to reduce their vulnerability to distillation without sacrificing task performance.

Findings

01 The method significantly reduces distillation success across multiple LLMs.

02 It preserves task accuracy while decreasing extractability of model knowledge.

03 The approach is effective against various distillation algorithms.

Abstract

Proprietary large language models (LLMs) embody substantial economic value and are generally exposed only as black-box APIs, yet adversaries can still exploit their outputs to extract knowledge via distillation. Existing defenses focus exclusively on text-based distillation, leaving the important logit-based distillation largely unexplored. In this work, we analyze this problem and present an effective solution from an information-theoretic perspective. We characterize distillation-relevant information in teacher outputs using the conditional mutual information (CMI) between teacher logits and input queries conditioned on ground-truth labels. This quantity captures contextual information beneficial for model extraction, motivating us to defend against distillation via CMI minimization. Guided by our theoretical analysis, we propose learning a transformation matrix that purifies the original…
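To make the CMI quantity in the abstract concrete, here is a minimal sketch of one common empirical estimator. It is an assumption for illustration, not the paper's procedure: it treats the teacher's softmax output as a predictive distribution and computes the average KL divergence between each sample's distribution and the mean distribution of its ground-truth class. The names `softmax` and `empirical_cmi` are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def empirical_cmi(logits, labels, eps=1e-12):
    """Estimate CMI between inputs and teacher outputs given the label.

    Assumed estimator (not taken from the paper): for each class y, average
    the teacher's output distributions over samples labelled y, then take the
    mean KL divergence from each sample's distribution to its class average.
    """
    probs = softmax(np.asarray(logits, dtype=float))
    labels = np.asarray(labels)
    kl_per_sample = np.zeros(len(labels))
    for y in np.unique(labels):
        mask = labels == y
        class_mean = probs[mask].mean(axis=0)  # average output for class y
        log_ratio = np.log(probs[mask] + eps) - np.log(class_mean + eps)
        kl_per_sample[mask] = (probs[mask] * log_ratio).sum(axis=1)
    return float(kl_per_sample.mean())


# Toy data: two classes, two samples each (values are made up).
labels = np.array([0, 0, 1, 1])
identical = np.array([[2.0, 0.0, 0.0], [2.0, 0.0, 0.0],
                      [0.0, 3.0, 0.0], [0.0, 3.0, 0.0]])
varied = np.array([[2.0, 0.0, 0.0], [0.5, 1.0, 0.0],
                   [0.0, 3.0, 0.0], [1.0, 1.0, 1.0]])

cmi_identical = empirical_cmi(identical, labels)
cmi_varied = empirical_cmi(varied, labels)
```

Under this estimator, outputs that are identical within each class carry no input-specific information beyond the label, so the estimate is zero; outputs that vary from sample to sample within a class give a positive value. That within-class variation is the kind of contextual signal the abstract describes as useful to a distilling student.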
