EGAD: Entropy-Guided Adaptive Distillation for Token-Level Knowledge Transfer
Hao Zhang, Zhibin Zhang, Guangxin Wu, Wanyi Ning, Jiafeng Guo, Xueqi Cheng

TL;DR
EGAD introduces an entropy-guided adaptive distillation method that dynamically focuses on different tokens during training, improving knowledge transfer efficiency for large language models.
Contribution
The paper proposes a novel entropy-based adaptive distillation strategy that adjusts token focus, temperature, and architecture based on token entropy for improved model compression.
Findings
Distillation efficiency improves on benchmark tasks.
Dynamic token focus improves learning from difficult tokens.
A dual-branch architecture balances logit distillation and feature distillation.
Abstract
Large language models (LLMs) have achieved remarkable performance across diverse domains, yet their enormous computational and memory requirements hinder deployment in resource-constrained environments. Knowledge distillation offers a promising solution by transferring knowledge from a large teacher model to a smaller student model. However, existing distillation methods typically treat all tokens equally, ignoring the fact that different tokens contribute unequally to model decisions. This can lead to inefficient knowledge transfer and reduced learning effectiveness. To address this limitation, we propose an entropy-based adaptive distillation strategy that dynamically adjusts the training process at the token level. Our method leverages the teacher's output entropy to guide three aspects of distillation. Specifically, we introduce a token-level curriculum by dynamically shifting focus…
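As a rough illustration of what entropy-guided token weighting could look like, the sketch below computes the teacher's per-token output entropy and uses it to weight a per-token KL distillation loss. This is not the authors' implementation. The function names, the normalization by log(vocabulary size), and the choice to give higher-entropy tokens more weight are all assumptions made for illustration; the abstract does not specify the weighting rule or the curriculum schedule.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over one token's logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)

def entropy_weighted_kd_loss(teacher_logits, student_logits, temperature=1.0):
    """Hypothetical entropy-weighted distillation loss.

    teacher_logits, student_logits: one list of logits per token.
    Returns (loss, weights), where weights are the teacher entropies
    normalized to [0, 1] by the maximum entropy log(vocab_size).
    ASSUMPTION: higher teacher entropy means more weight on that token.
    """
    max_entropy = math.log(len(teacher_logits[0]))
    weights, losses = [], []
    for t_row, s_row in zip(teacher_logits, student_logits):
        p = softmax(t_row, temperature)
        q = softmax(s_row, temperature)
        weights.append(entropy(p) / max_entropy)
        losses.append(kl_divergence(p, q))
    total_weight = sum(weights)
    if total_weight == 0.0:
        # Every teacher distribution is one-hot: fall back to a plain mean.
        return sum(losses) / len(losses), weights
    loss = sum(w * l for w, l in zip(weights, losses)) / total_weight
    return loss, weights

# Two tokens over a 4-word vocabulary: one uncertain, one confident.
teacher = [[0.0, 0.0, 0.0, 0.0], [10.0, 0.0, 0.0, 0.0]]
student = [[1.0, 0.0, 0.0, 0.0], [2.0, 0.0, 0.0, 0.0]]
loss, weights = entropy_weighted_kd_loss(teacher, student)
```

In this sketch the first token, where the teacher is maximally uncertain, dominates the loss, while the second token, where the teacher is nearly certain, contributes almost nothing. A curriculum of the kind the abstract mentions would presumably change this weighting over the course of training, which the sketch does not attempt.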
