Rethinking Selective Knowledge Distillation
Almog Tavor, Itay Ebenspanger, Neil Cnaan, Mor Geva

TL;DR
This paper systematically analyzes selective knowledge distillation in large language models and introduces student-entropy-guided position selection (SE-KD), with extensions to the class and sample axes, to improve efficiency and performance.
Contribution
It disentangles importance signals and selection policies in selective KD, proposing SE-KD and its extensions to enhance efficiency and accuracy in LLM training.
Findings
SE-KD often improves accuracy, downstream task adherence, and memory efficiency over dense distillation.
Extending SE-KD across axes yields further efficiency gains.
The approach reduces wall time by 70%, memory by 18%, and storage by 80% without performance loss.
Abstract
Growing efforts to improve knowledge distillation (KD) in large language models (LLMs) replace dense teacher supervision with selective distillation, which uses a subset of token positions, vocabulary classes, or training samples for supervision. However, it remains unclear which importance signals, selection policies, and their interplay are most effective. In this work, we revisit where and how to distill in autoregressive LLMs. We disentangle selective KD along the position, class, and sample axes and systematically compare importance signals and selection policies. Then, guided by this analysis, we identify underexplored opportunities and introduce student-entropy-guided position selection (SE-KD). Across a suite of benchmarks, SE-KD often improves accuracy, downstream task adherence, and memory efficiency over dense distillation. Extending this approach across the class and sample…
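To make position-axis selection concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the top-k selection policy, the `keep_ratio` parameter, the forward KL objective, and all function names are assumptions made for illustration. The abstract only states that positions are selected using student entropy.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over one position's vocabulary logits.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def entropy(logits):
    # Shannon entropy (in nats) of the softmax distribution.
    logp = log_softmax(logits)
    return -sum(math.exp(lp) * lp for lp in logp)

def select_positions(student_logits, keep_ratio=0.25):
    # Assumed policy: keep the top-k positions ranked by student entropy.
    ents = [entropy(row) for row in student_logits]
    k = max(1, round(keep_ratio * len(ents)))
    ranked = sorted(range(len(ents)), key=lambda i: ents[i], reverse=True)
    return sorted(ranked[:k])

def selective_kd_loss(student_logits, teacher_logits, keep_ratio=0.25):
    # Mean forward KL(teacher || student), computed only at kept positions.
    kept = select_positions(student_logits, keep_ratio)
    total = 0.0
    for i in kept:
        log_p_t = log_softmax(teacher_logits[i])
        log_p_s = log_softmax(student_logits[i])
        total += sum(math.exp(t) * (t - s) for t, s in zip(log_p_t, log_p_s))
    return total / len(kept), kept

# Toy sequence: 4 positions, vocabulary of 3 classes (made-up logits).
student = [[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [0.0, 5.0, 0.0], [1.0, 0.0, 0.0]]
teacher = [[2.0, 0.0, 0.0], [5.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 3.0, 0.0]]
loss, kept = selective_kd_loss(student, teacher, keep_ratio=0.5)
print(kept, loss)  # positions 0 and 3 are kept: the student is least certain there
```

In this sketch, teacher distributions are only needed at the kept positions, which is one plausible source of the efficiency gains described above. A real training loop would compute the same quantities on batched tensors.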
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Advanced Neural Network Applications
