Efficient Knowledge Distillation: Empowering Small Language Models with Teacher Model Insights
Mohamad Ballout, Ulf Krumnack, Gunther Heidemann, Kai-Uwe Kühnberger

TL;DR
This paper presents a simple knowledge distillation method that enhances small language models by using influential tokens identified by a large teacher model, improving performance across diverse datasets.
Contribution
Introduces a token-based knowledge distillation approach leveraging teacher model attributions to improve small language model performance.
Findings
Outperforms standard fine-tuning and state-of-the-art distillation methods.
Important tokens often align with ground truth in multiple-choice datasets.
Method is effective across four diverse datasets.
Abstract
Enhancing small language models for real-life application deployment is a significant challenge facing the research community. Due to the difficulties and costs of using large language models, researchers are seeking ways to effectively deploy task-specific small models. In this work, we introduce a simple yet effective knowledge distillation method to improve the performance of small language models. Our approach utilizes a teacher model with approximately 3 billion parameters to identify the most influential tokens in its decision-making process. These tokens are extracted from the input based on their attribution scores relative to the output, using methods like saliency maps. These important tokens are then provided as rationales to a student model, aiming to distill the knowledge of the teacher model. This method has proven to be effective, as demonstrated by testing it on four…
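The token-selection step described in the abstract can be illustrated with a minimal sketch. It assumes the teacher's per-token attribution scores have already been computed (for example by a saliency method) and are passed in as plain floats. The function names `top_k_tokens` and `build_student_target`, the choice of `k`, the ranking by absolute score, and the `rationale: ... answer: ...` target format are all hypothetical choices made for illustration. They are not taken from the paper.

```python
from typing import List, Sequence


def top_k_tokens(tokens: Sequence[str], scores: Sequence[float], k: int = 3) -> List[str]:
    """Return the k tokens with the largest attribution magnitude.

    Assumption: scores[i] is the teacher's attribution for tokens[i].
    Ranking by absolute value is an illustrative choice, not the paper's rule.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    # Rank token positions by attribution magnitude, highest first.
    ranked = sorted(range(len(tokens)), key=lambda i: abs(scores[i]), reverse=True)[:k]
    # Restore the original input order so the rationale reads naturally.
    return [tokens[i] for i in sorted(ranked)]


def build_student_target(rationale_tokens: Sequence[str], label: str) -> str:
    """Combine the selected tokens and the label into one training target string.

    Assumption: the student is trained to emit the rationale, then the answer.
    """
    return f"rationale: {' '.join(rationale_tokens)} answer: {label}"


if __name__ == "__main__":
    # Toy input with made-up scores, purely for demonstration.
    tokens = ["the", "sky", "is", "blue", "today"]
    scores = [0.01, 0.6, 0.05, 0.9, 0.2]
    selected = top_k_tokens(tokens, scores, k=2)
    print(build_student_target(selected, "B"))
```

In practice the scores would come from an attribution library run on the teacher model, and the target string would be tokenized and used as the student's fine-tuning label.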
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · Topic Modeling · Online Learning and Analytics
Methods: Knowledge Distillation
