AD-KD: Attribution-Driven Knowledge Distillation for Language Model Compression
Siyue Wu, Hongzhan Chen, Xiaojun Quan, Qifan Wang, Rui Wang

TL;DR
This paper introduces AD-KD, a knowledge distillation method for compressing pre-trained language models. It transfers the teacher's token-level attributions, computed with Integrated Gradients, to the student, and it reports better performance than existing distillation methods on the GLUE benchmark.
Contribution
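It proposes an attribution-driven distillation approach that transfers the teacher's token-level rationale to the student, and extends it with multi-view attribution distillation over all potential decisions of the teacher to improve the transfer of model reasoning and generalization.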
Findings
Outperforms state-of-the-art distillation methods on GLUE benchmark
Effectively transfers token-level attribution knowledge
Enhances model reasoning and generalization
Abstract
Knowledge distillation has attracted a great deal of interest recently to compress pre-trained language models. However, existing knowledge distillation methods suffer from two limitations. First, the student model simply imitates the teacher's behavior while ignoring the underlying reasoning. Second, these methods usually focus on the transfer of sophisticated model-specific knowledge but overlook data-specific knowledge. In this paper, we present a novel attribution-driven knowledge distillation approach, which explores the token-level rationale behind the teacher model based on Integrated Gradients (IG) and transfers attribution knowledge to the student model. To enhance the knowledge transfer of model reasoning and generalization, we further explore multi-view attribution distillation on all potential decisions of the teacher. Comprehensive experiments are conducted with BERT on the…
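The attribution idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the toy logistic "model", the all-zeros baseline, the L2 normalization, and the mean-squared-error loss are all assumptions chosen for illustration, and every function name here is hypothetical. Each entry of x stands in for one token. Integrated Gradients attributes the output to entry i as (x_i - baseline_i) times the average gradient along the straight path from the baseline to x, approximated below with a midpoint Riemann sum.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(w, x):
    """Toy stand-in for a network: F(x) = sigmoid(w . x)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def integrated_gradients(w, x, baseline, steps=50):
    """Midpoint Riemann-sum approximation of Integrated Gradients."""
    n = len(x)
    grad_sum = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight line from the baseline to the input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        p = model(w, point)
        for i in range(n):
            # Analytic gradient of sigmoid(w . x) with respect to x_i.
            grad_sum[i] += p * (1.0 - p) * w[i]
    return [(xi - b) * g / steps for xi, b, g in zip(x, baseline, grad_sum)]


def l2_normalize(v, eps=1e-12):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / (norm + eps) for a in v]


def attribution_loss(teacher_attr, student_attr):
    """Assumed loss: MSE between L2-normalized attribution vectors."""
    t = l2_normalize(teacher_attr)
    s = l2_normalize(student_attr)
    return sum((a - b) ** 2 for a, b in zip(t, s)) / len(t)


# Hypothetical teacher and student weights over three "tokens".
teacher_w = [0.5, -1.0, 2.0]
student_w = [0.4, -0.2, 1.5]
x = [1.0, 2.0, 0.5]
baseline = [0.0, 0.0, 0.0]

teacher_attr = integrated_gradients(teacher_w, x, baseline)
student_attr = integrated_gradients(student_w, x, baseline)
print("teacher attributions:", teacher_attr)
print("student attributions:", student_attr)
print("attribution loss:", attribution_loss(teacher_attr, student_attr))
```

In a distillation setting, a loss like this would be added to the usual training objective so that the student is pushed to rely on the same tokens as the teacher. A multi-view variant in the spirit of the abstract would compute one attribution vector per candidate output of the teacher rather than only for the predicted one.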
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Attention Is All You Need · Linear Layer · WordPiece · Multi-Head Attention · Adam · Linear Warmup With Linear Decay · Softmax · Layer Normalization · Dropout
