AD-KD: Attribution-Driven Knowledge Distillation for Language Model Compression

Siyue Wu; Hongzhan Chen; Xiaojun Quan; Qifan Wang; Rui Wang

arXiv:2305.10010·cs.CL·May 18, 2023·2 cites

TL;DR

This paper introduces AD-KD, a knowledge distillation method for compressing pre-trained language models. It transfers the teacher's token-level attributions, computed with Integrated Gradients, to the student, and reports better results than existing distillation methods on the GLUE benchmark.

Contribution

It proposes a new attribution-driven distillation approach that transfers token-level reasoning from teacher to student models, enhancing model reasoning and generalization.

Findings

01 Outperforms state-of-the-art distillation methods on the GLUE benchmark

02 Effectively transfers token-level attribution knowledge

03 Enhances model reasoning and generalization

Abstract

Knowledge distillation has attracted a great deal of interest recently to compress pre-trained language models. However, existing knowledge distillation methods suffer from two limitations. First, the student model simply imitates the teacher's behavior while ignoring the underlying reasoning. Second, these methods usually focus on the transfer of sophisticated model-specific knowledge but overlook data-specific knowledge. In this paper, we present a novel attribution-driven knowledge distillation approach, which explores the token-level rationale behind the teacher model based on Integrated Gradients (IG) and transfers attribution knowledge to the student model. To enhance the knowledge transfer of model reasoning and generalization, we further explore multi-view attribution distillation on all potential decisions of the teacher. Comprehensive experiments are conducted with BERT on the…
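The attribution step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a toy logistic model in which each input feature stands in for a token, an all-zero baseline, a midpoint Riemann sum to approximate the Integrated Gradients path integral, and a mean squared error between L2-normalized attribution vectors as the distillation term. The names `model`, `integrated_gradients`, and `attribution_mse` are hypothetical, and the paper's actual loss and normalization may differ.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x, w, b):
    """Toy stand-in for a classifier: probability of the positive class."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def model_grad(x, w, b):
    """Analytic gradient of the toy model's output with respect to x."""
    p = model(x, w, b)
    return [p * (1.0 - p) * wi for wi in w]


def integrated_gradients(x, baseline, w, b, steps=200):
    """Approximate IG along the straight path from baseline to x."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        grad = model_grad(point, w, b)
        total = [t + g for t, g in zip(total, grad)]
    # Scale the averaged gradient by the input's distance from the baseline.
    return [(xi - bi) * t / steps for xi, bi, t in zip(x, baseline, total)]


def l2_normalize(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v] if norm > 0 else list(v)


def attribution_mse(teacher_attr, student_attr):
    """Assumed distillation term: MSE between normalized attribution vectors."""
    t = l2_normalize(teacher_attr)
    s = l2_normalize(student_attr)
    return sum((a - c) ** 2 for a, c in zip(t, s)) / len(t)


# Hypothetical teacher and student parameters, for illustration only.
x = [1.0, -2.0, 0.5]
baseline = [0.0, 0.0, 0.0]
teacher_attr = integrated_gradients(x, baseline, [0.8, -0.3, 1.2], 0.1)
student_attr = integrated_gradients(x, baseline, [0.5, -0.1, 0.9], 0.0)
loss = attribution_mse(teacher_attr, student_attr)
```

A useful sanity check on any IG implementation is the completeness property: the attributions should sum, up to discretization error, to the difference between the model's output at the input and at the baseline. The multi-view variant mentioned in the abstract would repeat the attribution for each candidate class of the teacher rather than for a single output, which this binary sketch does not cover.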

Code & Models

Repositories

brucewsy/ad-kd (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Attention Is All You Need · Linear Layer · WordPiece · Multi-Head Attention · Adam · Linear Warmup With Linear Decay · Softmax · Layer Normalization · Dropout