Towards Zero-Shot Knowledge Distillation for Natural Language Processing

Ahmad Rashid; Vasileios Lioutas; Abbas Ghaddar; Mehdi Rezagholizadeh

arXiv:2012.15495·cs.CL·January 1, 2021

Open Access

TL;DR

This paper introduces a novel zero-shot knowledge distillation method for NLP that enables a student model to learn from a teacher without access to task-specific data, using out-of-domain data and adversarial training.

Contribution

The authors present what is, to the best of their knowledge, the first zero-shot knowledge distillation method for NLP. It combines out-of-domain data and adversarial training to transfer knowledge without task-specific data.

Findings

01

Achieves 75% to 92% of the teacher's classification score (accuracy or F1) on GLUE tasks

02

Compresses models 30 times while maintaining high performance

03

Demonstrates effectiveness across six tasks from the GLUE benchmark

Abstract

Knowledge Distillation (KD) is a common knowledge transfer algorithm used for model compression across a variety of deep-learning-based natural language processing (NLP) solutions. In its regular manifestations, KD requires access to the teacher's training data for knowledge transfer to the student network. However, privacy concerns, data regulations and proprietary reasons may prevent access to such data. We present, to the best of our knowledge, the first work on Zero-Shot Knowledge Distillation for NLP, where the student learns from the much larger teacher without any task-specific data. Our solution combines out-of-domain data and adversarial training to learn the teacher's output distribution. We investigate six tasks from the GLUE benchmark and demonstrate that we can achieve between 75% and 92% of the teacher's classification score (accuracy or F1) while compressing the model 30…
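
To make "learning the teacher's output distribution" concrete, here is a minimal sketch of the generic soft-target distillation loss: the KL divergence between temperature-softened teacher and student class distributions. It is not the paper's adversarial procedure, which this page does not spell out. The function names, the temperature value, and the example logits are all assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw logits into a probability distribution, softened by temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over softened class distributions.

    Generic soft-target distillation loss, used here as an assumed
    stand-in; the adversarial training step from the abstract is not modeled.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical logits for one input sentence on a three-class task.
# In the zero-shot setting the sentence could come from out-of-domain text:
# the loss needs only the teacher's outputs, not task labels.
teacher = [2.0, 0.5, -1.0]
student = [1.0, 0.8, -0.2]
print(kd_loss(teacher, student))
```

The loss is zero when the student's softened distribution matches the teacher's exactly and grows as the two diverge, so minimizing it over many unlabeled inputs pushes the student toward the teacher's behaviour.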

Taxonomy

Topics: Topic Modeling · Adversarial Robustness in Machine Learning · Multimodal Machine Learning Applications

Methods: Knowledge Distillation